I personally don't take any collected dataset at face value. I go back to the original source for any questionable points (e.g., ones that deviate from a trend, are outliers, or regularly show up with large errors in a trained model). Often there are transcription errors (someone dropped a decimal point or the units are wrong), missing or wrong metadata (a study annealed a sample whereas all other points are as-quenched), or an original experiment that is itself untrustworthy (contaminated samples). Even when it's curated by someone with expertise in the field, a collected dataset shouldn't be blindly trusted. A lot of the early successes in materials informatics came from high-throughput DFT-based datasets because they don't suffer from these issues.
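As a rough illustration of the screening step described above, here is a minimal Python sketch that flags points sitting far from a trend so they can be checked against the original source. It is only one possible approach, and everything in it is an assumption made for the example: the function name `flag_suspect_points`, the choice of a robust linear (Theil-Sen) trend, the modified z-score cutoff of 3.5, and the made-up numbers. A flag means "go look at the source", not "delete the point".

```python
from itertools import combinations
from statistics import median


def flag_suspect_points(x, y, threshold=3.5):
    """Return indices of points whose residual from a robust linear trend is unusually large.

    Hypothetical helper for illustration; a real dataset may need a
    nonlinear trend or the residuals of a trained model instead.
    """
    # Theil-Sen trend: the median of all pairwise slopes tolerates a few bad points.
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

    # Modified z-score based on the median absolute deviation (MAD).
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    if mad == 0:
        # More than half the points lie exactly on the trend: flag any that do not.
        return [i for i, r in enumerate(residuals) if r != center]
    return [
        i
        for i, r in enumerate(residuals)
        if 0.6745 * abs(r - center) / mad > threshold
    ]


# Made-up data: index 4 mimics a decimal-point slip (4.0 recorded as 40.0).
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0.1, 1.0, 2.1, 2.9, 40.0, 5.1, 6.0, 6.9]
suspects = flag_suspect_points(x, y)
print("Check these entries against the original source:", suspects)
```

The same idea carries over to the "large errors in a trained model" case by passing model residuals through the MAD step instead of residuals from a fitted line.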
------------------------------
James Saal
Director - External Research Programs
Citrine Informatics
------------------------------
Original Message:
Sent: 02-26-2023 15:10
From: Oscar Suarez
Subject: Data confidence
Regarding data science-assisted computational materials science, how confident are we in existing databases? Who is curating the data? Are these people (or experts) truly qualified for the task? How confident are we in the error-trapping routines for "bad" data?
------------------------------
Oscar Suarez
Professor
UNIVERSITY OF PUERTO RICO-MAYAGUEZ
Mayaguez PR
(787) 464-6739
------------------------------