I personally don't take any collected dataset at face value. I go back to the original source for questionable points (e.g., ones that deviate from a trend, are outliers, or regularly show up with large errors in a trained model). Often there are transcription errors (someone dropped a decimal point or the units are wrong), missing or wrong metadata (one study annealed its sample whereas all the other points are as-quenched), or the original experiment is simply untrustworthy (e.g., contaminated samples). Even when a collected dataset is curated by someone with expertise in the field, it shouldn't be blindly trusted. A lot of the early successes in materials informatics came from high-throughput DFT-based datasets because they don't suffer from these issues.
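As a rough illustration of the "deviates from a trend" check, here is a minimal Python sketch, not a description of any particular tool or workflow. It assumes the property is roughly linear in a single variable, fits a robust (Theil-Sen) line, and flags points whose residual has a large modified z-score. The function name, the example values, and the 3.5 cutoff (a common rule of thumb) are all illustrative assumptions. The flagged points are only candidates to look up in the original source, not points to delete automatically.

```python
from itertools import combinations
from statistics import median


def flag_suspect_points(x, y, threshold=3.5):
    """Return indices of points that sit far from a robust linear trend.

    Hypothetical helper for illustration; assumes y is roughly linear in x
    and that x contains at least two distinct values.
    """
    # Theil-Sen slope: the median of all pairwise slopes, so a single
    # bad point cannot drag the fitted trend toward itself.
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))

    # Residuals from the robust line, centered on their median.
    residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
    center = median(residuals)
    deviations = [abs(r - center) for r in residuals]

    # Median absolute deviation (MAD) as a robust measure of scatter.
    mad = median(deviations)
    if mad == 0:
        # No scatter at all in the bulk: anything off the line is suspect.
        return [i for i, d in enumerate(deviations) if d > 0]

    # Modified z-score: 0.6745 * |deviation| / MAD.
    return [i for i, d in enumerate(deviations) if 0.6745 * d / mad > threshold]


# Made-up example: y is roughly 2*x, but the fifth value was entered
# as 99.0 instead of 9.9 (a dropped decimal point).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 8.0, 99.0, 12.1, 13.8, 16.2]
print(flag_suspect_points(x, y))  # flags index 4
```

The same idea carries over to model errors: replace the trend-line residuals with the residuals of a trained model and check the points that keep getting flagged.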
------------------------------
James Saal
Director - External Research Programs
Citrine Informatics
------------------------------