Can Synthetic Data Improve ML Models in Chemistry?

John Mitchell
Tuesday 10 June 2025

Selin Yucebiyik’s Honours project investigated whether machine learning models in chemistry could be improved by:

  • Increasing the quantity of data
  • Augmenting the dataset with synthetic data

We found that increasing the size of the training set significantly improved the predictive power of the models. However, adding synthetic data tended to make the models worse: the lower quality of the synthetic data points outweighed the benefit of a larger training set.
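
The trade-off described above can be illustrated with a small toy experiment. This is only a sketch under stated assumptions, not the models or datasets used in the project: the data are random numbers rather than chemical descriptors, the model is a simple ridge regression, and the "synthetic" points are assumed to differ from real ones only in having much noisier labels. All names (`make_data`, `fit_ridge`, `rmse`) and all noise levels are hypothetical choices made for illustration, so the outcome is built in by construction and shows the mechanism rather than reproducing any result.

```python
import numpy as np

# Hypothetical "true" relationship between 5 descriptors and a property.
TRUE_WEIGHTS = np.array([1.0, -2.0, 0.5, 0.0, 3.0])


def make_data(n, rng, label_noise=0.1):
    """Draw n toy data points; label_noise controls label quality."""
    X = rng.normal(size=(n, TRUE_WEIGHTS.size))
    y = X @ TRUE_WEIGHTS + rng.normal(scale=label_noise, size=n)
    return X, y


def fit_ridge(X, y, alpha=1e-3):
    """Closed-form ridge regression (no intercept)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)


def rmse(w, X, y):
    """Root-mean-square error of a linear model with weights w."""
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))


rng = np.random.default_rng(0)

# Held-out test set of "real" (low-noise) data.
X_test, y_test = make_data(2000, rng)

# Small and large training sets of real data.
X_small, y_small = make_data(15, rng)
X_large, y_large = make_data(200, rng)

# Assumption: synthetic points have much noisier labels than real ones.
X_syn, y_syn = make_data(185, rng, label_noise=3.0)

# Augmented set: the small real set padded to 200 points with synthetic data.
X_aug = np.vstack([X_small, X_syn])
y_aug = np.concatenate([y_small, y_syn])

rmse_small = rmse(fit_ridge(X_small, y_small), X_test, y_test)
rmse_large = rmse(fit_ridge(X_large, y_large), X_test, y_test)
rmse_aug = rmse(fit_ridge(X_aug, y_aug), X_test, y_test)

print(f"15 real points:                 RMSE = {rmse_small:.3f}")
print(f"200 real points:                RMSE = {rmse_large:.3f}")
print(f"15 real + 185 synthetic points: RMSE = {rmse_aug:.3f}")
```

In this toy setting, the augmented training set is the same size as the large real one, yet its test error is higher than that of the small real set alone, because the noisy labels dominate the fit. Whether real synthetic chemistry data behave this way depends on how the data are generated and how good they are.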