Machine Learning

*Where is chemistry headed with ML and AI?* *Image by MS Copilot.*

A substantial part of computational chemistry involves building mathematical models to analyse data. The Machine Learning (ML) part of our work comprises everything that is not an attempt to model realistically the processes by which the real world actually works. In jargon, this is everything that is not physics-based. Such tasks might firstly be regression, that is, predicting numerical values such as solubilities. Secondly, they might be classification, assigning items such as molecules to classes like “toxic” or “non-toxic”. Both regression and classification are examples of supervised ML, where the model is given the ground truth answers for a training set, learns the relationship between input and optimal output, and then applies this to a test set. Thirdly, they might be clustering, finding patterns in unlabelled data. This last category is an example of unsupervised ML, where the model seeks patterns in data rather than learning from gold standard answers. In our group, we use such models to predict and calculate properties such as solubility, bioactivity and toxicity.
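
To make the three task types concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the single numerical descriptor, the log-solubility values, the toxicity labels and the function names are all invented, and the methods (a least-squares line, a nearest-centroid classifier, a two-cluster k-means) are simply the smallest examples of each task, not the models used in our publications.

```python
# Toy illustration only: all descriptor values and labels are invented.

def fit_line(xs, ys):
    """Supervised regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_centroids(xs, labels):
    """Supervised classification: store the mean descriptor of each class."""
    centroids = {}
    for label in set(labels):
        members = [x for x, lab in zip(xs, labels) if lab == label]
        centroids[label] = sum(members) / len(members)
    return centroids


def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose stored mean is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def two_means(xs, iterations=20):
    """Unsupervised clustering: 1D k-means with k = 2; no labels are used."""
    low, high = min(xs), max(xs)  # initial cluster centres
    assign = []
    for _ in range(iterations):
        assign = [0 if abs(x - low) <= abs(x - high) else 1 for x in xs]
        group0 = [x for x, a in zip(xs, assign) if a == 0]
        group1 = [x for x, a in zip(xs, assign) if a == 1]
        low, high = sum(group0) / len(group0), sum(group1) / len(group1)
    return assign


# Training set: one hypothetical descriptor per molecule, with known answers.
train_x = [1.0, 2.0, 3.0, 4.0]
train_log_s = [-1.1, -1.9, -3.2, -3.8]  # regression targets
train_tox = ["non-toxic", "non-toxic", "toxic", "toxic"]  # class labels

# Regression: learn from the training set, then predict an unseen molecule.
slope, intercept = fit_line(train_x, train_log_s)
predicted_log_s = slope * 5.0 + intercept

# Classification: learn class centroids, then label an unseen molecule.
centroids = fit_centroids(train_x, train_tox)
predicted_class = nearest_centroid_predict(centroids, 3.6)

# Clustering: group unlabelled descriptor values into two clusters.
clusters = two_means([0.9, 1.1, 1.0, 5.0, 5.2, 4.8])

print(predicted_log_s, predicted_class, clusters)
```

The two supervised functions both need the known answers (`train_log_s`, `train_tox`) to fit anything, whereas `two_means` sees only the descriptor values, which is the distinction drawn above.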

Such modelling in fact has a long history in chemistry, dating back to the 19th century. However, for much of that time models were limited to simple linear regressions. In the latter part of the 20th century, the field developed through building QSAR (Quantitative Structure-Activity Relationship) and QSPR (ditto, but now it’s Structure-Property) models with multi-linear regression, and then moved on to non-linear methods. The field was usually known as chemoinformatics (or cheminformatics, being unsure how to spell its own name). In the modern era, the sophistication of the models has increased to a point where it’s more descriptive, and certainly more widely understood, to call these techniques Machine Learning.

Because data sets in chemistry are often smaller than in other subjects, there are many problems for which it is difficult to obtain sufficient high-quality data to train the kinds of large neural networks that have come to dominate in many other areas of science. Instead, simpler ML techniques like the tree-based Random Forest and XGBoost often remain competitive within chemistry.
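
As a rough sketch of what fitting a tree-based model to a small data set looks like in practice, the following uses scikit-learn's `RandomForestRegressor` with five-fold cross-validation. The data are synthetic and stand in for real molecular descriptors and properties; the data set size, the number of descriptors and the hyperparameters are all assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a small chemistry data set (an assumption):
# 120 "molecules", 5 numerical descriptors, and a non-linear target property.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(120, 5))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=120)

# A Random Forest with illustrative, untuned hyperparameters.
model = RandomForestRegressor(n_estimators=200, random_state=0)

# Five-fold cross-validation: each fold is held out once as the test set,
# so every molecule is predicted by a model that never saw it in training.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

print("R^2 per fold:", np.round(scores, 2))
print("mean R^2:", round(float(scores.mean()), 2))
```

Cross-validation matters particularly for small data sets, where a single train/test split can give a misleadingly good or bad score. XGBoost's `XGBRegressor` follows the same scikit-learn estimator interface, so it could be substituted for `model` here if the `xgboost` package is installed.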

Over the last five years or so, ML has become very widely used across multiple areas of chemistry. Many of these uses are in the analysis and handling of experimental data; others enhance electronic structure methods and intermolecular potential energy functions, or carry out chemoinformatics-type property prediction.

What about Artificial Intelligence (AI) – can we define a divide between ML and AI? Probably not a clear one. As Google DeepMind executive Mat Velloso said: “If it is written in Python, it’s probably machine learning. If it is written in PowerPoint, it’s probably AI.” While there’s clearly a substantial overlap between the categories, we tend to refer to souped-up non-linear regression models as ML, but to Large Language Models (LLMs) as AI. Nonetheless, under the lid, LLMs are just large neural networks doing neural network things like optimising weights.

Some of our recent publications in ML

Zheng, T., Mitchell, J. B. O. & Dobson, S. A., Revisiting the application of machine learning approaches in predicting aqueous solubility, ACS Omega, 9, (32), 35209-35222 (2024), https://doi.org/10.1021/acsomega.4c06163

Videla Rodriguez, E. A., Mitchell, J. B. O. & Smith, V. A., A Bayesian network structure learning approach to identify genes associated with stress in spleens of chickens, Scientific Reports, 12, (8), 7482 (2022), https://doi.org/10.1038/s41598-022-11633-7

Mitchell, J. B. O., Three machine learning models for the 2019 Solubility Challenge, ADMET & DMPK, 8, (3), 215-251 (2020), https://doi.org/10.5599/admet.835

Boobier, S., Osbourn, A. & Mitchell, J. B. O., Can human experts predict solubility better than computers?, Journal of Cheminformatics, 9, 63 (2017), https://doi.org/10.1186/s13321-017-0250-y
