Demystifying chemoinformatics - 1: Introduction

We live in an era dominated by Artificial Intelligence (AI), where every day our news feeds are flooded with AI-related updates. From AI-assisted cars to virtual assistants that can cheer you up or draft a business plan, AI tools have become integral to many aspects of our lives, including science, especially chemistry—the focus of this blog post.

Though the AI boom is relatively recent, AI methods, combined with informatics and data science, have been applied to chemistry since the 1960s, merging into a discipline known as chemoinformatics [1]. But what makes chemoinformatics so special? Why do major pharmaceutical companies establish chemoinformatics divisions? What drives the growing interest in this field, and where is it applied? This blog post aims to answer these questions.

What is chemoinformatics?

Simply put, "Chemoinformatics is the application of informatics, data science, and AI methods to solve chemical problems." This definition, though simplified, captures the essence of the field. Chemoinformatics involves storing, managing, processing, transforming, and analyzing chemical information to extract valuable knowledge.

You might wonder, "If I can derive knowledge from experimental data, why do I need specialized software?" Indeed, with a few data points, recognizing patterns isn't too challenging, and experiments can determine properties like solubility or toxicity. However, the challenge arises when dealing with tens, hundreds, thousands, or even millions of compounds. Which compounds should you choose? How do you know if they will yield the desired results? Do you need to test thousands of compounds, risking time and resources?

This is where chemoinformatics shines.

The number of small organic molecules that could in principle be synthesized is estimated to exceed 10⁶⁰ [2]! (Let me show you this number explicitly: > 1’000’000’000’000’000’000’000’000’000’000’000’000’000’000’000’000’000’000’000’000.)

Chemoinformatics helps identify patterns and relationships between chemical structures and their properties using machine learning (ML). These relationships form models that predict properties of new compounds. Such models are known as quantitative structure-activity relationship (QSAR) or quantitative structure-property relationship (QSPR) models.
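To make the idea of a QSPR model concrete, here is a deliberately tiny sketch in plain Python. It is a toy under stated assumptions, not how production models are built: the single descriptor (`carbon_count`) and the helper names are made up for illustration, the boiling points are approximate literature values for linear alkanes, and the "machine learning" is just a least-squares line. Real workflows use a cheminformatics toolkit to compute many descriptors and a proper ML library to fit them.

```python
# Toy QSPR sketch: one descriptor (carbon count), one property (boiling point).
# Approximate literature boiling points of linear alkanes, in degrees Celsius.
TRAINING_SET = {
    "CCCC": -0.5,       # butane
    "CCCCC": 36.1,      # pentane
    "CCCCCCC": 98.4,    # heptane
    "CCCCCCCC": 125.6,  # octane
}


def carbon_count(smiles: str) -> int:
    """Hypothetical descriptor: count 'C' characters in a plain alkane SMILES."""
    return smiles.count("C")


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# "Training": learn the structure-property relationship from known compounds.
xs = [carbon_count(smiles) for smiles in TRAINING_SET]
ys = list(TRAINING_SET.values())
slope, intercept = fit_line(xs, ys)


def predict_boiling_point(smiles: str) -> float:
    """Apply the fitted model to a compound that was not in the training set."""
    return slope * carbon_count(smiles) + intercept


hexane_prediction = predict_boiling_point("CCCCCC")
print(f"Predicted boiling point of hexane: {hexane_prediction:.1f} °C")
```

Hexane was held out of the training set, and the prediction lands near 65 °C against an experimental value of roughly 69 °C. That is not bad for a one-descriptor model, and it shows the pattern of every QSAR/QSPR workflow: describe structures numerically, fit a model on known compounds, and predict the unknown ones.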

Central to many chemoinformatics methods, including QSAR modeling, is the molecular similarity principle: similar compounds often have similar properties. This principle underpins the concept of chemical space, an abstract and practically boundless space in which similar compounds occupy neighboring regions. Navigating this space is akin to exploring the vastness of the universe, as Lipinski and Hopkins [3] put it:

“Chemical space can be viewed as being analogous to the cosmological universe in its vastness, with chemical compounds populating space instead of stars.”

— Lipinski and Hopkins, Navigating chemical space for biology and medicine. Nature 432, 855–861 (2004)
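Molecular similarity has to be quantified before it can be used. A common choice is the Tanimoto coefficient, the number of features two molecules share divided by the number of features present in either. The sketch below assumes a crude stand-in for a real fingerprint: the hypothetical `toy_fingerprint` simply collects short substrings of a SMILES string. Real applications compute proper structural fingerprints with a toolkit such as RDKit; only the Tanimoto formula itself carries over.

```python
def toy_fingerprint(smiles: str, max_len: int = 3) -> set[str]:
    """Hypothetical fingerprint: all SMILES substrings of length 1..max_len.

    This is NOT a real chemical fingerprint, only an illustrative feature set.
    """
    return {
        smiles[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(smiles) - n + 1)
    }


def tanimoto(a: set[str], b: set[str]) -> float:
    """Tanimoto coefficient: shared features / features present in either set."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


ethanol = toy_fingerprint("CCO")
propanol = toy_fingerprint("CCCO")
benzene = toy_fingerprint("c1ccccc1")

print(f"ethanol vs propanol: {tanimoto(ethanol, propanol):.2f}")
print(f"ethanol vs benzene:  {tanimoto(ethanol, benzene):.2f}")
```

Even with such a naive feature set, the two alcohols score as close neighbors while ethanol and benzene share nothing, which is the intuition behind placing compounds in a chemical space.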

**Figure 1.** A simplified workflow for training and applying a predictive QSAR/QSPR model.

Comparison to other theoretical chemistry disciplines

Chemoinformatics models should be applied only to data similar to the data they were derived from. These models are not fundamental laws of nature; they are learned inductively from patterns in data. As Varnek and Baskin [4] note:

“Chemoinformatics considers the world too complex to be a priori described by any set of rules. The incompleteness of our knowledge changes the inference paradigm: instead of searching for exact solutions, chemoinformatics applies plausible reasoning quantified by probability theory. The rules (models) in chemoinformatics are not explicitly taken from rigorous physical models but learned inductively from the data. Thus, in inductive learning, the models are the result of generalization of patterns in the data.”

— Varnek and Baskin, Chemoinformatics as a Theoretical Chemistry Discipline. Molecular Informatics 30 (1), 20-32 (2011)
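The advice to apply a model only to compounds that resemble its training data is usually formalized as an applicability domain. One of the simplest variants is a bounding box: a query compound is accepted only if each of its descriptor values lies within the range seen during training. The sketch below assumes two hypothetical toy descriptors (carbon and oxygen counts read naively from SMILES); the function names are illustrative and not taken from any library.

```python
def descriptors(smiles: str) -> tuple[int, int]:
    """Hypothetical toy descriptors: naive counts of 'C' and 'O' characters."""
    return (smiles.count("C"), smiles.count("O"))


def bounding_box(training_smiles):
    """Record the (min, max) of every descriptor over the training set."""
    rows = [descriptors(smiles) for smiles in training_smiles]
    return [(min(column), max(column)) for column in zip(*rows)]


def in_domain(smiles: str, box) -> bool:
    """True only if every descriptor of the query lies inside the training range."""
    return all(
        low <= value <= high
        for value, (low, high) in zip(descriptors(smiles), box)
    )


TRAINING = ["CCCC", "CCCCC", "CCCCCCC", "CCCCCCCC"]  # C4, C5, C7, C8 alkanes
box = bounding_box(TRAINING)

for query in ["CCCCCC", "C", "CCCCCO"]:
    verdict = "inside" if in_domain(query, box) else "outside"
    print(f"{query}: {verdict} the applicability domain")
```

Hexane falls inside the box, while methane (too few carbons) and pentanol (an oxygen the model has never seen) fall outside, so predictions for them should not be trusted. More refined approaches measure distance or similarity to the training compounds, but the principle is the same.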

Quantum chemistry and force field-based molecular modeling are deductive: they apply general physical models to specific molecules. Chemoinformatics is inductive: it generalizes from patterns in data rather than starting from strict physical laws. Quantum chemistry, for instance, describes electrons and nuclei with the Schrödinger equation or with methods such as density functional theory (DFT). Force field (FF) methods pair an empirical potential energy function with classical mechanics to compute potential energies and molecular trajectories.

Each theoretical discipline is important and serves a distinct purpose. Despite their differences, interdisciplinary approaches are emerging. Recent studies, such as those by Satoh et al. [5], explore blending quantum chemistry and chemoinformatics methods to discover new molecules and reactions.

**Figure 2.** Intelligence and process levels involved in deductive and inductive approaches.

Areas of application

The versatility of chemoinformatics makes it invaluable across various domains, saving time, materials, and human resources. It extends beyond single compounds to more complex objects, like mixtures [6] and chemical reactions [7, 8], with applications including:

Virtual screening of millions of compounds to identify the most effective ones
Chemical space visualization to better understand the data distribution
Generation of novel molecules with desired properties
Quality control of experimental data
Chemical library design
Identification of key molecular features affecting properties

**Figure 3.** Human, material, and time resources commonly involved in screening and experiments.

Chemoinformatics is especially prominent in drug discovery, where it significantly accelerates the identification of promising molecules. A recent study [9] found that AI-discovered molecules had an 80–90% success rate in Phase I clinical trials, compared with historical averages of 40–65%. The high cost of preparing compound libraries further underscores how much chemoinformatics can save. According to Goodnow [10], preparing a one-million-compound library for high-throughput screening is estimated to cost between 50 million and 5 billion USD, and testing that library costs roughly another 100’000 to 200’000 USD. Using chemoinformatics to cherry-pick, synthesize, and test only a few hundred compounds from this pool can save a significant portion of those resources.
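To put the library figures quoted from Goodnow [10] into per-compound terms, here is a back-of-the-envelope calculation. It rests on two assumptions that are mine, not the source's: that preparation cost scales linearly with the number of compounds, and that "a few hundred" means 300.

```python
LIBRARY_SIZE = 1_000_000
# Preparation cost range for the full library, in USD (figures quoted from [10]).
PREP_COST_RANGE_USD = (50_000_000, 5_000_000_000)

# Assumption: cost scales linearly, so divide by the library size.
cost_per_compound = tuple(cost / LIBRARY_SIZE for cost in PREP_COST_RANGE_USD)

# Assumption: "a few hundred compounds" is taken as 300 for illustration.
SELECTED_COMPOUNDS = 300
subset_cost = tuple(unit * SELECTED_COMPOUNDS for unit in cost_per_compound)

print(f"Per compound:  {cost_per_compound[0]:,.0f} to {cost_per_compound[1]:,.0f} USD")
print(f"300 compounds: {subset_cost[0]:,.0f} to {subset_cost[1]:,.0f} USD")
```

Under these assumptions a compound costs between 50 and 5,000 USD to prepare, so a focused set of 300 comes to between 15,000 and 1.5 million USD, a tiny fraction of the cost of the full library.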

Beyond drug discovery [11], chemoinformatics is applied in materials science [12, 13], food science [14, 15], agriculture [16-18], chemical engineering [19-21], environmental science [22, 23], safety and toxicology [24], and more.

**Figure 4.** Chemoinformatics and its application areas.

Conclusion

Chemoinformatics is a multidisciplinary field leveraging informatics, data science, and AI to solve chemical problems and derive insights from chemical data. As technology advances, so do the tools available to chemoinformaticians, including generative AI [25] and large language models (LLMs) [26]. The growing number of computational drug design companies [27] and research publications [28, 29] underscores chemoinformatics' significance and increasing popularity.

Are you intrigued by chemoinformatics? Wondering if you have the skills to pursue this career? Spoiler alert: you do! In upcoming blog posts, I will share the diverse learning paths taken by current chemoinformaticians and provide resources to start your journey in this exciting field.

References

[1] Gasteiger, J. The Central Role of Chemoinformatics. Chemometrics and Intelligent Laboratory Systems 2006, 82 (1–2), 200–209. https://doi.org/10.1016/j.chemolab.2005.06.022.
[2] Dobson, C. M. Chemical Space and Biology. Nature 2004, 432 (7019), 824–828. https://doi.org/10.1038/nature03192.
[3] Lipinski, C.; Hopkins, A. Navigating Chemical Space for Biology and Medicine. Nature 2004, 432 (7019), 855–861. https://doi.org/10.1038/nature03193.
[4] Varnek, A.; Baskin, I. I. Chemoinformatics as a Theoretical Chemistry Discipline. Molecular Informatics 2011, 30 (1), 20–32. https://doi.org/10.1002/minf.201000100.
[5] Satoh, H.; Steiner, V.-M.; Hutter, J. “Quantum-Chemoinformatics” for Design and Discovery of New Molecules and Reactions. March 8, 2024. https://doi.org/10.26434/chemrxiv-2024-808lg.
[6] Muratov, E. N.; Varlamova, E. V.; Artemenko, A. G.; Polishchuk, P. G.; Kuz’min, V. E. Existing and Developing Approaches for QSAR Analysis of Mixtures. Molecular Informatics 2012, 31 (3–4), 202–221. https://doi.org/10.1002/minf.201100129.
[7] Rakhimbekova, A.; Madzhidov, T. I.; Nugmanov, R. I.; Gimadiev, T. R.; Baskin, I. I.; Varnek, A. Comprehensive Analysis of Applicability Domains of QSPR Models for Chemical Reactions. IJMS 2020, 21 (15), 5542. https://doi.org/10.3390/ijms21155542.
[8] Schwaller, P.; Vaucher, A. C.; Laplaza, R.; Bunne, C.; Krause, A.; Corminboeuf, C.; Laino, T. Machine Intelligence for Chemical Reaction Space. WIREs Comput Mol Sci 2022, 12 (5), e1604. https://doi.org/10.1002/wcms.1604.
[9] Jayatunga, M. K. P.; Ayers, M.; Bruens, L.; Jayanth, D.; Meier, C. How Successful Are AI-Discovered Drugs in Clinical Trials? A First Analysis and Emerging Lessons. Drug Discovery Today 2024, 29 (6), 104009. https://doi.org/10.1016/j.drudis.2024.104009.
[10] Goodnow, R. A. The Changing Feasibility and Economics of Chemical Diversity Exploration with DNA‐Encoded Combinatorial Approaches. In A Handbook for DNA‐Encoded Chemistry; Goodnow, R. A., Ed.; Wiley, 2014; pp 417–426. https://doi.org/10.1002/9781118832738.ch18.
[11] Pun, F. W.; Ozerov, I. V.; Zhavoronkov, A. AI-Powered Therapeutic Target Discovery. Trends in Pharmacological Sciences 2023, 44 (9), 561–572. https://doi.org/10.1016/j.tips.2023.06.010.
[12] Yosipof, A.; Shimanovich, K.; Senderowitz, H. Materials Informatics: Statistical Modeling in Material Science. Molecular Informatics 2016, 35 (11–12), 568–579. https://doi.org/10.1002/minf.201600047.
[13] Adams, N. Polymer Informatics. In Polymer Libraries; Meier, M. A. R., Webster, D. C., Eds.; Advances in Polymer Science; Springer Berlin Heidelberg: Berlin, Heidelberg, 2010; Vol. 225, pp 107–149. https://doi.org/10.1007/12_2009_18.
[14] Peña‐Castillo, A.; Méndez‐Lucio, O.; Owen, J. R.; Martínez‐Mayorga, K.; Medina‐Franco, J. L. Chemoinformatics in Food Science. In Applied Chemoinformatics; Engel, T., Gasteiger, J., Eds.; Wiley, 2018; pp 501–525. https://doi.org/10.1002/9783527806539.ch10.
[15] Martinez-Mayorga, K.; Medina-Franco, J. L. Chapter 2 Chemoinformatics—Applications in Food Chemistry. In Advances in Food and Nutrition Research; Elsevier, 2009; Vol. 58, pp 33–56. https://doi.org/10.1016/S1043-4526(09)58002-3.
[16] Mashabela, M. D.; Masamba, P.; Kappo, A. P. Metabolomics and Chemoinformatics in Agricultural Biotechnology Research: Complementary Probes in Unravelling New Metabolites for Crop Improvement. Biology 2022, 11 (8), 1156. https://doi.org/10.3390/biology11081156.
[17] Chen, D.; Hao, G.; Song, B. Finding the Missing Property Concepts in Pesticide-Likeness. J. Agric. Food Chem. 2022, 70 (33), 10090–10099. https://doi.org/10.1021/acs.jafc.2c02757.
[18] Barcelos, M. P.; Da Silva, C. H. T. D. P. In Silico Approaches in Pesticides. In Trends and Innovations in Energetic Sources, Functional Compounds and Biotechnology; Taft, C. A., De Almeida, P. F., Eds.; Engineering Materials; Springer Nature Switzerland: Cham, 2024; pp 335–351. https://doi.org/10.1007/978-3-031-46545-1_17.
[19] Creton, B. Chemoinformatics at IFP Energies Nouvelles: Applications in the Fields of Energy, Transport, and Environment. Molecular Informatics 2017, 36 (10), 1700028. https://doi.org/10.1002/minf.201700028.
[20] Solov’ev, V. P.; Oprisiu, I.; Marcou, G.; Varnek, A. Quantitative Structure–Property Relationship (QSPR) Modeling of Normal Boiling Point Temperature and Composition of Binary Azeotropes. Ind. Eng. Chem. Res. 2011, 50 (24), 14162–14167. https://doi.org/10.1021/ie2018614.
[21] Oprisiu, I.; Varlamova, E.; Muratov, E.; Artemenko, A.; Marcou, G.; Polishchuk, P.; Kuz’min, V.; Varnek, A. QSPR Approach to Predict Nonadditive Properties of Mixtures. Application to Bubble Point Temperatures of Binary Mixtures of Liquids. Molecular Informatics 2012, 31 (6–7), 491–502. https://doi.org/10.1002/minf.201200006.
[22] Ljoncheva, M.; Stepišnik, T.; Džeroski, S.; Kosjek, T. Cheminformatics in MS-Based Environmental Exposomics: Current Achievements and Future Directions. Trends in Environmental Analytical Chemistry 2020, 28, e00099. https://doi.org/10.1016/j.teac.2020.e00099.
[23] Lai, A. Cheminformatics and Computational Approaches for Identifying and Managing Unknown Chemicals in the Environment. 2022.
[24] Chemometrics and Cheminformatics in Aquatic Toxicology, 1st ed.; Roy, K., Ed.; Wiley, 2021. https://doi.org/10.1002/9781119681397.
[25] Gangwal, A.; Lavecchia, A. Unleashing the Power of Generative AI in Drug Discovery. Drug Discovery Today 2024, 29 (6), 103992. https://doi.org/10.1016/j.drudis.2024.103992.
[26] Bran, A. M.; Cox, S.; Schilter, O.; Baldassari, C.; White, A. D.; Schwaller, P. Augmenting Large Language Models with Chemistry Tools. Nat Mach Intell 2024, 6 (5), 525–535. https://doi.org/10.1038/s42256-024-00832-8.
[27] Nagra, N. S.; Bleys, J.; Champagne, D.; Devereson, A.; Macak, M. Understanding the Company Landscape in AI-Driven Biopharma R&D. Biopharma Dealmakers 2023. https://doi.org/10.1038/d43747-023-00020-4.
[28] Prati, R. C.; Rodrigues, B. S. M.; Aragão, I.; Soares, T. A.; Quiles, M. G.; Da Silva, J. L. F. The Impact of Interdisciplinary, Gender and Geographic Distributions on the Citation Patterns of the Journal of Chemical Information and Modeling. J. Chem. Inf. Model. 2024, 64 (4), 1107–1111. https://doi.org/10.1021/acs.jcim.3c02014.
[29] Willett, P. Commentary: The First Twelve Years of the Journal of Cheminformatics. J Cheminform 2022, 14 (1), 38. https://doi.org/10.1186/s13321-022-00617-4.