Scientific discovery in the age of artificial intelligence (Nature review)
Artificial Intelligence (AI) is increasingly being integrated into scientific discovery to augment and accelerate research, helping scientists formulate hypotheses, design experiments, collect and interpret large datasets, and gain insights that may not be possible with traditional scientific methods alone.
The past decade has seen tremendous breakthroughs in AI. These include self-supervised learning, which allows models to be trained on large amounts of unlabeled data; geometric deep learning, which leverages knowledge about the structure of scientific data to improve the accuracy and efficiency of models; and generative AI methods, which can analyze diverse data modalities, including images and sequences, to design new objects such as drug-like small molecules and proteins.
These approaches have substantially aided scientists throughout the scientific process; nevertheless, core problems remain. Both developers and users of AI tools need a better understanding of when these methods are appropriate and where they need improvement, and challenges stemming from poor data quality and stewardship persist.
These problems span many scientific disciplines, so there is now a need for foundational algorithmic approaches that can facilitate scientific understanding, or acquire it autonomously, a key focus of AI innovation.
How data is collected, transformed, and understood lays the foundation for forming scientific insights and theories, and the rise of deep learning in the early 2010s has greatly expanded the scope and ambition of these scientific discovery processes.
Artificial intelligence (AI) is now increasingly used across scientific disciplines to integrate massive data sets, refine measurements, guide experiments, explore theoretical spaces that match the data, and provide actionable and reliable models that integrate with scientific workflows to enable autonomous discovery.
Data collection and analysis are fundamental to scientific understanding and discovery, two of the core goals of science. Quantitative methods and emerging technologies, ranging from physical instruments such as microscopes to research techniques such as bootstrapping, have long been used to achieve these goals. The introduction of digitization in the 1950s paved the way for the ubiquitous use of computers in scientific research, and since the 2010s the rise of data science has enabled artificial intelligence to recognize scientifically relevant patterns in large datasets, thereby providing valuable guidance.
While scientific practices and procedures vary across the stages of scientific research, the development of AI algorithms spans disciplines that have traditionally been isolated from one another. Such algorithms can enhance the design and execution of scientific studies, and they are becoming an indispensable tool for researchers: they optimize parameters and functions; automate the collection, visualization, and processing of data; explore vast spaces of candidate hypotheses to form theories; and generate hypotheses with uncertainty estimates to suggest relevant experiments.
Science in the Age of Artificial Intelligence. Scientific discovery is a multifaceted process involving several interrelated stages, including hypothesis formation, experimental design, data collection, and analysis. Artificial intelligence can reshape scientific discovery by enhancing and accelerating research at each stage of the process. The principles and illustrative research presented here highlight the contribution of AI to improving scientific understanding and discovery.
Since the early 2010s, the power of AI methods has increased dramatically, aided by the availability of large datasets, fast and massively parallel computing and storage hardware (graphics processing units and supercomputers), and new algorithms.
The latter include deep representation learning, in particular multilayer neural networks, which can extract compact, informative features from data while solving multiple scientific tasks.
- Geometric deep learning, for example, has proved useful for integrating scientific knowledge expressed as compact mathematical statements of physical relationships, prior distributions, constraints, and other complex descriptors (e.g., the geometry of atoms in a molecule).
- Self-supervised learning enables neural networks trained on unlabeled data to transfer learned representations to different domains with fewer labeled examples, for example by pre-training large foundation models and adapting them to solve a variety of tasks across domains.
- In addition, generative models can estimate the underlying data distribution of complex systems and support new designs.
- Unlike these approaches, reinforcement learning methods find the optimal policy for an environment by exploring many possible scenarios and assigning rewards to different actions, based on metrics such as the expected information gain of the experiments under consideration.
In AI-driven scientific discovery, scientific knowledge can be incorporated into AI models through appropriate inductive biases: assumptions about structure, symmetry, constraints, and prior knowledge expressed as compact mathematical statements. However, applying these laws directly can produce equations too complex for humans to solve, even with traditional numerical methods.
An emerging approach is to incorporate scientific knowledge into AI models, including information about underlying equations such as the laws of physics, or molecular structure and binding principles in protein folding. Such inductive biases enhance AI models by reducing the number of training examples needed to reach a given accuracy and by extending analysis to vast spaces of unexplored scientific hypotheses.
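To make the idea of an inductive bias concrete, here is a minimal, hypothetical sketch (not from the paper): featurizing a molecular conformation by its pairwise interatomic distances bakes rotation and translation invariance into any downstream model, rather than forcing the model to learn the symmetry from data.

```python
import numpy as np

def pairwise_distance_features(coords: np.ndarray) -> np.ndarray:
    """Map an (n_atoms, 3) conformation to its flattened upper-triangular
    pairwise-distance matrix. Distances are unchanged by rigid-body motions,
    so any model trained on these features inherits that symmetry for free."""
    diffs = coords[:, None, :] - coords[None, :, :]   # (n, n, 3)
    dists = np.linalg.norm(diffs, axis=-1)            # (n, n)
    iu = np.triu_indices(len(coords), k=1)
    return dists[iu]

# Sanity check: a random rigid-body motion leaves the features unchanged.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                    # toy 5-atom "molecule"
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
x_moved = x @ q.T + rng.normal(size=(1, 3))    # rotate + translate
assert np.allclose(pairwise_distance_features(x),
                   pairwise_distance_features(x_moved))
```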
Utilizing AI for scientific innovation and discovery presents unique challenges compared with other areas where AI has been applied. One of the biggest is the vastness of the hypothesis space in scientific problems, which makes systematic exploration infeasible.
For example, in biochemistry there are an estimated 10^60 drug-like molecules to explore. AI systems have the potential to revolutionize scientific workflows by accelerating them and delivering predictions that approach experimental accuracy. However, obtaining reliably annotated datasets for AI models can involve time-consuming, resource-intensive experiments and simulations. Despite these challenges, AI systems can achieve efficient, intelligent, and highly autonomous experimental design and data collection, operating under human supervision to assess, evaluate, and act on results. This capability enables AI agents that interact continuously with dynamic environments, for example making real-time decisions to navigate stratospheric balloons.
AI systems can play an important role in interpreting scientific datasets and in extracting and generalizing relationships and knowledge from the scientific literature. Recent findings suggest that unsupervised language models can capture complex scientific concepts, such as the periodic table of elements, and predict applications of functional materials years before their discovery, suggesting that knowledge of future discoveries may be latent in past publications.
Researchers applied a skip-gram variant of Word2vec, trained to predict the context words that appear near a target word, to a text corpus of materials-science abstracts. The results demonstrate that unsupervised methods can recommend materials for functional applications years before their discovery.
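As an illustration, here is a minimal sketch of this skip-gram setup using the gensim library; the three-"abstract" corpus and the material/application pairing are invented stand-ins for the millions of abstracts used in the actual study.

```python
from gensim.models import Word2Vec

# Toy stand-in for a corpus of pre-tokenized materials-science abstracts.
abstracts = [
    ["thermoelectric", "performance", "of", "Bi2Te3", "thin", "films"],
    ["Bi2Te3", "exhibits", "low", "thermal", "conductivity"],
    ["photovoltaic", "efficiency", "of", "perovskite", "solar", "cells"],
]

# sg=1 selects the skip-gram objective: predict context words around a target.
model = Word2Vec(sentences=abstracts, vector_size=100, window=5,
                 min_count=1, sg=1, epochs=50, seed=0)

# Materials can then be ranked for an application (e.g. "thermoelectric")
# by cosine similarity between the material and application embeddings.
print(model.wv.similarity("Bi2Te3", "thermoelectric"))
```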
AlphaFold Generates Highly Accurate Protein Structures
Scientists have introduced a molecular simulation scheme, scalable to large systems with quantum-mechanical accuracy, based on many-body potentials and interatomic forces generated by carefully designed deep neural networks trained on ab initio data.
Recent advances, including the successful unraveling of the 50-year-old protein folding problem and AI-driven simulations of multi-million-particle molecular systems, have demonstrated the potential of AI to solve challenging scientific problems. However, along with the major discoveries, the emerging field of Artificial Intelligence for Science (AI4Science) also faces significant challenges. As with any new technology, the success of AI4Science depends on our ability to integrate it into everyday practice and understand its potential and limitations.
These challenges must nonetheless be confronted: barriers to the widespread adoption of AI in scientific discovery include internal and external factors specific to each stage of the discovery process, as well as concerns about the utility and potential misuse of methods, theories, software, and hardware.
The increasing size and complexity of datasets collected by experimental platforms has led to an increasing reliance in scientific research on real-time processing and high-performance computing to selectively store and analyze data generated at high speeds.
1) Data Selection
A typical particle collision experiment generates over 100 TB of data per second. These types of scientific experiments are pushing the limits of existing data transfer and storage technologies. In these physics experiments, more than 99.99% of instrumented raw data are "background events" that must be detected and discarded in real time to control data transfer rates.
To identify rare events for future scientific study, deep learning methods replace pre-programmed hardware event triggers with algorithms that search for outlier signals, detecting unexpected or rare phenomena that might otherwise be discarded during compression. The background process can be modeled generatively with a deep autoencoder: the autoencoder returns higher loss values (anomaly scores) for previously unseen signals that are not part of the background distribution (rare events).
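A minimal sketch of this idea, assuming PyTorch: an autoencoder fitted only to background-like events yields a reconstruction-error anomaly score that is high for events unlike the background. The feature dimension and training data here are placeholders.

```python
import torch
from torch import nn

# Autoencoder trained only on "background" events: it reconstructs the
# background well and assigns high error (anomaly score) to unusual events.
class AutoEncoder(nn.Module):
    def __init__(self, n_features: int = 64, n_latent: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, n_latent))
        self.decoder = nn.Sequential(nn.Linear(n_latent, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
background = torch.randn(1024, 64)       # stand-in for background events

for _ in range(200):                     # train to reconstruct background
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(background), background)
    loss.backward()
    opt.step()

def anomaly_score(events: torch.Tensor) -> torch.Tensor:
    """Per-event reconstruction error; rare signals score high."""
    with torch.no_grad():
        return ((model(events) - events) ** 2).mean(dim=1)
```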
Unlike supervised anomaly detection, unsupervised anomaly detection does not require annotation and has been widely used in physics, neuroscience, earth science, oceanography and astronomy.
2) Data annotation
Training supervised models requires datasets with annotated labels that provide supervised information to guide model training and estimate functional or conditional distributions of target variables based on inputs.
Pseudo-labeling and label propagation are attractive alternatives to laborious manual annotation, requiring only a small fraction of accurate labels to automatically annotate massive unlabeled datasets. In biology, techniques for assigning functional and structural labels to newly characterized molecules are crucial for downstream training of supervised models, because generating labels experimentally is very difficult.
For example, despite the continuous development of next-generation sequencing technologies, less than 1% of sequenced proteins are annotated with biological functions. One annotation strategy is to use proxy models trained on manually labeled data to annotate unlabeled samples and then use these predicted pseudo-labels to supervise downstream prediction models. In contrast, label propagation diffuses labels to unlabeled samples through a similarity graph constructed from feature embeddings (a brief sketch follows below). Beyond automatic labeling, active learning identifies the most informative data points to label manually, or the most informative experiments to conduct, allowing models to be trained with fewer expert-provided labels.
Another strategy for data labeling is to use domain knowledge to develop labeling rules.
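As an illustration of the label-propagation strategy described above, here is a short sketch using scikit-learn's LabelSpreading (a common label-propagation variant) on a toy dataset where only ten points carry labels; all names and sizes are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

# Start from a dataset where only ~1% of points carry labels
# (unlabeled points are marked -1, as scikit-learn expects).
X, y_true = make_moons(n_samples=1000, noise=0.1, random_state=0)
y = np.full(len(X), -1)
rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=10, replace=False)
y[labeled] = y_true[labeled]

# Labels diffuse over a similarity graph built from the feature space.
model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)
print("accuracy on unlabeled points:",
      (model.transduction_[y == -1] == y_true[y == -1]).mean())
```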
3) Data Generation
The performance of deep learning improves with the quality, diversity, and size of the training dataset. An effective way to build better models is to augment the training dataset with additional synthetic data points generated through automated data augmentation and deep generative models.
Beyond manually designed augmentations, reinforcement learning methods can discover automatic augmentation strategies that are both flexible and independent of downstream models. Deep generative models, including variational autoencoders, generative adversarial networks, normalizing flows, and diffusion models, can learn the underlying data distribution and sample training points from the optimized distribution.
Generative adversarial networks have proved useful for scientific imaging because they can synthesize realistic images in many settings, including particle collision events, pathology slides, chest X-rays, magnetic resonance contrasts, three-dimensional (3D) material microstructures, protein functions, and gene sequences. Probabilistic programming is an emerging generative modeling technique that represents data-generating models as computer programs.
4) Data Refinement
Precision instruments such as ultra-high-resolution lasers and non-invasive microscopy systems can produce highly accurate results, measuring physical quantities of real-world objects directly or computing them indirectly.
Artificial intelligence can greatly improve measurement resolution, reduce noise, and reduce measurement errors, achieving high accuracy that is consistent across sites. Examples of AI applications in scientific experiments include visualizing regions of space-time such as black holes, capturing particle collisions, improving the resolution of live-cell images, and better detecting cell types across biological contexts.
Deep-learning-based deconvolution methods exploit algorithmic advances such as spectral deconvolution, sparsity priors, and generative modeling to transform measurements with poor spatio-temporal resolution into high-quality, super-resolved, structured images.
Denoising is an important AI task across scientific disciplines: it involves distinguishing relevant signal from noise and learning to remove the noise. Denoising autoencoders (DAEs) project high-dimensional input data into a more compact representation of the underlying features; they minimize the discrepancy between uncorrupted input data points and data points reconstructed from a compressed representation of a noise-corrupted version. Other distribution-learning autoencoders, such as variational autoencoders (VAEs), are also frequently used: VAEs learn stochastic latent representations that preserve essential data features while ignoring non-essential sources of variation that may amount to random noise.
For example, in single-cell genomics, autoencoders operate on count-based gene-expression vectors across millions of cells and are commonly used to improve analyses of protein and RNA expression.
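A minimal sketch of the denoising objective, assuming PyTorch: the network sees a noise-corrupted copy of each measurement but is scored against the clean version, so it learns to strip the corruption.

```python
import torch
from torch import nn

# Denoising objective: corrupt the input with noise, then train the
# autoencoder to reconstruct the *clean* signal from the corrupted copy.
encoder = nn.Sequential(nn.Linear(100, 16), nn.ReLU())
decoder = nn.Sequential(nn.Linear(16, 100))
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

clean = torch.randn(512, 100)            # stand-in for clean measurements
for _ in range(100):
    noisy = clean + 0.3 * torch.randn_like(clean)    # corruption process
    recon = decoder(encoder(noisy))
    loss = nn.functional.mse_loss(recon, clean)      # compare against clean
    opt.zero_grad()
    loss.backward()
    opt.step()
```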
Deep learning can extract meaningful representations of scientific data at multiple levels of abstraction and optimize them, often via end-to-end learning, to guide research. High-quality representations should preserve as much information about the data as possible while remaining simple and accessible. Scientifically meaningful representations are compact and discriminative, disentangle the underlying factors of variation, and encode underlying mechanisms that generalize across many tasks.
1) Geometric priors
Integrating geometric priors into learned representations has proved effective because geometry and structure play a central role in the sciences. Symmetry is a widely studied concept in geometry; it can be described in terms of invariance and equivariance, characterizing the behavior of a mathematical function, such as a neural feature encoder, under a group of transformations (e.g., the SE(3) group in rigid-body dynamics). Important structural properties, such as the secondary-structure content of a molecular system, solvent accessibility, residue compactness, and hydrogen-bonding patterns, are independent of spatial orientation.
In scientific image analysis, an object's identity does not change when it is translated within an image, and image segmentation masks are translation-equivariant: they change in the same way as the input pixels when the image is shifted. Incorporating symmetry into a model, for instance by augmenting the training samples, can benefit AI models working with limited labeled datasets (e.g., 3D ribonucleic acid and protein structures) and can improve extrapolation to inputs that differ significantly from those encountered during training.
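One lightweight way to inject such symmetry, sketched below under the assumption of NumPy-format 3D coordinates, is to augment each labeled structure with randomly rotated copies that share its label.

```python
import numpy as np

def random_rotation_matrix(rng) -> np.ndarray:
    """Sample a random 3D rotation via QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))          # fix column signs
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1                 # ensure det = +1 (rotation, not reflection)
    return q

def augment_with_rotations(coords, label, n_copies, rng):
    """Expand one labeled 3D structure into n_copies rotated views that share
    the same label, encouraging the model to learn rotation invariance."""
    return [(coords @ random_rotation_matrix(rng).T, label)
            for _ in range(n_copies)]

rng = np.random.default_rng(0)
structure = rng.normal(size=(20, 3))   # toy 3D structure (e.g. an RNA backbone)
augmented = augment_with_rotations(structure, label=1, n_copies=8, rng=rng)
```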
2) Geometric deep learning
Graph neural networks have become the primary method for deep learning on datasets with underlying geometric and relational structure. Broadly, geometric deep learning involves discovering relational patterns and equipping neural network models with inductive biases through neural message-passing algorithms that explicitly exploit local information encoded in graphs and transformation groups.
Depending on the scientific problem, scientists have developed various graph representations to capture complex systems. Directed edges aid the physical modeling of glassy systems; hypergraphs, whose edges connect multiple nodes, are used to understand chromatin structure; models trained on multimodal graphs create predictive models in genomics; and sparse, irregular graphs have been applied to many Large Hadron Collider physics tasks, including reconstructing particles from detector readouts and distinguishing physical signals from background processes.
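For intuition, here is a bare-bones sketch of one neural message-passing round on a small graph (NumPy, with random weights standing in for learned ones): each node aggregates its neighbors' features and updates its own embedding.

```python
import numpy as np

def message_passing_layer(h, edges, w_msg, w_self):
    """One round of neural message passing on a graph.

    h      : (n_nodes, d) node features
    edges  : list of (src, dst) pairs
    w_msg  : (d, d) weights applied to aggregated incoming messages
    w_self : (d, d) weights applied to each node's own features
    """
    agg = np.zeros_like(h)
    for src, dst in edges:            # each node sums messages from neighbors
        agg[dst] += h[src]
    return np.tanh(h @ w_self + agg @ w_msg)   # update node embeddings

# Toy molecule-like graph: 4 nodes in a chain, bidirectional edges.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
edges = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
h = message_passing_layer(h, edges, rng.normal(size=(8, 8)) * 0.1,
                          rng.normal(size=(8, 8)) * 0.1)
```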
a) Geometric deep learning integrates the geometric, structural, and symmetry information of scientific data (e.g., molecules and materials) using graphs and neural message-passing strategies. This approach generates latent representations (embeddings) by exchanging neural messages along the edges of a graph while respecting other geometric priors, such as invariance and equivariance constraints. As a result, geometric deep learning can incorporate complex structural information into deep learning models, improving the understanding and processing of the underlying geometric dataset. b) To effectively represent diverse samples such as satellite images, it is crucial to capture their similarities and differences. Self-supervised strategies such as contrastive learning accomplish this by generating augmented counterparts, aligning positive pairs while separating negative pairs. This iterative process strengthens the embeddings, yielding informative latent representations and better performance on downstream prediction tasks. c) Masked language modeling effectively captures the semantics of sequence data such as natural language and biological sequences. The masked input elements are fed into a transformer block, after preprocessing steps such as positional encoding. The self-attention mechanism is depicted as gray lines whose intensity reflects the magnitude of the attention weights; these are combined with representations of the unmasked inputs to predict the masked inputs. Repeating this auto-completion process over many elements of the input produces high-quality sequence representations.
3) Self-supervised learning
Supervised learning may not be sufficient when only a small number of labeled samples are available for model training, or when the cost of labeling data for a specific task is too high. In such cases, utilizing both labeled and unlabeled data can improve model performance and learning.
Self-supervised learning is a technique that enables a model to learn general features of a dataset without relying on explicit labels. Effective self-supervised strategies include predicting occluded regions of an image, predicting past or future frames of a video, and using contrastive learning to teach the model to distinguish similar from dissimilar data points.
Self-supervised learning is a key pre-training step: the model learns transferable features from large unlabeled datasets and is then fine-tuned on small labeled datasets for downstream tasks. Such pre-trained models have a broad understanding of a scientific domain and are general-purpose predictors applicable to a wide variety of tasks, improving label efficiency beyond purely supervised methods.
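A minimal sketch of the contrastive objective mentioned above (an InfoNCE-style loss, assuming PyTorch): two augmented views of the same sample form a positive pair, and the rest of the batch serves as negatives.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1):
    """Contrastive (InfoNCE-style) loss: row i of z1 and row i of z2 are two
    augmented views of the same sample (a positive pair); all other rows in
    the batch act as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau                 # cosine similarities
    targets = torch.arange(len(z1))          # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Usage: embed two augmentations of the same batch and pull positives together.
z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
loss = info_nce_loss(z1, z2)
```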
4) Language Modeling
Masked language modeling is a common method for self-supervised learning of natural languages and biological sequences.
Arranging atoms or amino acids (tokens) into structures that produce molecular and biological functions is analogous to letters forming words and sentences that define the meaning of a document. As natural language processing and biological sequence modeling continue to evolve, they reinforce each other. In autoregressive training, the goal is to predict the next token in a sequence, whereas in masked training the self-supervised task is to recover masked tokens using the bidirectional sequence context.
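A small sketch of the masking step at the heart of this objective, assuming PyTorch; the token IDs and masking rate are illustrative, and the model call in the final comment is hypothetical.

```python
import torch

def mask_tokens(tokens: torch.Tensor, mask_id: int, mask_prob: float = 0.15):
    """BERT-style masking: randomly hide ~15% of tokens; the model is trained
    to recover the originals at exactly those positions using the
    bidirectional context of the unmasked tokens."""
    mask = torch.rand(tokens.shape) < mask_prob
    corrupted = tokens.clone()
    corrupted[mask] = mask_id
    return corrupted, mask

# Toy "protein sequence" as integer tokens (20 amino acids + a [MASK] id).
seq = torch.randint(0, 20, (1, 64))
corrupted, mask = mask_tokens(seq, mask_id=20)
# Training objective (model is hypothetical):
# loss = cross_entropy(model(corrupted)[mask], seq[mask])
```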
Protein language models can encode amino acid sequences, capturing structural and functional properties and assessing the evolutionary fitness of viral variants. These representations can be used for tasks ranging from sequence design to structure prediction. For biochemical sequences, chemical language models help efficiently explore the vast chemical space; today they are used to predict properties, plan multi-step syntheses, and explore chemical reaction space.
5) Transformer Architecture
The transformer is a neural architecture that processes token sequences by flexibly modeling interactions between arbitrary pairs of tokens, going beyond earlier efforts to model sequences with recurrent neural networks.
Transformers dominate natural language processing and have been successfully applied to a range of problems, including seismic signal detection, modeling DNA and protein sequences, modeling the effects of sequence variation on biological function, and symbolic regression. While transformers unify graph neural networks and language models, their runtime and memory footprint can grow quadratically with sequence length, posing efficiency challenges for long-range modeling that linearized attention mechanisms aim to address.
Therefore, transformers are typically pre-trained with unsupervised or self-supervised generative objectives and then fine-tuned in a parameter-efficient manner.
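For context on the quadratic cost mentioned above, here is a minimal sketch of scaled dot-product self-attention, assuming PyTorch: the token-by-token score matrix is what grows quadratically with sequence length.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Core transformer operation: every token attends to every other token,
    which is why cost is quadratic in sequence length."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (seq, seq)
    weights = F.softmax(scores, dim=-1)   # attention weights per token pair
    return weights @ v                    # weighted mixture of value vectors

seq_len, d = 128, 64
q = k = v = torch.randn(seq_len, d)       # self-attention: shared inputs
out = scaled_dot_product_attention(q, k, v)
```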
6) Neural Operators
Standard neural network models may not meet the needs of scientific applications because they assume that the data discretization is fixed. This approach is not suitable for many scientific datasets collected at different resolutions and grids. In addition, data are often sampled from underlying physical phenomena (such as seismic activity or fluid flow) in a continuous domain.
Neural operators learn representations that are not affected by discretization by learning mappings between function spaces. Neural operators are guaranteed to be discretization invariant, which means they can handle any discretized input and converge to a limit as the mesh is refined. Neural operators are trained to be evaluated at any resolution without retraining. In contrast, the performance of standard neural networks degrades when the resolution of the data during deployment changes from the resolution of the data at which the model was trained.
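To illustrate where discretization invariance can come from, here is a sketch of a single spectral (Fourier-type) layer of the kind used in Fourier neural operators, not the full architecture: it acts on a function's low Fourier modes, so the same learned weights apply to the function sampled on any grid.

```python
import numpy as np

def spectral_conv_1d(u: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """One Fourier-type operator layer: multiply the function's lowest
    Fourier modes by learned weights. The operation is defined on the
    function itself, independently of the grid on which u is sampled."""
    n_modes = len(weights)
    u_hat = np.fft.rfft(u)                 # to frequency space
    u_hat[:n_modes] *= weights             # learned multipliers on low modes
    u_hat[n_modes:] = 0                    # truncate high frequencies
    return np.fft.irfft(u_hat, n=len(u))   # back to physical space

rng = np.random.default_rng(0)
w = rng.normal(size=8) + 1j * rng.normal(size=8)   # learned complex weights
coarse = np.sin(2 * np.pi * np.linspace(0, 1, 64, endpoint=False))
fine = np.sin(2 * np.pi * np.linspace(0, 1, 256, endpoint=False))
# The same weights apply to both discretizations of the same function.
out_coarse, out_fine = spectral_conv_1d(coarse, w), spectral_conv_1d(fine, w)
```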
Testable hypotheses are at the heart of scientific discovery. They take many forms, from symbolic expressions in mathematics to molecules in chemistry and genetic variants in biology. For example, Johannes Kepler spent four years analyzing stellar and planetary data before arriving at the hypothesis that led to the discovery of the laws of planetary motion.
Artificial intelligence methods can play a role in multiple stages of this process. They can identify candidate symbolic expressions from noisy observations that lead to hypotheses; they can help design objects, such as molecules that bind to therapeutic targets or counterexamples that contradict mathematical conjectures, suggesting experimental evaluations in the lab.
In addition, AI systems can learn the Bayesian posterior distribution of a hypothesis and use it to generate hypotheses that match scientific data and knowledge.
a) High-throughput screening uses AI predictors trained on experimentally generated datasets to screen a candidate pool for the few objects with desirable properties, reducing the pool size by several orders of magnitude. This approach can use self-supervised learning to pre-train the predictor on large numbers of unscreened objects and then fine-tune it on a dataset of screened objects with labeled readouts. Laboratory evaluation and uncertainty quantification can refine this approach, streamlining the screening process, making it more cost- and time-efficient, and ultimately accelerating the identification of candidate compounds, materials, and biomolecules. b) An AI navigator uses rewards predicted by a reinforcement learning agent together with a design criterion (e.g., Occam's razor) to focus on the most promising elements of the candidate hypotheses during symbolic regression. The example in the figure illustrates inferring the mathematical expression for Newton's law of universal gravitation; low-scoring search paths appear as gray branches in the symbolic expression tree. c) An AI differentiator is an autoencoder model that maps discrete objects (e.g., compounds) to points in a differentiable, continuous latent space. This space allows objects to be optimized, such as selecting, from a large chemical library, compounds that maximize specific biochemical endpoints. The idealized landscape depicts the learned latent space, with darker colors indicating regions enriched in objects with higher predicted scores. By exploiting this latent space, the AI differentiator can efficiently identify objects that maximize the desired attributes, shown as red stars.
1) Black-box predictors for scientific hypotheses
Identifying promising hypotheses for scientific exploration requires efficiently examining many candidate hypotheses and selecting those that maximize downstream simulation and experimental gains.
In drug discovery, high-throughput screening can evaluate thousands to millions of molecules, and algorithms can prioritize which molecules to study experimentally. Models can be trained to predict the utility of experiments, such as relevant molecular properties or symbolic formulas that fit the observations. However, ground-truth experimental data for these predictors may be unavailable for many molecules. Weakly supervised learning methods can therefore train these models using noisy, limited, or imprecise supervision as the training signal, offering a cost-effective alternative to annotation by human experts, expensive in silico computation, or higher-fidelity experiments.
AI methods trained on high-fidelity simulations have been used to efficiently screen large molecular libraries, such as 1.6 million organic light-emitting diode candidates and 11 billion synthetic ligand candidates. In genomics, transformer architectures trained to predict gene expression from DNA sequences help prioritize gene variants. In particle physics, identifying intrinsic charm quarks in the proton requires screening all candidate structures and fitting experimental data to each one. To improve efficiency further, candidates screened by AI can be passed to low- and medium-throughput experiments, and the experimental feedback used to refine them continuously. Experimental results can be fed back into the AI model through active learning and Bayesian optimization, allowing the algorithm to sharpen its predictions and focus on the most promising candidates.
AI approaches become especially valuable when hypotheses involve complex objects such as molecules. In protein folding, for example, AlphaFold2 can predict the 3D atomic coordinates of a protein from its amino acid sequence with atomic-level accuracy, even for proteins whose structures differ from anything in the training dataset. This breakthrough has fueled various AI-driven protein folding methods, such as RoseTTAFold. In addition to such forward problems, AI methods are increasingly used for inverse problems, which aim to identify the causal factors behind a set of observations. Inverse problems such as inverse folding, or fixed-backbone design, can use black-box predictors trained on millions of protein structures to predict amino acid sequences from the 3D atomic coordinates of a protein backbone.
However, such black-box AI predictors require large training datasets and have limited interpretability despite reduced reliance on existing scientific knowledge.
2) Navigating the combinatorial hypothesis space
While sampling all hypotheses that match the data is daunting, a manageable goal is to find a good hypothesis, which can be formulated as an optimization problem. In contrast to traditional approaches that rely on manually designed rules, AI strategies can estimate the value of each search direction and prioritize those with higher value. A policy learned with a reinforcement learning algorithm is usually used: the agent learns to take actions in the search space that maximize a reward signal, defined to reflect the quality of the generated hypotheses or other relevant criteria.
To solve this optimization problem, the symbolic regression task can use an evolutionary algorithm that generates random symbolic expressions as an initial population. In each generation, candidate solutions are perturbed slightly; the algorithm checks whether any modification produces a symbolic law that fits the observations better than previous solutions, and it retains the best solutions for the next generation.
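A toy sketch of such an evolutionary loop (NumPy; the restricted expression family and mutation rules are invented for brevity): candidates are scored by their fit to observations, mutated, and culled each generation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 5, size=100)
y = 3.0 * x**2 + x                      # hidden "law" to rediscover

UNARY = [np.sin, np.cos, np.square, lambda v: v]  # tiny operator vocabulary

def random_expr():
    """A candidate law of the restricted form c1*f(x) + c2*g(x)."""
    return (rng.choice(UNARY), rng.normal(), rng.choice(UNARY), rng.normal())

def evaluate(expr):
    f, c1, g, c2 = expr
    pred = c1 * f(x) + c2 * g(x)
    return np.mean((pred - y) ** 2)     # fitness: fit to observations

def mutate(expr):
    f, c1, g, c2 = expr
    if rng.random() < 0.5:              # perturb the constants...
        c1, c2 = c1 + rng.normal(scale=0.3), c2 + rng.normal(scale=0.3)
    else:                               # ...or swap in a new operator
        f = rng.choice(UNARY)
    return (f, c1, g, c2)

# Evolutionary loop: mutate the population, keep the best-fitting candidates.
population = [random_expr() for _ in range(50)]
for generation in range(200):
    population += [mutate(e) for e in population]
    population.sort(key=evaluate)
    population = population[:50]
print("best MSE:", evaluate(population[0]))
```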
However, reinforcement learning methods are gradually replacing this standard strategy. They use neural networks to generate mathematical expressions sequentially, adding symbols from a predefined vocabulary according to a learned policy. The mathematical formula is represented as a parse tree, and the learned policy takes the parse tree as input to decide which leaf node to expand and which symbol from the vocabulary to add. Another approach is to encode the mathematical expression as a sequence of symbols and let a neural network policy append one symbol at a time according to learned probabilities. By designing a reward that measures the ability to refute a conjecture, this approach can find refutations of mathematical conjectures without prior knowledge of the mathematical problem.
Combinatorial optimization also applies to tasks such as discovering molecules with desirable drug properties, where each step of molecule design is a discrete decision. A partially generated molecular graph is fed into a learned policy, which makes discrete choices about where in the molecule to add new atoms and which atoms to add. Executed iteratively, the policy generates a set of candidate molecular structures and evaluates how well they match the target properties. The search space is too vast to explore exhaustively, but reinforcement learning can guide the search effectively by prioritizing the most promising branches. Reinforcement learning methods can also be trained with objectives that encourage the policy to sample from all reasonable high-reward solutions, rather than converging on a single good solution as standard reward maximization does.
These reinforcement learning methods have been successfully applied to a variety of optimization problems, including maximizing protein expression, planning hydroelectric power generation to reduce adverse impacts on the Amazon basin, and exploring the parameter space of particle accelerators.
Policies learned by AI agents have included actions that initially seemed unconventional but proved effective. For example, in mathematics, supervised models can identify patterns and relationships between mathematical objects, helping to guide intuition and formulate conjectures; such analyses can point to previously unknown patterns or even new models of the world. However, reinforcement learning methods may generalize poorly to unseen data because the agent can fall into a local optimum after finding one sequence of valid actions. To improve generalization, exploration strategies are needed that collect a wider range of search trajectories, helping the agent perform well in new and modified environments.
3) Optimizing differentiable hypothesis spaces
Scientific hypotheses usually come in the form of discrete objects, such as symbolic formulas in physics or compounds in pharmaceutical and materials science. While combinatorial optimization techniques have been successful in solving some of these problems, the differentiable space can also be used for optimization because it lends itself to gradient-based methods, which are effective in finding local optima.
In order to be able to use gradient-based optimization methods, there are two approaches that are often used:
- The first is to map discrete candidate hypotheses to points in a differentiable latent space using a model such as a VAE.
- The second approach is to relax discrete hypotheses into differentiable objects that can be optimized in a differentiable space. This relaxation can take different forms, such as replacing discrete variables with continuous ones, or using soft versions of the original constraints.
Applications of symbolic regression in physics use grammar VAEs. These models use a context-free grammar to represent discrete symbolic expressions as parse trees and map the parse trees to a differentiable latent space. Bayesian optimization then optimizes symbolic laws in the latent space while ensuring that expressions remain syntactically valid.
In astrophysics, VAEs have been used to estimate gravitational-wave detector parameters from pre-trained black-hole waveform models. This approach is up to six orders of magnitude faster than conventional methods, making it practical to capture transient gravitational-wave events. In materials science, thermodynamic rules have been combined with autoencoders to build an interpretable latent space for recognizing phase diagrams of crystal structures. In chemistry, models such as the Simplified Molecular Input Line Entry System (SMILES)-VAE transform SMILES strings (i.e., notations that represent chemical structures as discrete sequences of symbols easily parsed by computers) into a differentiable latent space that can be optimized with Bayesian optimization techniques. By representing molecular structures as points in the latent space, we can design differentiable objectives and optimize them, using self-supervised learning to predict molecular properties from latent representations.
This means that we can optimize discrete molecular structures by back-propagating the gradient of an AI predictor to a continuous-valued representation of the molecular input. The decoder can transform these molecular representations into approximate counterparts of the discrete inputs, an approach that can be used in the design of proteins and small molecules.
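A minimal sketch of this latent-space optimization loop, assuming PyTorch; the decoder and property predictor here are randomly initialized stand-ins for pretrained models.

```python
import torch

# Given a decoder and a differentiable property predictor (both pretrained in
# practice; random stand-ins here), optimize a latent point by gradient
# ascent, then decode it back into a candidate structure.
decoder = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 128))
predictor = torch.nn.Sequential(torch.nn.Linear(128, 32), torch.nn.ReLU(),
                                torch.nn.Linear(32, 1))

z = torch.zeros(1, 16, requires_grad=True)   # starting point in latent space
opt = torch.optim.Adam([z], lr=0.05)

for _ in range(100):
    opt.zero_grad()
    loss = -predictor(decoder(z)).mean()     # ascend the property landscape
    loss.backward()                          # gradient flows back to z
    opt.step()

candidate = decoder(z).detach()              # decode the optimized latent point
```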
Optimization in latent space can model the underlying data distribution more flexibly than mechanistic approaches in the original hypothesis space. However, extrapolative predictions obtained by exploring sparse regions of the hypothesis space may not work well. In many scientific disciplines, the hypothesis space can be far larger than can be examined experimentally: for example, it has been estimated that there are approximately 10^60 drug-like molecules, whereas even the largest chemical libraries contain fewer than 10^10 molecules.
Thus, there is an urgent need for a method to efficiently search and identify high-quality candidate solutions in these largely unexplored regions.
Evaluating scientific hypotheses through experiments is crucial for scientific discovery, but laboratory experiments can be costly and impractical. Computer simulations have emerged as a promising alternative, offering more efficient and flexible experimentation. Unlike physical experiments, however, simulations rely on hand-crafted parameters and heuristics to mimic real-world scenarios, and they must trade accuracy against speed, which requires an understanding of the underlying mechanisms.
However, with the advent of deep learning, these challenges are being addressed by identifying and optimizing hypotheses for efficient testing, as well as empowering computer simulations to link observations to hypotheses.
1) Efficient assessment of scientific hypotheses
AI systems provide experimental design and optimization tools that can augment the traditional scientific method, reducing the number of experiments required and conserving resources. Specifically, AI systems can assist with the two basic steps of experimental testing: planning and guiding. In traditional methods, these two steps often require repeated trials that are inefficient, costly, and sometimes life-threatening. AI planning provides a systematic approach to designing experiments, optimizing experimental efficiency, and exploring uncharted territory. At the same time, AI guidance directs the experimental process toward high-yield hypotheses, allowing the system to learn from previous observations and adjust the experimental process. These AI approaches can be model-based, using simulations and prior knowledge, or model-free, based solely on machine learning algorithms.
Artificial intelligence systems can help plan experiments by optimizing resource utilization and reducing unnecessary investigations. Unlike hypothesis searching, experiment planning involves procedures and steps in the design of scientific experiments. An example of this is synthetic planning in chemistry. Synthetic planning involves finding a sequence of steps by which a target compound can be synthesized from an existing chemical. Artificial intelligence systems can design synthetic routes for desired compounds, thus reducing the need for human intervention.
Active learning is also used in materials discovery and synthesis. Active learning involves repeatedly interacting with and learning from experimental feedback to refine hypotheses. Material synthesis is a complex and resource-intensive process that requires efficient exploration of high-dimensional parameter spaces. Active learning utilizes uncertainty estimation to explore the parameter space, reducing uncertainty in as few steps as possible.
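A compact sketch of uncertainty-guided active learning, assuming scikit-learn: a random-forest ensemble's per-tree disagreement serves as the uncertainty estimate, and each round queries the most uncertain candidate. The "experiment" is a synthetic function standing in for a real synthesis run.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_pool = rng.uniform(-3, 3, size=(500, 4))      # unexplored parameter space
X_seen = rng.uniform(-3, 3, size=(20, 4))       # already-run experiments
y_seen = np.sin(X_seen).sum(axis=1)             # stand-in experimental readout

for _ in range(10):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_seen, y_seen)
    # Disagreement across the ensemble approximates predictive uncertainty.
    per_tree = np.stack([t.predict(X_pool) for t in model.estimators_])
    idx = per_tree.var(axis=0).argmax()          # most uncertain candidate
    x_new = X_pool[idx]
    y_new = np.sin(x_new).sum()                  # "run" the experiment
    X_seen = np.vstack([X_seen, x_new])
    y_seen = np.append(y_seen, y_new)
    X_pool = np.delete(X_pool, idx, axis=0)
```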
In ongoing experiments, decisions usually have to be adjusted in real time. However, this process is both difficult and error-prone when based solely on human experience and intuition. Reinforcement learning provides an alternative approach to continuously respond to changing environments, maximizing the safety and success of experiments. For example, reinforcement learning methods have been shown to be effective for magnetic control of tokamak plasmas, where algorithms interact with the tokamak simulator to optimize strategies for controlling the process. In another study, a reinforcement learning agent used real-time feedback such as wind speed and solar altitude to control a stratospheric balloon and find favorable wind currents for navigation.
In quantum physics, experimental designs need to be dynamically adapted, as the best options for future materialization of complex experiments may be counter-intuitive: reinforcement learning methods can overcome this problem by iteratively designing experiments and receiving experimental feedback. For example, reinforcement learning algorithms have been used to optimize the measurement and control of quantum systems, improving experimental efficiency and precision.
a) Fusion control of complex dynamic systems using artificial intelligence: Degrave et al. developed an AI controller to regulate nuclear fusion through the magnetic field of a tokamak reactor. The AI agent receives real-time measurements of voltage levels and plasma configuration and acts to control the magnetic field to achieve experimental goals, such as maintaining a stable plasma configuration. b) In computational simulations of complex systems, AI accelerates the detection of rare events, such as transitions between different conformational states of proteins. c) A neural framework for solving partial differential equations, where the AI solver is a physics-informed neural network trained to estimate the target function f. When the expression of the differential equation is unknown (parameterized by η), the equation can be estimated by minimizing a multi-objective loss, simultaneously optimizing the functional form of the equation and its fit to the observations y.
2) Using simulations to derive observables from hypotheses
Computer simulation is a powerful tool for deriving observable data from hypotheses and thus for evaluating hypotheses that cannot be verified directly. However, existing simulation techniques rely heavily on human understanding of the mechanisms of the system under study, which can be suboptimal and inefficient. Artificial intelligence systems can improve the accuracy and efficiency of computer simulations by better fitting key parameters of complex systems, solving the differential equations that govern them, and modeling the distribution of their states.
When studying complex systems, scientists often create a model that involves a parametric form, which requires domain knowledge to determine the initial symbolic expressions for the parameters. An example of this is a molecular force field, which is interpretable but has limited ability to represent various functions and requires strong inductive bias or scientific knowledge to generate. To improve the accuracy of molecular simulations, an artificial intelligence-based neural potential that fits expensive but accurate quantum mechanical data has been developed to replace traditional force fields.
In addition, uncertainty quantification has been used to localize energy barriers on high-dimensional free-energy surfaces, improving the efficiency of molecular dynamics. For coarse-grained molecular dynamics, AI models have been used to determine how much a system can be coarsened based on learned latent structure, reducing the computational cost for large systems. In quantum physics, neural networks, with their flexibility and ability to fit data accurately, have replaced manually derived symbolic forms in the parameterization of wave functions or density functionals.
Differential equations are essential for modeling the spatio-temporal dynamics of complex systems. Neural solvers based on artificial intelligence integrate data and physics more seamlessly than numerical algebraic solvers. These neural solvers combine physics with the flexibility of deep learning to build neural networks on domain knowledge.
Artificial intelligence methods have been applied to solving differential equations in many fields, including computational fluid dynamics, predicting the structure of glassy systems, solving stiff chemical kinetics problems, and solving the eikonal equation to characterize seismic wave travel times. In dynamics modeling, continuous time can be modeled with neural ordinary differential equations. Neural networks can parameterize solutions of the Navier-Stokes equations over the space-time domain using a physics-informed loss. However, standard convolutional neural networks are limited in their ability to model the fine structure of solutions; this can be addressed by learning operators that use neural networks to model mappings between functions. In addition, solvers must adapt to different domains and boundary conditions, which can be achieved by combining neural differential equations with graph neural networks, enabling arbitrary discretizations through graph partitioning.
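A minimal physics-informed sketch, assuming PyTorch and a toy ODE rather than a full PDE: the network's autograd derivative is penalized for violating du/dx = -u with u(0) = 1 (exact solution exp(-x)), so no solution data are needed.

```python
import torch

# Physics-informed training: the loss enforces the differential equation on
# random collocation points plus a boundary-condition term.
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
                          torch.nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(2000):
    x = torch.rand(128, 1, requires_grad=True)        # collocation points
    u = net(x)
    du_dx, = torch.autograd.grad(u.sum(), x, create_graph=True)
    residual = du_dx + u                              # du/dx + u = 0
    x0 = torch.zeros(1, 1)
    loss = (residual ** 2).mean() + (net(x0) - 1.0).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```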
Statistical modeling provides a comprehensive quantitative description of complex systems by modeling the distribution of their states. Thanks to its ability to capture highly complex distributions, deep generative modeling has recently become an important approach for simulating complex systems. A well-known example is the Boltzmann generator, based on normalizing flows. A normalizing flow maps any complex distribution to a prior distribution (e.g., a simple Gaussian) and back through a sequence of invertible neural network layers. While computationally expensive (often requiring hundreds or thousands of neural layers), normalizing flows provide an exact density function that enables both sampling and training.
Unlike traditional simulations, normalizing flows can sample directly from the prior distribution and apply the (computationally expensive but fixed) trained network to generate equilibrium states. This enhances sampling in lattice field theory and improves Markov chain Monte Carlo methods, which may otherwise fail to converge because of poor mixing between modes.
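For intuition, here is a sketch of one affine coupling layer of the kind normalizing flows stack, assuming PyTorch: the transform is invertible by construction, and its log-Jacobian determinant is cheap, which is what makes exact densities available.

```python
import torch
from torch import nn

class AffineCoupling(nn.Module):
    """One invertible coupling layer: half the coordinates pass through
    unchanged and parameterize an affine transform of the other half, so both
    the inverse and the log-determinant of the Jacobian are cheap."""
    def __init__(self, dim: int):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(log_s) + t
        log_det = log_s.sum(dim=1)        # exact change-of-density term
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=1)
        return torch.cat([y1, (y2 - t) * torch.exp(-log_s)], dim=1)

# Invertibility check: decoding recovers the input exactly (up to float error).
layer = AffineCoupling(dim=4)
x = torch.randn(8, 4)
y, log_det = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-5)
```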
Leveraging scientific data requires the use of simulations and human expertise to build and use models. This integration opens up opportunities for scientific discovery. However, significant advances in theory, methodology, software and hardware infrastructure are needed to further increase the impact of AI across scientific disciplines.
Interdisciplinary collaboration is essential to realize a comprehensive and practical approach to advancing science through AI.
1) Practical Considerations
Scientific datasets are often not directly usable for AI analysis because of limitations in measurement techniques that produce incomplete datasets, biased or conflicting readings, and limited accessibility of datasets due to privacy and security concerns.
In addition, federated learning and cryptographic algorithms can be used to prevent the release of sensitive data of high commercial value into the public domain. Open scientific literature, natural language processing, and knowledge-graph technologies can facilitate literature mining, supporting materials discovery, chemical synthesis, and therapeutic science.
Deep learning opens up opportunities for AI-driven design, discovery, and evaluation in the loop, but it also presents complex engineering challenges. To automate scientific workflows, optimize large-scale simulation codes, and operate instruments, autonomous robotic control can act on model predictions to perform experiments on high-throughput synthesis and testing lines, creating self-driving laboratories. Early applications of generative modeling in materials discovery suggest that millions of possible materials with desired properties and functions can be identified and assessed for synthesizability. In chemical synthesis, AI optimizes candidate synthetic routes, and a robot then directs the chemical reactions according to the predicted route.
The actual implementation of an AI system involves complex software and hardware engineering and requires a series of interdependent steps: from data organization and processing to algorithm implementation and user and application interface design. Nuances in the implementation process can lead to dramatic changes in performance and affect the success of integrating AI models into scientific practice.
Therefore, standardization of data and models needs to be considered. AI methods can suffer from poor reproducibility owing to the stochastic nature of model training, variations in model parameters, and constantly changing training datasets, all of which are data- and task-dependent. Standardized benchmarks and experimental designs can mitigate these issues; another route to better reproducibility is open-source initiatives that release open models, datasets, and educational programs.
2) Algorithmic Innovation
Algorithmic innovation is needed to facilitate scientific understanding, or to acquire it autonomously, and to create a foundational ecosystem that applies the most appropriate algorithms throughout the scientific process.
Although many laws of science are not universal, they are generally widely applicable, and the human brain can generalize to modified environments better and faster than state-of-the-art AI. An attractive hypothesis is that this is because humans build not merely a statistical model of observations but a causal model, that is, a family of statistical models indexed by all possible interventions (e.g., a different initial state, an agent's action, or a different regime). Incorporating causality into AI is still a young field, and much work remains. Techniques such as self-supervised learning hold great promise for scientific problems because they can exploit large amounts of unlabeled data and transfer knowledge to low-data settings. However, current transfer learning schemes can be ad hoc, lack theoretical guidance, and are vulnerable to shifts in the underlying distribution. Although initial attempts have addressed this challenge, more work is needed to systematically measure transferability across domains and to prevent negative transfer.
Furthermore, to address the challenges that concern scientists, the development and evaluation of AI methods must be conducted in real-world scenarios, such as plausible realizable synthetic pathways in drug design, and include well-calibrated uncertainty estimators to assess the reliability of the models before transitioning them to real-world implementations.
Scientific data is multimodal and includes images (e.g., black hole images in cosmology), natural language (e.g., scientific literature), time series (e.g., thermal yellowing of materials), sequences (e.g., biological sequences), graphs (e.g., complex systems), and structures (e.g., 3D protein-ligand conformations). For example, in high-energy physics, jets are collimated sprays of particles produced by quarks and gluons at high energies; identifying their substructure from radiation patterns helps in the search for new physics. Jet substructure can be described with images, sequences, binary trees, general graphs, and sets. While using neural networks to process images has been studied extensively, processing particle images alone is not sufficient; likewise, no single representation of jet substructure provides a holistic, integrated view of a complex system. While integrating multimodal observations remains a challenge, the modular nature of neural networks means that different neural modules can transform different data modalities into a common vector representation.
Scientific knowledge, such as rotational invariance in molecules, equality constraints in mathematics, disease mechanisms in biology, and multiscale structure in complex systems, can be incorporated into AI models. However, it is unclear which principles and knowledge are most helpful and practical. Because AI models require large amounts of data to fit, incorporating scientific knowledge can aid learning when datasets are small or sparsely annotated. Research must therefore establish principled methods for incorporating knowledge into AI models and clarify the trade-offs between domain knowledge and learning from measured data.
AI methods often operate as black boxes, meaning users cannot fully explain how outputs are produced and which inputs are critical to generating them. Black-box models reduce user trust in predictions and have limited applicability in domains where model outputs must be understood before being applied in practice, such as human space exploration, or where predictions inform policy, such as climate science. Despite the proliferation of interpretability techniques, transparent deep learning models remain elusive. However, the human brain's ability to synthesize high-level explanations that can persuade others, even if imperfectly, gives hope that by modeling phenomena at a similarly high level of abstraction, future AI models will offer explanations at least as valuable as those provided by the human brain. It also suggests that the study of higher-level cognition may inspire future deep learning models that combine current capabilities with verbalizable abstraction, causal reasoning, and out-of-distribution generalization.
3) Scientific Behavior and Scientific Careers
Looking ahead, the demand for AI expertise will be influenced by two forces.
First, there are the imminent benefits of applying AI technology, for example in self-driving labs.
Second, intelligent tools can advance the state of the art and create new opportunities: for example, examining biological, chemical, or physical processes that occur on length and time scales beyond the reach of experiments.
Building on these two forces, we expect the composition of research teams to change to include AI experts, software and hardware engineers, and new forms of collaboration involving all levels of government, educational institutions, and businesses.
However, the amount of computation and data required to train these models is enormous. As a result, large technology companies have invested heavily in computing infrastructure and cloud services. While for-profit and non-academic organizations have access to large computing infrastructures, institutions of higher education can better integrate multiple disciplines. In addition, academic institutions often hold unique historical databases and measurement technologies that may not exist elsewhere but are necessary for AI4Science. These complementary assets foster new models of industry-academia collaboration, which in turn influence the choice of research questions.
As AI systems approach or exceed human performance, it is becoming feasible to use them as an alternative to routine laboratory work. This approach allows researchers to iteratively develop predictive models based on experimental data and select experiments to improve the models without having to manually perform laborious, repetitive tasks. To support this paradigm shift, educational programs that train scientists to design, implement, and improve laboratory work are emerging. These programs help scientists understand when it is appropriate to use AI and prevent misinterpretation of the conclusions drawn from AI analyses.
Misuse of AI tools and misinterpretation of their results can have significant negative impacts, and AI misuse is not just a technical issue: it also depends on the motivations of those leading AI innovation and investing in AI implementation. It is critical to establish ethical review processes and responsible implementation strategies. It is also important to consider the security risks associated with AI: because algorithms can be adapted to a wide range of applications, methods developed for one purpose may be repurposed for another, creating security risks.
Making full use of scientific data increasingly requires AI. Looking ahead, AI has the potential to unlock scientific discoveries that were previously out of reach.
Sources:
[1] https://www.nature.com/articles/s41586-021-03819-2
[2] https://www.nature.com/articles/s41586-019-1335-8#Fig1
[3] https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.120.143001
[4] https://www.nature.com/articles/s41586-023-06221-2#Fig2