a map of what we work on, organised around our four research threads — perception, reasoning, evaluation, action — plus the community work around them. the columns are the threads and each row is a paper; a dot marks each thread a paper belongs to, and a line links the threads a paper spans. hover to peek, click to jump to the paper, and filter the list below by type or thread.
- perception — understanding chemistry from what we actually measure — spectra and characterization data, and the knowledge buried in the literature.
- reasoning — combining formal rules with the tacit heuristics expert chemists use, so models reason rather than pattern-match.
- evaluation — honestly measuring what models and agents can really do — and where they only look capable.
- action — predictions experimentalists can act on, starting from the recipes and processing conditions they control.
- community — perspectives, education, and community-building around AI for chemistry.
Agentic AI Scientists Are Not Built For Autonomous Scientific Discovery
A growing body of work pursues AI scientists capable of end-to-end autonomous scientific discovery. This position paper argues that although they already function as co-scientists, agentic AI scientists are not built for autonomous scientific discovery. We identify the following challenges in building and deploying autonomous AI scientists: (1) Problem selection is influenced by the McNamara fallacy; (2) Agents are built on large language models (LLMs) whose training corpora omit tacit procedural and failure knowledge of laboratory practice; (3) Preference optimisation during post-training compresses output diversity toward consensus; and (4) Most scientific benchmarks measure single-turn prediction accuracy and lack feedback from physical experiments back to the computational model. These challenges are not just questions of scale and scaffolding; they require revisiting fundamental design choices. To build truly autonomous AI scientists, we recommend the use of scientific simulations as verifiers for training, the design of persistent world models that represent the shifting objectives governing real investigations, the establishment of a centralized preregistration repository for all AI-generated hypotheses, and application driven by scientific need rather than tool affordance.
AI scientists produce results without reasoning scientifically
Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning patterns appear whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. These patterns persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.
also: Science News
An autonomous living database for perovskite photovoltaics
Scientific discovery is severely bottlenecked by the inability of manual curation to keep pace with exponential publication rates. This creates a widening knowledge gap. This is especially stark in photovoltaics, where the leading database for perovskite solar cells has been stagnant since 2021 despite massive ongoing research output. Here, we resolve this challenge by establishing an autonomous, self-updating living database (PERLA). Our pipeline integrates large language models with physics-aware validation to extract complex device data from the continuous literature stream, achieving human-level precision (>90%) and eliminating annotator variance. By employing this system on the previously inaccessible post-2021 literature, we uncover critical evolutionary trends hidden by data lag: the field has decisively shifted toward inverted architectures employing self-assembled monolayers and formamidinium-rich compositions, driving a clear trajectory of sustained voltage loss reduction. PERLA transforms static publications into dynamic knowledge resources that enable data-driven discovery to operate at the speed of publication.
Beyond Learning on Molecules by Weakly Supervising on Molecules
Molecular representations are inherently task-dependent, yet most pre-trained molecular encoders are not. Task conditioning promises representations that reorganize based on task descriptions, but existing approaches rely on expensive labeled data. We show that weak supervision on programmatically derived molecular motifs is sufficient. Our Adaptive Chemical Embedding Model (ACE-Mol) learns from hundreds of motifs paired with natural language descriptors that are cheap to compute and trivial to scale. Conventional encoders slowly search the embedding space for task-relevant structure, whereas ACE-Mol immediately aligns its representations with the task. ACE-Mol achieves state-of-the-art performance across molecular property prediction benchmarks with interpretable, chemically meaningful representations.
Clever Materials: When Models Identify Good Materials for the Wrong Reasons
Machine learning can accelerate materials discovery. Models perform impressively on many benchmarks. However, strong benchmark performance does not imply that a model learned chemistry. I test a concrete alternative hypothesis: that property prediction can be driven by bibliographic confounding. Across five tasks spanning MOFs (thermal and solvent stability), perovskite solar cells (efficiency), batteries (capacity), and TADF emitters (emission wavelength), models trained on standard chemical descriptors predict author, journal, and publication year well above chance. When these predicted metadata ("bibliographic fingerprints") are used as the sole input to a second model, performance is sometimes competitive with conventional descriptor-based predictors. These results show that many datasets do not rule out non-chemical explanations of success. Progress requires routine falsification tests (e.g., group/time splits and metadata ablations), datasets designed to resist spurious correlations, and explicit separation of two goals: predictive utility versus evidence of chemical understanding.
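A group split, one of the falsification tests named in the abstract above, can be sketched as follows. This is a minimal illustration and not the paper's code: the function name `group_split`, the placeholder paper identifiers, and the choice to hold out a fraction of groups (rather than a fraction of records) are all assumptions made for the example.

```python
import random
from typing import Hashable, List, Sequence, Tuple


def group_split(
    groups: Sequence[Hashable], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split record indices so that no group appears in both train and test.

    `groups[i]` is the group label of record i (hypothetically: the paper,
    author, or journal a data point was extracted from).
    """
    unique = sorted(set(groups), key=str)  # deterministic order before shuffling
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out a fraction of the *groups*, at least one.
    n_test = max(1, round(test_fraction * len(unique)))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical records, each tagged with the paper it came from.
paper_ids = ["doi-A", "doi-A", "doi-B", "doi-C", "doi-C", "doi-C", "doi-D", "doi-E"]
train_idx, test_idx = group_split(paper_ids, test_fraction=0.4)

train_groups = {paper_ids[i] for i in train_idx}
test_groups = {paper_ids[i] for i in test_idx}
print("shared groups:", train_groups & test_groups)
```

If a model scores far worse under such a split than under a random split, that gap hints that it may be exploiting group identity rather than chemistry.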
Condition-aware prediction of copolymer architecture
The architecture of a copolymer determines how the material behaves. Designing a polymer for a target property, therefore, should begin with designing for a target architecture. Yet no method predicts the architecture a given monomer pair will form under given conditions. Even though reactivity ratios have been known since Mayo and Lewis to depend on solvent, temperature, and polymerization mechanism, every predictive approach, from the Q–e scheme of Alfrey and Price to recent machine-learning models, takes monomer structure alone as input. Measurements for thousands of monomer pairs exist but are scattered across eight decades of literature, much of it predating machine-readable publishing and surviving only as scanned page images. We address this gap with PolyCARP, a classifier that predicts copolymer architecture from monomer pair and reaction conditions before any synthesis, trained on a database we extracted from the literature. A vision–language-model pipeline parses 1,206 publications, yielding 3,791 copolymerizations annotated with reactivity ratios, solvent, temperature, and polymerization mechanism. This, to our knowledge, is the first dataset at this scale to record reaction conditions per entry. For any monomer pair and set of conditions, PolyCARP returns an architecture class and the nearest experimental analogues in the database. We validate the tool against a literature case study, where it captures solvent-driven architectural transitions, and against three prospective laboratory copolymerizations. The database, model, and code are openly released, enabling the community to extend both data and models, and to make more of polymer chemistry predictable.
End-to-end multimodal structure elucidation from raw spectra combining contrastive learning and evolutionary algorithms
Elucidating molecular structures from spectroscopic data remains one of chemistry’s most fundamental challenges, typically requiring extensive expert knowledge and manual interpretation of multiple analytical techniques. This is because the structure elucidation problem often has degenerate solutions for a limited set of experimental data. Existing computational approaches are limited to single spectroscopic modalities, require extensive manual preprocessing, and lack the confidence estimates and context necessary for practical application. Here we present a framework that combines contrastive learning with evolutionary algorithms to automate structure elucidation directly from raw, multimodal spectroscopic data. By aligning embeddings across NMR, infrared, and mass spectrometry, the framework mimics how experts use multiple spectroscopic lenses while providing calibrated confidence scores and relevant database context. On challenging molecular identification tasks, the framework matches expert chemist performance in head-to-head comparisons in a pilot study. The system successfully identifies incorrect structure assignments in published literature and adapts to new chemical domains without retraining by updating its reference database. Our approach demonstrates how synergistic combination of machine learning paradigms can solve analytical bottlenecks that have constrained chemical discovery.
General-Purpose Models for the Chemical Sciences: LLMs and Beyond
Data-driven techniques have a large potential to transform and accelerate the chemical sciences. However, chemical sciences also pose the unique challenge of very diverse, small, fuzzy data sets that are difficult to leverage in conventional machine learning approaches. A new class of models, which can be summarized under the term general-purpose models (GPMs) such as large language models, has shown the ability to solve tasks they have not been directly trained on, and to flexibly operate with low amounts of data in different formats. In this review, we discuss the fundamental building principles of GPMs and review recent and emerging applications of those models in the chemical sciences. While many of these applications are still in the prototype phase, we expect that the increasing interest in GPMs will make many of them mature in the coming years.
Predicting acetalated dextran nanoparticle features: Controlled synthesis, formulation, and testing in a high-throughput process
The hydrophobic, pH-labile, and biodegradable acetalated dextran (Ac(e)Dex) is a promising material for drug delivery in nanomedicine. However, fundamental knowledge about the structure-property relationships is still missing, which hinders its application in preclinical and clinical trials. In this study, we synthesized a library of 36 Ac(e)Dex derivatives with different molar masses, types, and degrees of functionalization. A high-throughput formulation screening (> 1000 formulations) was conducted using a liquid handling robot optimizing the concentration of polymer, the solvent, and the addition of additives. Selected formulations were scaled up and evaluated for their stability. To further correlate polymer properties with stability, a machine learning (ML) model was developed, providing a predictive tool for Ac(e)Dex nanoparticle degradation based on synthesis/formulation data. The novelty of this work lies in the integrated synthesis-to-prediction pipeline combining controlled polymer synthesis, high-throughput formulation, and ML-based stability modeling, rather than introducing new chemical mechanisms. By elucidating how structural parameters (molar mass, type, and degree of functionalization) influence formulation properties (i.e., size, dispersity, repeatability) and particle stability, this work enables standardized comparisons of Ac(e)Dex between different studies and supports its future preclinical development.
Reducing cross-sample prediction churn in scientific machine learning
Scientific machine learning reports predictive performance. It does not report whether the same prediction would survive a different draw of training data. Across 9 chemistry benchmarks, two classifiers trained on independent bootstraps of the same training set agree on aggregate accuracy to within 1.3–4.2 percentage points but disagree on the class label of 8.0–21.8% of test molecules. We call this gap cross-sample prediction churn. The standard parameter-side techniques (deep ensembles, MC dropout, stochastic weight averaging) do not reduce this gap; two data-side methods do. The first is K-bootstrap bagging, which cuts the rate 40–54% on every dataset at no accuracy cost (K×-ERM compute). The second is twin-bootstrap, our proposal: two networks trained jointly on independent bootstraps with a sym-KL consistency loss between their predictions, which at matched 2×-ERM compute reduces churn a further median 45% beyond bagging-K=2. Cross-sample prediction churn deserves a column alongside predictive performance in scientific-ML benchmark reports, because without it the parameter-side and data-side methods are indistinguishable on the metric they actually differ on.
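The gap between agreement on accuracy and agreement on individual labels can be made concrete with a toy calculation. The sketch below is an illustration only and not the paper's code: the function names and the hand-written labels are assumptions, and churn is taken here simply as the fraction of test items on which two models predict different classes.

```python
from typing import Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def prediction_churn(preds_a: Sequence[int], preds_b: Sequence[int]) -> float:
    """Fraction of test items on which two models disagree on the label."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Hypothetical test labels and predictions from two models that were
# (notionally) trained on different bootstraps of the same training set.
labels = [0, 1, 1, 0, 1, 0, 1, 0, 1, 0]
model_a = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]  # wrong on the last two items
model_b = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # wrong on the first two items

acc_a = accuracy(model_a, labels)
acc_b = accuracy(model_b, labels)
churn = prediction_churn(model_a, model_b)
print(f"accuracy A={acc_a:.2f}, B={acc_b:.2f}, churn={churn:.2f}")
```

In this toy case both models have identical accuracy, yet they disagree on four of the ten items, which is the kind of difference an accuracy column alone cannot show.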
Semantic Content Determines Algorithmic Performance
Counting should not depend on what is being counted; more generally, any algorithm's behavior should be invariant to the semantic content of its arguments. We introduce WhatCounts to test this property in isolation. Unlike prior work that conflates semantic sensitivity with reasoning complexity or prompt variation, WhatCounts is atomic: count items in an unambiguous, delimited list with no duplicates, distractors, or reasoning steps for different semantic types. Frontier LLMs show over 40% accuracy variation depending solely on what is being counted - cities versus chemicals, names versus symbols. Controlled ablations rule out confounds. The gap is semantic, and it shifts unpredictably with small amounts of unrelated fine-tuning. LLMs do not implement algorithms; they approximate them, and the approximation is argument-dependent. As we show with an agentic example, this has implications beyond counting: any LLM function may carry hidden dependencies on the meaning of its inputs.
A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists
Large language models (LLMs) have gained widespread interest owing to their ability to process human language and perform tasks on which they have not been explicitly trained. However, we possess only a limited systematic understanding of the chemical capabilities of LLMs, which would be required to improve models and mitigate potential harm. Here we introduce ChemBench, an automated framework for evaluating the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of chemists. We curated more than 2,700 question–answer pairs, evaluated leading open- and closed-source LLMs and found that the best models, on average, outperformed the best human chemists in our study. However, the models struggle with some basic tasks and provide overconfident predictions. These findings reveal LLMs’ impressive chemical capabilities while emphasizing the need for further research to improve their safety and usefulness. They also suggest adapting chemistry education and show the value of benchmarking frameworks for evaluating LLMs in specific domains.
Are large language models superhuman chemists?
The findings reveal that leading LLMs outperform expert chemists in chemical knowledge and reasoning across diverse topics, while highlighting critical limitations that require further development.
Assessment of fine-tuned large language models for real-world chemistry and material science applications
We studied the performance of fine-tuning open-source LLMs for a range of different chemical questions. We benchmark their performances against “traditional” machine learning models and find that, in most cases, the fine-tuning approach is superior.
ChemPile: A 250GB Diverse and Curated Dataset for Chemical Foundation Models
Foundation models have shown remarkable success across scientific domains, yet their impact in chemistry remains limited due to the absence of diverse, large-scale, high-quality datasets that reflect the field's multifaceted nature. We present the ChemPile, an open dataset containing over 75 billion tokens of curated chemical data, specifically built for training and evaluating general-purpose models in the chemical sciences. The dataset mirrors the human learning journey through chemistry -- from educational foundations to specialized expertise -- spanning multiple modalities and content types including structured data in diverse chemical representations (SMILES, SELFIES, IUPAC names, InChI, molecular renderings), scientific and educational text, executable code, and chemical images. ChemPile integrates foundational knowledge (textbooks, lecture notes), specialized expertise (scientific articles and language-interfaced data), visual understanding (molecular structures, diagrams), and advanced reasoning (problem-solving traces and code) -- mirroring how human chemists develop expertise through diverse learning materials and experiences. Constructed through hundreds of hours of expert curation, the ChemPile captures both foundational concepts and domain-specific complexity. We provide standardized training, validation, and test splits, enabling robust benchmarking. ChemPile is openly released via HuggingFace with a consistent API, permissive license, and detailed documentation. We hope the ChemPile will serve as a catalyst for chemical AI, enabling the development of the next generation of chemical foundation models.
From text to insight: large language models for chemical data extraction
Large language models (LLMs) allow for the extraction of structured data from unstructured sources, such as scientific papers, with unprecedented accuracy and performance.
also: matextract.pub
MOFChecker: a package for validating and correcting metal–organic framework (MOF) structures
MOFChecker, a package for MOF duplicate detection, geometric and charge error checking, and structure correction.
Perspective on artificial intelligence for accelerated materials design (AI4Mat) workshops in 2024
The intersection of artificial intelligence and materials science has become increasingly interconnected, driving ambitious research initiatives across both fields. Since 2022, the AI for accelerated materials design (AI4Mat) workshops have provided a leading venue for showcasing cutting-edge advances in this emerging interdisciplinary domain while fostering critical discussions about the most pressing scientific and technical challenges. In 2024, AI4Mat hosted workshops at BOKU University and NeurIPS 2024, attracting researchers and practitioners from academia, industry, and government institutions worldwide. These workshops explored diverse research areas currently shaping the field, with participants engaging in comprehensive discussions that addressed the intersection’s most significant challenges from scientific, technical, and commercial perspectives. Through this holistic approach, AI4Mat’s 2024 workshops successfully illuminated the multifaceted nature of AI-driven materials research, highlighting both current achievements and future opportunities in this rapidly evolving field. In this article, the AI4Mat-2024 organizing committee presents key insights from our workshops and community discussions, outlining critical challenges in this emerging field while summarizing the latest advances in AI-accelerated materials design. We examine persistent challenges around data creation and reproducibility, alongside the growing commercial interest in developing new markets and optimizing materials production processes at scale. The article also highlights significant research breakthroughs showcased at AI4Mat, including the application of large language models to accelerate materials science tasks, the development of sophisticated generative models for materials discovery, and the growing demand for interpretable AI methodologies that provide transparent insights into materials behavior.
PolyMetriX: an ecosystem for digital polymer chemistry
Digital polymer chemistry leverages computational methods to design and optimize polymer materials. While there have been advances in using machine learning to accelerate the design of polymers, the field is hampered by the lack of standards, which precludes comparability and makes it difficult to build on top of prior work. To address this gap, we introduce PolyMetriX, an open-source Python library designed to facilitate the entire polymer informatics workflow—from obtaining data to training models. PolyMetriX provides curated polymer property datasets, and novel featurization techniques that extract hierarchical structural information at the full polymer, backbone, and sidechain levels. Additionally, it incorporates polymer-specific data splitting strategies to ensure robust model generalization. PolyMetriX enhances the predictive performance of models while improving reproducibility in digital polymer chemistry.
Probing the limitations of multimodal language models for chemistry and materials research
Recent advancements in artificial intelligence have sparked interest in scientific assistants that could support researchers across the full spectrum of scientific workflows, from literature review to experimental design and data analysis. A key capability for such systems is the ability to process and reason about scientific information in both visual and textual forms—from interpreting spectroscopic data to understanding laboratory set-ups. Here we introduce MaCBench, a comprehensive benchmark for evaluating how vision language models handle real-world chemistry and materials science tasks across three core aspects: data extraction, experimental execution and results interpretation. Through a systematic evaluation of leading models, we find that although these systems show promising capabilities in basic perception tasks—achieving near-perfect performance in equipment identification and standardized data extraction—they exhibit fundamental limitations in spatial reasoning, cross-modal information synthesis and multi-step logical inference. Our insights have implications beyond chemistry and materials science, suggesting that developing reliable multimodal AI scientific assistants may require advances in curating suitable training data and approaches to training those models.
also: research briefing
Real AI advances require collaboration
This work proposes a framework for collaboration between artificial intelligence and domain experts to genuinely accelerate discovery and finds that current practices tend to reward incremental improvements in benchmark performance.
Vision language models excel at perception but struggle with scientific reasoning
Evaluation of leading VLMs reveals that they excel at basic scientific tasks such as equipment identification, but struggle with spatial reasoning and multistep analysis — a limitation for autonomous scientific discovery.
Less can be more for predicting properties with large language models
Predicting properties from coordinate-category data -- sets of vectors paired with categorical information -- is fundamental to computational science. In materials science, this challenge manifests as predicting properties like formation energies or elastic moduli from crystal structures comprising atomic positions (vectors) and element types (categorical information). While large language models (LLMs) have increasingly been applied to such tasks, with researchers encoding structural data as text, optimal strategies for achieving reliable predictions remain elusive. Here, we report fundamental limitations in LLMs' ability to learn from coordinate information in coordinate-category data. Through systematic experiments using synthetic datasets with tunable coordinate and category contributions, combined with a comprehensive benchmarking framework (MatText) spanning multiple representations and model scales, we find that LLMs consistently fail to capture coordinate information while excelling at category patterns. This geometric blindness persists regardless of model size (up to 70B parameters), dataset scale (up to 2M structures), or text representation strategy. Our findings suggest immediate practical implications: for materials property prediction tasks dominated by structural effects, specialized geometric architectures consistently outperform LLMs by significant margins, as evidenced by a clear "GNN-LM wall" in performance benchmarks. Based on our analysis, we provide concrete guidelines for architecture selection in scientific machine learning, while highlighting the critical importance of understanding model inductive biases when tackling scientific prediction problems.
Leveraging large language models for predictive chemistry
Machine learning has transformed many fields and has recently found applications in chemistry and materials science. The small datasets commonly found in chemistry sparked the development of sophisticated machine learning approaches that incorporate chemical knowledge for each application and, therefore, require specialized expertise to develop. Here we show that GPT-3, a large language model trained on vast amounts of text extracted from the Internet, can easily be adapted to solve various tasks in chemistry and materials science by fine-tuning it to answer chemical questions in natural language with the correct answer. We compared this approach with dedicated machine learning models for many applications spanning the properties of molecules and materials to the yield of chemical reactions. Surprisingly, our fine-tuned version of GPT-3 can perform comparably to or even outperform conventional machine learning techniques, in particular in the low-data limit. In addition, we can perform inverse design by simply inverting the questions. The ease of use and high performance, especially for small datasets, can impact the fundamental approach to using machine learning in the chemical and material sciences. In addition to a literature search, querying a pre-trained large language model might become a routine way to bootstrap a project by leveraging the collective knowledge encoded in these foundation models, or to provide a baseline for predictive tasks.
also: MIT Technology Review · Nature News
14 examples of how LLMs can transform materials science and chemistry: a reflection on a large language model hackathon
We report the findings of a hackathon focused on exploring the diverse applications of large language models in molecular and materials science.
metadata from CrossRef, Semantic Scholar & arXiv · citation counts from Semantic Scholar · media links curated by us