This list is not exhaustive; it is only meant to provide starting points for the collaborative development of specific project ideas.

Project ideas

| Project | Lead/Supervisor | Description |
| --- | --- | --- |
| Mechanism parsing & LIFT training | Martiño Ríos-García | Parse mechanism explanations for common reactions from books and webpages, then train using language-interfaced finetuning (LIFT) to potentially improve the performance of models like FlowER. |
| Organic data extraction | Martiño Ríos-García | Multiple directions are possible: extract all protocols from OrgSyn (high-value data), or tackle other organic-chemistry extraction tasks. |
| Flexible polymer data extraction | Mara Schilling-Wilhelmi | Focus on flexible extraction pipelines rather than on improving extraction quality, leveraging the strong capabilities of recent models (e.g., Gemini). |
| ChemPile subset & model training | Adrian Mirza | Create or contribute to building a curated subset of ChemPile and train a model on it. |
| IUPAC-SMILES bootstrapping | Adrian Mirza | Bootstrap IUPAC-to-SMILES conversion (and potentially NMR bootstrapping) using verifiable rewards. |
| LLM agent for data schema updates | | When we extract data, we may need to adjust the data schema or synonym maps on the fly. LLM-based agents should be able to implement this. |
| Finetuning of open models for data extraction | | We now have many results from extraction pipelines. Based on those, we should be able to perform distillation and obtain our "own" open models for data extraction. |
| Tooling to count electrons in inorganic compounds | Sadra Aghajani | For various downstream tasks (e.g., testing and applying chemical heuristics), it would be useful to have tooling that can reliably count electrons in inorganic compounds. |
| LLM-based materials descriptions as GNN node embeddings | | In some cases we only have "fuzzy" material descriptions. LLMs can provide embeddings for those, which we can then use in GNNs. |
| ParserBench | | Writing file-conversion tools is one of the most promising applications for making research-data management scale. To develop such systems, we need a reliable benchmark for parser-writing performance. |
| Text-to-DSL | | Tune models to formalize their reasoning in a domain-specific language that can be verified and scaled more formally. |
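For the IUPAC-SMILES bootstrapping idea, a "verifiable reward" could be as simple as a deterministic scoring function over model outputs. The sketch below is an assumption about how such a reward might look, not an agreed design; all function names are hypothetical, and the raw string comparison is a placeholder for canonical-SMILES comparison via RDKit.

```python
# Minimal sketch of a verifiable reward for IUPAC-to-SMILES bootstrapping.
# A real implementation would canonicalize both strings with RDKit
# (Chem.MolToSmiles(Chem.MolFromSmiles(s))) instead of comparing raw text,
# so that equivalent notations such as "OCC" and "CCO" also score 1.0.

def smiles_reward(predicted: str, reference: str) -> float:
    """Return 1.0 if the predicted SMILES matches the reference, else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0

def batch_rewards(pairs):
    """Score a batch of (prediction, reference) pairs, e.g. for RL updates."""
    return [smiles_reward(p, r) for p, r in pairs]
```

Because the reward is computed, not annotated, the same translation pairs can be regenerated and rescored in every bootstrapping round at no labeling cost.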
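The schema-update agent idea splits naturally into a proposal step (the LLM) and an apply step (deterministic code). The helper below sketches only the apply step; the patch format (`{"add": ..., "remove": ...}`) is an assumption made for this sketch.

```python
# Sketch of the "apply" side of an LLM schema-update agent: the agent
# proposes synonym-map patches as plain dicts, and this helper validates
# and merges them so the extraction pipeline never ingests a malformed map.
# The patch format ({"add": {...}, "remove": [...]}) is a hypothetical choice.

def apply_synonym_patch(synonyms: dict, patch: dict) -> dict:
    """Return a new synonym map with the patch applied, or raise ValueError."""
    updated = dict(synonyms)  # never mutate the live map in place
    for alias, canonical in patch.get("add", {}).items():
        if alias in updated and updated[alias] != canonical:
            raise ValueError(f"conflicting mapping for {alias!r}")
        updated[alias] = canonical
    for alias in patch.get("remove", []):
        updated.pop(alias, None)
    return updated
```

Keeping the merge logic outside the agent means a bad LLM proposal fails loudly at validation time instead of silently corrupting the synonym map.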
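The text-to-DSL idea rests on one property: anything a model states in the DSL can be re-checked by a deterministic evaluator. The toy prefix language below is invented purely for illustration; an actual project would define a richer, domain-aware DSL.

```python
# Tiny illustration of why a DSL makes reasoning verifiable: if a model
# emits its arithmetic steps in a prefix mini-language instead of prose,
# an evaluator can confirm each step. The syntax here is hypothetical.

def eval_dsl(expr: str) -> float:
    """Evaluate a prefix expression like '(add 2 (mul 3 4))'."""
    tokens = expr.replace("(", " ( ").replace(")", " ) ").split()

    def parse(pos):
        if tokens[pos] == "(":
            op, pos = tokens[pos + 1], pos + 2
            args = []
            while tokens[pos] != ")":
                value, pos = parse(pos)
                args.append(value)
            fn = {"add": sum,
                  "mul": lambda xs: xs[0] * xs[1],
                  "sub": lambda xs: xs[0] - xs[1]}[op]
            return fn(args), pos + 1  # skip the closing ")"
        return float(tokens[pos]), pos + 1

    return parse(0)[0]
```

A claimed result that disagrees with `eval_dsl` can be rejected automatically, which is exactly the kind of verifier a formalized-reasoning training loop needs.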