this list is not exhaustive and is only meant to provide starting points for the collaborative development of specific project ideas.
project ideas
| project | lead/supervisor | description |
|---|---|---|
| mechanism parsing & lift training | martiño ríos-garcía | parse mechanism explanations from books/webpages for common reactions, then train using language-interfaced fine-tuning (lift) to potentially improve the performance of models like flower |
| organic data extraction | martiño ríos-garcía | multiple directions possible: extract all protocols from orgsyn (high-value data), or other organic chemistry extraction tasks |
| flexible polymer data extraction | mara schilling-wilhelmi | focus on flexible extraction pipelines rather than extraction quality improvement, leveraging strong capabilities of recent models (e.g., gemini) |
| chempile subset & model training | adrian mirza | create or contribute to building a curated subset of chempile and train a model on it |
| iupac-smiles bootstrapping | adrian mirza | bootstrapping iupac-smiles conversion (and potentially nmr bootstrapping) using verifiable rewards |
| llm-agent for data schema updates | | if we extract data, we might need to adjust the data schema on the fly or adjust synonym maps. llm-based agents should be able to implement this. |
| finetuning of open models for data extraction | | we now have many results from extraction pipelines. based on those, we should be able to perform distillation to obtain our “own” open models for data extraction |
| implement tooling that can count electrons in inorganic compounds | sadra aghajani | for various downstream tasks (e.g., testing and applying chemical heuristics), it would be great to have tooling that can reliably count electrons in inorganic compounds |
| llm-based materials descriptions as gnn node embeddings | | in some cases, we have only “fuzzy” material descriptions. llms can provide embeddings for those that we can use in gnns |
| parserbench | | writing file-conversion tools is one of the most promising applications for making research-data management scale. to be able to develop those systems, we need a reliable benchmark for parser-writing performance |
| text-to-dsl | | tune models to formalize their reasoning in a domain-specific language (dsl) that can be verified more formally and scaled |
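
as a possible starting point for the electron-counting idea in the table above, here is a minimal sketch that sums the total valence electrons of a species from its composition and net charge. the lookup table, the function name, and the composition-as-dict input are all assumptions made for illustration; they are not part of any existing tooling. note that this is the simplest total count, not a metal-centered count such as the one used in the 18-electron rule.

```python
# Hypothetical lookup table: group-based valence electron counts
# for a handful of elements (an illustration, not a complete table).
VALENCE_ELECTRONS = {
    "H": 1, "C": 4, "N": 5, "O": 6, "F": 7, "Cl": 7,
    "Fe": 8, "Co": 9, "Ni": 10, "Cu": 11,
}


def count_valence_electrons(composition: dict[str, int], charge: int = 0) -> int:
    """Sum valence electrons over all atoms and correct for the net charge."""
    total = 0
    for element, n_atoms in composition.items():
        if element not in VALENCE_ELECTRONS:
            raise KeyError(f"no valence-electron entry for element {element!r}")
        total += VALENCE_ELECTRONS[element] * n_atoms
    # A negative charge adds electrons, a positive charge removes them.
    return total - charge


# Hexacyanoferrate(II), [Fe(CN)6]4-, given as a composition plus charge.
ferrocyanide = count_valence_electrons({"Fe": 1, "C": 6, "N": 6}, charge=-4)
# Nickel tetracarbonyl, Ni(CO)4, a neutral species.
nickel_carbonyl = count_valence_electrons({"Ni": 1, "C": 4, "O": 4})
```

a real tool would additionally need formula parsing, a complete element table, and ligand-aware counting schemes; those are left open here.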