Keynote

Multi-modal Foundation Models for Spectroscopy and Lab Operations

Abstract

We present a multimodal, multitask foundation model for automated interpretation of IR and NMR spectra toward molecular structure elucidation. The model jointly ingests ¹H-NMR, ¹³C-NMR and IR spectra (optionally alongside molecular formula information) and generates the corresponding structure as a SMILES string. To overcome the scarcity of paired experimental datasets, we pretrain at scale on simulated multimodal spectra and then finetune on a smaller set of experimental measurements. A multitask formulation (predicting from each single modality as well as from combined inputs) forces the network to learn modality-specific cues and their synergies, while remaining robust to missing modalities. Across experimental benchmarks, the approach achieves up to 96% Top-1 accuracy and performs on par with expert human chemists. We further show that incorporating unpaired spectral data can improve performance, offering a practical route to leverage heterogeneous laboratory archives. Overall, multimodal foundation models provide a scalable path to faster, more accurate, and more reproducible spectroscopic interpretation in routine chemical workflows.
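
The multitask formulation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the sample layout, the `multitask_views` helper, the use of `None` as a mask for absent modalities, and the optional pairwise subsets are all assumptions made for illustration, and the optional molecular formula input is left out for brevity. The idea shown is only that one spectrum set yields several training views, one per single modality plus one for the combined input, all sharing the same SMILES target.

```python
from itertools import combinations

# Modality names follow the abstract; everything else here is hypothetical.
MODALITIES = ("1H-NMR", "13C-NMR", "IR")


def multitask_views(sample, include_pairs=False):
    """Return one (inputs, target) training view per task.

    `sample` is an assumed dict with one entry per modality plus a
    "smiles" target. Modalities outside a task are masked with None,
    which is one simple way to expose a model to missing inputs.
    """
    # Single-modality tasks and the full combination, as in the abstract.
    sizes = {1, len(MODALITIES)}
    if include_pairs:
        # Assumption: the abstract does not say whether pairs are used.
        sizes.add(2)

    views = []
    for k in sorted(sizes):
        for subset in combinations(MODALITIES, k):
            inputs = {m: (sample[m] if m in subset else None) for m in MODALITIES}
            views.append((inputs, sample["smiles"]))
    return views


# Toy sample: short lists stand in for real spectra.
sample = {
    "1H-NMR": [0.1, 0.9],
    "13C-NMR": [0.4],
    "IR": [0.7, 0.2],
    "smiles": "CCO",
}
views = multitask_views(sample)
print(len(views))  # 3 single-modality views + 1 combined view
```

Training on every view of each sample is one plausible way to obtain the robustness to missing modalities that the abstract reports; sampling a random subset per batch would be an alternative with the same intent.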