Poster

Assessing Chemical Foundation Models for Glycan Representation and Tasks

Abstract

Glycans, also known as carbohydrates, play a fundamental role in systems biology. These complex sugars occur in free form or linked to other biomolecules, including lipids, proteins, and RNAs, where they critically regulate processes such as protein folding, immune recognition, and cell–cell communication. Despite their importance, glycans remain underexplored in computational biology due to their branched, non-linear structures and the lack of standardized representations. In contrast, other biomolecules such as DNA, RNA, proteins, and small molecules have benefited from machine learning models that capture sequence and structural properties. Here, we present the first evaluation of current state-of-the-art chemical foundation models for glycan-specific tasks and transfer learning. Glycans represented in IUPAC-condensed notation are systematically converted to SMILES to enable compatibility with small-molecule models such as MolFormer, ChemBERTa, and MAMMAL. Across four biologically relevant tasks (glycan taxonomy prediction, glycan immunogenicity prediction, glycosylation type classification, and protein–glycan interaction prediction), we evaluate static and fine-tuned embeddings on the GlycanML benchmark (composed of the SugarBase, GlyConnect, and LectinOracle datasets), providing the first systematic assessment of chemical foundation models on glycan representation learning tasks. This work highlights the gaps in current models and establishes a foundation for integrating glycans into multi-modal AI frameworks, paving the way for their inclusion in large-scale in silico biological systems.