Soft-Masked Diffusion Language Models
Michael Hersche, Samuel Moor, et al.
ICLR 2026
"Read the room", or the ability to infer others' mental states from subtle social cues, is a hallmark of human social intelligence, but remains a major challenge for current AI systems. Existing social reasoning datasets are limited in complexity, scale, and coverage of mental states, falling short of the rich causal dynamics found in real-life interactions. In this work, we introduce -Bench, an evaluation benchmark with fine-grained annotations of belief, intent, desire, emotion, and their causal chains in complex scenarios. Furthermore, we introduce -FDT, a large-scale training set generated through a novel automated pipeline with the same chain structure. We conduct a comprehensive evaluation of state-of-the-art (SOTA) large vision-language models (LVLMs) on -Bench, revealing substantial deficiencies in consistent multi-step social reasoning. We also fine-tune a 7B model on -FDT, achieving notable improvements across multiple relevant benchmarks. Our contributions are three-fold: (i) a novel benchmark with richly annotated, multi-step causal reasoning data; (ii) systematic evidence that SOTA LVLMs fall far short of human-level reasoning; (iii) a scalable training dataset that significantly enhances social reasoning performance.