Read the Room: Video Social Reasoning with Mental-Physical Causal Chains

Lixing Niu; Jiapeng Li; Xingping Yu; Xinyi Dong; Shu Wang; Ruining Feng; Bo Wu; Ping Wei; Yisen Wang; Lifeng Fan

ICLR 2026

Conference paper

23 Apr 2026

Abstract

"Reading the room", the ability to infer others' mental states from subtle social cues, is a hallmark of human social intelligence but remains a major challenge for current AI systems. Existing social reasoning datasets are limited in complexity, scale, and coverage of mental states, and fall short of the rich causal dynamics found in real-life interactions. In this work, we introduce $R^3$-Bench, an evaluation benchmark with fine-grained annotations of belief, intent, desire, emotion, and their causal chains in complex scenarios. Furthermore, we introduce $R^3$-FDT, a large-scale training set with the same chain structure, generated through a novel automated pipeline. We conduct a comprehensive evaluation of state-of-the-art (SOTA) large vision-language models (LVLMs) on $R^3$-Bench, revealing substantial deficiencies in consistent multi-step social reasoning. We also fine-tune a 7B model on $R^3$-FDT, achieving notable improvements across multiple relevant benchmarks. Our contributions are threefold: (i) a novel benchmark with richly annotated, multi-step causal reasoning data; (ii) systematic evidence that SOTA LVLMs fall far short of human-level reasoning; and (iii) a scalable training dataset that significantly enhances social reasoning performance.