Zelun Tony Zhang, Nick Von Felten, et al.
CHI 2026
The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to a description without access to temporal annotations during training. Prior work uses co-attention mechanisms to model relationships between the vision and language data, but these methods lack contextual information between video frames that can help determine how well a segment relates to the query. To address this, we propose an efficient Latent Graph Co-Attention Network (LoGAN) that exploits fine-grained frame-by-word interactions to jointly reason about the correspondences between all possible pairs of frames, providing context cues absent in prior work. Experiments on the DiDeMo and Charades-STA datasets demonstrate the effectiveness of our approach: we improve Recall@1 by 5-20% over prior weakly-supervised methods, and by 11% over strongly-supervised methods on DiDeMo, while also using significantly fewer model parameters than other co-attention mechanisms.
Zirui Yan, Dennis Wei, et al.
ACL 2026
Miriam Rateike, Brian Mboya, et al.
DLI 2025
Rosaura G. Vidalmata, Walter J. Scheirer, et al.
WACV 2021