Zelun Tony Zhang, Nick Von Felten, et al.
CHI 2026
Previous work has shown that people who rely on lipreading often prefer a frontal view of their interlocutor, although a profile view can sometimes make certain lip gestures more noticeable. This work presents an assistive tool that takes an unconstrained video of a speaker, captured from an arbitrary viewpoint, and not only locates the mouth region but also renders augmented versions of the lips in both frontal and profile views. This is achieved with deep Generative Adversarial Networks (GANs) trained on pairs of images: each training pair consists of a mouth image taken at a random angle and the corresponding image (i.e., the same mouth shape, person, and lighting condition) taken at a fixed view. At test time, the networks receive an unseen mouth image taken at an arbitrary angle and map it to the fixed views, frontal and profile. Because building a large-scale pairwise dataset is time-consuming, we train on realistic synthetic 3D models and use videos of real subjects as test input. Our approach is speaker-independent and language-independent, and our results demonstrate that the GAN can produce visually compelling output that may assist people with hearing impairment.
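The paired supervision described above (arbitrary-angle mouth image mapped to the same mouth at a fixed view) can be illustrated with a deliberately minimal sketch. This is not the paper's architecture: the generator here is a single linear map, the adversarial term is omitted, and only the L1 reconstruction loss common in paired image-to-image translation is kept. The image pair is random stand-in data, not real mouth images.

```python
import numpy as np

rng = np.random.default_rng(0)
size = 8  # tiny 8x8 "images" for illustration

# One hypothetical training pair: a mouth seen at an arbitrary angle (x)
# and the same mouth rendered at a fixed frontal view (y), both flattened.
x = rng.random(size * size)
y = rng.random(size * size)

# Toy linear "generator" standing in for the GAN generator network.
G = np.zeros((size * size, size * size))

def l1_step(G, x, y, lr=0.1):
    """One gradient step on the L1 reconstruction loss |G @ x - y|."""
    err = G @ x - y
    loss = float(np.abs(err).mean())
    # Subgradient of mean |err| with respect to G.
    grad = np.outer(np.sign(err), x) / err.size
    return G - lr * grad, loss

first_loss = None
for _ in range(300):
    G, loss = l1_step(G, x, y)
    if first_loss is None:
        first_loss = loss
```

After a few hundred steps the reconstruction loss drops well below its initial value, showing how paired examples alone supervise the view mapping; the full method additionally uses an adversarial discriminator to make the generated fixed-view lips look realistic.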
Miriam Rateike, Brian Mboya, et al.
DLI 2025
Laura Elena Cué La Rosa, Maciel Zortea, et al.
LAGIRS 2020
Jung koo Kang
NeurIPS 2025