Can hospitals afford digital storage for imagery?
W.F. Cody, H.M. Gladney, et al.
SPIE Medical Imaging 1994
Audio-Visual Speech Recognition (AVSR) combines lip-based video with audio and can improve performance in noise, but most methods are trained only on English data. One limitation is the lack of large-scale multilingual video data, which makes it hard to train models from scratch. In this work, we propose mWhisper-Flamingo for multilingual AVSR which combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT). To enable better multi-modal integration and improve the noisy multilingual performance, we introduce decoder modality dropout where the model is trained both on paired audio-visual inputs and separate audio/visual inputs. mWhisper-Flamingo achieves state-of-the-art WER on MuAViC, an AVSR dataset of 9 languages. Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
Joy Y. Cheng, Daniel P. Sanders, et al.
SPIE Advanced Lithography 2008
Donald Samuels, Ian Stobert
SPIE Photomask Technology + EUV Lithography 2007
Y.Y. Li, K.S. Leung, et al.
J Combin Optim