AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition
Authors: Andrew Rouditchenko, Ronan Collobert, Tatiana Likhomanenko
Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR). We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination of labeled and unlabeled videos with continuously regenerated pseudo-labels. Our models are trained for speech recognition from audio-visual inputs and can perform speech recognition using both audio and visual modalities, or only one modality. Our method uses the same audio-visual model for both supervised training and pseudo-label generation, mitigating the need for external speech recognition models to generate pseudo-labels. AV-CPL obtains significant improvements in VSR performance on the LRS3 dataset while maintaining practical ASR and AVSR performance. Finally, using visual-only speech data, our method is able to leverage unlabeled visual speech to improve VSR.
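The core loop described above, in which one model is both trained on and used to regenerate pseudo-labels, can be illustrated with a toy sketch. This is not the authors' implementation: the 1-D nearest-centroid "model" and the names `fit_centroids`, `predict`, and `continuous_pseudo_labeling` are hypothetical stand-ins chosen for illustration, and nothing here models audio, video, or transcripts.

```python
# Toy sketch of continuous pseudo-labeling (assumption: a 1-D nearest-centroid
# classifier stands in for the audio-visual speech recognition model).

def fit_centroids(examples):
    """'Train' the model: compute the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in examples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Label x with the class whose centroid is closest."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def continuous_pseudo_labeling(labeled, unlabeled, rounds=5):
    # Supervised warm-up on the labeled data only.
    centroids = fit_centroids(labeled)
    for _ in range(rounds):
        # The same model regenerates pseudo-labels for the unlabeled data,
        # so no external recognizer is needed.
        pseudo = [(x, predict(centroids, x)) for x in unlabeled]
        # Retrain on labeled data plus the freshly generated pseudo-labels.
        centroids = fit_centroids(labeled + pseudo)
    return centroids


labeled = [(0.0, "a"), (10.0, "b")]
unlabeled = [1.0, 2.0, 9.0, 8.0]
model = continuous_pseudo_labeling(labeled, unlabeled)
```

In this toy setting the pseudo-labels are rebuilt from the current model in every round rather than generated once and frozen, which is the sense of "continuously regenerated" that the sketch tries to convey.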
May 12, 2023 | research area: Speech and Natural Language Processing | conference: ACL
May 19, 2019 | research area: Speech and Natural Language Processing | conference: ICML