
Representations from models such as Bidirectional Encoder Representations from Transformers (BERT) and Hidden units BERT (HuBERT) have helped to achieve state-of-the-art performance in dimensional speech emotion recognition. Both BERT and HuBERT generate fairly high-dimensional representations, and neither model was trained with the emotion recognition task in mind. Such high-dimensional representations result in speech emotion models with large parameter counts, increasing both memory and computational cost. In this work, we investigate selecting representations based on their task saliency, which may reduce model complexity without sacrificing dimensional emotion estimation performance. In addition, we investigate modeling label uncertainty in the form of grader opinion variance, and demonstrate that such information can help to improve the model's generalization capacity and robustness. Finally, we analyzed the robustness of the speech emotion model against acoustic degradation and observed that selecting salient representations from pre-trained models and modeling label uncertainty helped to improve the model's generalization to unseen data containing acoustic distortions in the form of environmental noise and reverberation.
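To make the two ideas concrete, the sketch below illustrates one plausible way to (a) keep only task-salient dimensions of a pre-trained representation and (b) fold grader-opinion variance into the training loss. This is a minimal PyTorch sketch under our own assumptions: the saliency criterion (per-dimension correlation with the emotion score), the dimensionality constants, and the variance-based loss weighting are illustrative choices, not the exact method described in the paper.

```python
import torch
import torch.nn as nn

HUBERT_DIM = 768   # assumed size of a pre-trained HuBERT layer output
N_SALIENT = 128    # assumed number of task-salient dimensions to keep


def select_salient_dims(embeddings, emotion_scores, k=N_SALIENT):
    """Rank embedding dimensions by (approximate) absolute correlation with
    the emotion score and keep the top-k. One simple notion of task saliency."""
    # embeddings: (num_utterances, HUBERT_DIM); emotion_scores: (num_utterances,)
    x = embeddings - embeddings.mean(dim=0, keepdim=True)
    y = emotion_scores - emotion_scores.mean()
    corr = (x * y.unsqueeze(1)).mean(dim=0) / (x.std(dim=0) * y.std() + 1e-8)
    return corr.abs().topk(k).indices  # indices of the k most salient dimensions


class DimensionalEmotionHead(nn.Module):
    """Small regression head over the selected dimensions, predicting a mean
    and a log-variance per emotion attribute (e.g., valence or arousal)."""

    def __init__(self, in_dim=N_SALIENT):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.mean = nn.Linear(64, 1)
        self.log_var = nn.Linear(64, 1)

    def forward(self, x):
        h = self.backbone(x)
        return self.mean(h).squeeze(-1), self.log_var(h).squeeze(-1)


def uncertainty_loss(pred_mean, pred_log_var, grader_mean, grader_var):
    """Gaussian negative log-likelihood against the mean grader score,
    down-weighting utterances whose graders disagreed more. One plausible
    way to expose grader-opinion variance to the model during training."""
    nll = 0.5 * (pred_log_var + (grader_mean - pred_mean) ** 2 / pred_log_var.exp())
    weights = 1.0 / (1.0 + grader_var)  # less weight on high-disagreement labels
    return (weights * nll).mean()
```

Reducing the input from HUBERT_DIM to N_SALIENT dimensions shrinks the first linear layer (and hence memory and compute) roughly in proportion, which is the motivation for saliency-based selection; the variance-weighted loss is one way ambiguous, high-disagreement samples can be prevented from dominating training.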

Related readings and updates.

Spontaneous speech emotion data usually contain perceptual grades, where graders assign an emotion score after listening to the speech files. Such perceptual grades introduce uncertainty in labels due to grader opinion variation. Grader variation is often addressed by using consensus grades as ground truth, where the emotion with the highest vote is selected; as a consequence, this fails to consider ambiguous instances where a speech sample may contain…
Pre-trained model representations have demonstrated state-of-the-art performance in speech recognition, natural language processing, and other applications. Models such as Bidirectional Encoder Representations from Transformers (BERT) and Hidden units BERT (HuBERT) have enabled the generation of lexical and acoustic representations that benefit speech recognition applications. We investigated the use of pre-trained model representations for…