Leveraging Large Language Models for Exploiting ASR Uncertainty

Authors: Pranay Dighe, Yi (Siri) Su, Daniel Zheng, Yunshu Liu, Vineet Garg, Xiaochuan Niu, Ahmed Tewfik

With the help of creative prompt engineering and in-context learning, large language models (LLMs) are known to generalize well on a variety of text-based natural language processing (NLP) tasks. To perform well on spoken language understanding (SLU) tasks, however, LLMs either need a built-in speech modality or must rely on speech-to-text conversion from an off-the-shelf automatic speech recognition (ASR) system. In this work, we focus on the latter setup, where the accuracy of the LLM on SLU tasks is constrained by the accuracy of a frozen ASR system on the given speech input. Specifically, we tackle the task of speech intent classification, where a high word error rate (WER) implies that the LLM may not have the correct textual information to understand the spoken intent. To alleviate this problem, we propose prompting the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. We first explore descriptive prompts that explain the concept of n-best lists in order to invoke the LLM's emergent abilities to understand the task, and then finetune LoRA adapters on the intent classification task. We demonstrate the efficacy of our approach on a binary device-directed speech detection task as well as on a keyword spotting task on the Google Speech Commands dataset, where systems using n-best list prompts outperform those using 1-best ASR outputs, thus paving the way for an efficient method to exploit ASR uncertainty via LLMs for speech-based applications.
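As a rough illustration of the n-best prompting idea described above, the following Python sketch formats a ranked list of ASR hypotheses into a single classification prompt. The function name `build_intent_prompt`, the prompt wording, the example hypotheses, and the label names are all assumptions made for illustration; they are not the prompts or labels used by the authors.

```python
from typing import Sequence


def build_intent_prompt(hypotheses: Sequence[str],
                        labels: Sequence[str],
                        n: int = 5) -> str:
    """Format the top-n ASR hypotheses (best first) into an intent prompt.

    Hypothetical helper: the wording below is an illustrative assumption,
    not the prompt from the paper.
    """
    if not hypotheses:
        raise ValueError("need at least one ASR hypothesis")
    top = list(hypotheses)[:n]  # n=1 reduces to the 1-best baseline
    lines = [
        "Below is an n-best list of ASR hypotheses for one spoken utterance,",
        "ranked from most to least likely. Any of them may contain recognition errors.",
        "",
    ]
    # Number the hypotheses so the model can see their rank.
    lines += [f"{rank}. {text}" for rank, text in enumerate(top, start=1)]
    lines += [
        "",
        f"Classify the speaker's intent as one of: {', '.join(labels)}.",
        "Intent:",
    ]
    return "\n".join(lines)


# Made-up hypotheses for a single utterance, ordered best first.
nbest = ["play the next song", "lay the next song", "play the necks song"]
labels = ["directed", "undirected"]

one_best_prompt = build_intent_prompt(nbest, labels, n=1)
nbest_prompt = build_intent_prompt(nbest, labels, n=3)
print(nbest_prompt)
```

Setting `n=1` gives the 1-best baseline prompt, so the same helper can be used to compare the two setups. The resulting string would then be passed to whichever LLM is being evaluated, with or without finetuned LoRA adapters.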

