
With the help of creative prompt engineering and in-context learning, large language models (LLMs) are known to generalize well on a variety of text-based natural language processing (NLP) tasks. However, to perform well on spoken language understanding (SLU) tasks, LLMs either need to be equipped with a built-in speech modality or need to rely on speech-to-text conversion from an off-the-shelf automatic speech recognition (ASR) system. In this work, we focus on the latter setup, where the accuracy of the LLM on SLU tasks is constrained by the accuracy of a frozen ASR system on the given speech input. Specifically, we tackle the task of speech intent classification, where a high word error rate (WER) implies that the LLM may not have the correct textual information to understand the spoken intent. To alleviate this problem, we propose to prompt the LLM with an n-best list of ASR hypotheses instead of only the error-prone 1-best hypothesis. We first explore prompting the LLM with descriptive prompts that explain the concept of n-best lists, to invoke the LLM's emergent abilities to understand the task. We then finetune LoRA adapters on the intent classification task. We demonstrate the efficacy of our approach on a binary device-directed speech detection task as well as on a keyword spotting task on the Google Speech Commands dataset, where systems using n-best list prompts outperform those using 1-best ASR outputs, thus paving the way for an efficient method to exploit ASR uncertainty via LLMs for speech-based applications.
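To make the idea of an n-best list prompt concrete, here is a minimal sketch of how ASR hypotheses might be formatted into a descriptive prompt for device-directed speech detection. The function name `build_nbest_prompt`, the prompt wording, and the use of a single numeric score per hypothesis are all assumptions made for illustration. The abstract does not give the exact prompts used in the paper.

```python
from typing import List, Tuple


def build_nbest_prompt(nbest: List[Tuple[str, float]], n: int = 5) -> str:
    """Format the top-n ASR hypotheses into one descriptive text prompt.

    `nbest` holds (hypothesis_text, score) pairs, where a higher score
    means the recognizer considers the hypothesis more likely.
    The wording below is hypothetical, not the paper's actual prompt.
    """
    if not nbest:
        raise ValueError("need at least one ASR hypothesis")

    # Rank hypotheses from most to least likely and keep the top n.
    ranked = sorted(nbest, key=lambda h: h[1], reverse=True)[:n]

    lines = [
        "Below is an n-best list of hypotheses from a speech recognizer, "
        "ranked from most to least likely. Some hypotheses may contain "
        "recognition errors.",
    ]
    for rank, (text, score) in enumerate(ranked, start=1):
        lines.append(f"{rank}. {text} (score: {score:.2f})")

    # Assumed task question for binary device-directed speech detection.
    lines.append("Is the utterance directed at the device? Answer yes or no.")
    return "\n".join(lines)


if __name__ == "__main__":
    hypotheses = [
        ("play some music", -1.2),
        ("play sum music", -2.5),
        ("lay some music", -3.0),
    ]
    print(build_nbest_prompt(hypotheses, n=2))
```

Passing only the first entry of the list would correspond to the 1-best baseline that the n-best prompts are compared against.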

Related readings and updates.

This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). Although shallow fusion is the most common approach to incorporate language models into E2E-ASR decoding, we face two practical problems with LLMs. (1) LLM inference is computationally costly. (2) There may be a vocabulary mismatch between the ASR model and the LLM. To resolve this mismatch, we need to…
This paper was accepted at the Adaptive Foundation Models (AFM) Workshop at NeurIPS 2024. Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-Directed Speech Detection (DDSD) from the follow-up queries is critical for enabling a naturalistic user experience. To this end, we explore the notion…