SLiCK: Exploiting Subsequences for Length-Constrained Keyword Spotting
Authors: Kumari Nishu, Minsik Cho, Devang Naik
User-defined keyword spotting on a resource-constrained edge device is challenging. However, keywords are often bounded by a maximum keyword length, a property that has been largely under-leveraged in prior work. Our analysis of the keyword-length distribution shows that user-defined keyword spotting can be treated as a length-constrained problem, eliminating the need for aggregation over variable text length. This leads to our proposed method for efficient keyword spotting, SLiCK (exploiting Subsequences for Length-Constrained Keyword spotting). We further introduce a subsequence-level matching scheme to learn audio-text relations at a finer granularity, thus distinguishing similar-sounding keywords more effectively through enhanced context. In SLiCK, the model is trained with a multi-task learning approach using two modules: Matcher (an utterance-level matching task and a novel subsequence-level matching task) and Encoder (a phoneme recognition task). The proposed method improves on the baseline results on the LibriPhrase Hard dataset, increasing AUC from 88.52 to 94.9 and reducing EER from 18.82 to 11.1.
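To make the length-constrained and subsequence-level ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes a keyword is represented as a list of tokens (for example phonemes), that it is padded to a fixed maximum length, and that "subsequences" means prefixes of the token list; the abstract does not specify any of these details. The names `PAD`, `pad_to_max_length`, `prefix_subsequences`, and `subsequence_match_labels` are hypothetical.

```python
from typing import List, Sequence

PAD = "<pad>"  # hypothetical padding symbol, not taken from the paper


def pad_to_max_length(tokens: Sequence[str], max_len: int) -> List[str]:
    """Pad a keyword to a fixed length so no variable-length aggregation is needed."""
    if len(tokens) > max_len:
        raise ValueError("keyword exceeds the assumed maximum length")
    return list(tokens) + [PAD] * (max_len - len(tokens))


def prefix_subsequences(tokens: Sequence[str]) -> List[List[str]]:
    """Enumerate prefixes of the keyword (an assumed definition of 'subsequence')."""
    return [list(tokens[:k]) for k in range(1, len(tokens) + 1)]


def subsequence_match_labels(enrolled: Sequence[str], spoken: Sequence[str]) -> List[int]:
    """For each prefix length k of the enrolled keyword, emit 1 if the first k
    tokens of the enrolled and spoken sequences agree, else 0."""
    return [
        1 if list(enrolled[:k]) == list(spoken[:k]) else 0
        for k in range(1, len(enrolled) + 1)
    ]


if __name__ == "__main__":
    enrolled = ["k", "ae", "t"]   # toy token sequence for an enrolled keyword
    spoken = ["k", "ae", "p"]     # similar-sounding utterance
    print(pad_to_max_length(enrolled, 5))
    print(prefix_subsequences(enrolled))
    print(subsequence_match_labels(enrolled, spoken))
```

Under these assumptions, a similar-sounding pair yields per-prefix labels that show where the two sequences diverge, which is a finer training signal than a single utterance-level match or mismatch label.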
June 12, 2023 | research area: Speech and Natural Language Processing | conference: Interspeech
June 1, 2021 | research area: Speech and Natural Language Processing | conference: ICASSP