Controllable Neural Text-To-Speech Synthesis Using Intuitive Prosodic Features

Authors: Tuomo Raitio, Ramya Rasipuram, Dan Castellani

Modern neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the prosody of generated utterances often represents the average prosodic style of the database instead of having wide prosodic variation. Moreover, the generated prosody is solely defined by the input text, which does not allow for different styles for the same sentence. In this work, we train a sequence-to-sequence neural network conditioned on acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions. Experiments show that a model conditioned on sentence-wise pitch, pitch range, phone duration, energy, and spectral tilt can effectively control each prosodic dimension and generate a wide variety of speaking styles, while maintaining a mean opinion score (4.23) similar to that of our Tacotron baseline (4.26).
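To make the idea of sentence-wise conditioning features more concrete, the following is a minimal sketch of how frame-level and phone-level measurements might be reduced to one value per sentence. It is not the authors' implementation: the function names, the use of log-F0, the 5th to 95th percentile definition of pitch range, the convention that unvoiced frames carry F0 = 0, and the [-1, 1] normalization are all assumptions made for illustration. Spectral tilt is omitted because it would require a spectral-envelope estimate.

```python
import numpy as np

def sentence_prosody_features(f0_hz, frame_energy, phone_durations_s):
    """Hypothetical helper: reduce frame/phone-level values to per-sentence scalars."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    voiced = f0_hz > 0                      # assumption: unvoiced frames have F0 = 0
    log_f0 = np.log(f0_hz[voiced])          # assumption: pitch is measured on a log scale
    lo, hi = np.percentile(log_f0, [5, 95])  # assumption: robust range, not min/max
    return {
        "pitch": float(log_f0.mean()),
        "pitch_range": float(hi - lo),
        "phone_duration": float(np.mean(np.log(phone_durations_s))),
        "energy": float(np.mean(frame_energy)),
    }

def normalize(value, lo, hi):
    """Map a feature linearly to [-1, 1] given corpus-level bounds, clipping outliers."""
    return float(np.clip(2.0 * (value - lo) / (hi - lo) - 1.0, -1.0, 1.0))
```

Under these assumptions, each sentence is described by a small vector of scalars. At synthesis time, a user could set any one of them (for example, a higher normalized pitch value) to steer the corresponding prosodic dimension while leaving the text unchanged.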

Related readings and updates.

April 2, 2025 | Research area: Speech and Natural Language Processing | Conference: NAACL

Spoken language understanding research to date has generally carried a heavy text perspective. Most datasets are derived from text, which is then synthesized into speech, and most models typically rely on automatic transcriptions of speech. This is to the detriment of prosody, the additional information carried by the speech signal beyond the phonetics of the words themselves, which is difficult to recover from text alone. In this work, we…

February 18, 2022 | Research areas: Human-Computer Interaction, Speech and Natural Language Processing | Conference: ICASSP

Neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the synthetic speech often represents the average prosodic style of the database instead of having more versatile prosodic variation. Moreover, many models lack the ability to control the output prosody, which does not allow for different styles for the same text input. In this work, we train a non-autoregressive parallel neural TTS…

The Role of Prosody in Spoken Question Answering

Hierarchical Prosody Modeling and Control in Non-Autoregressive Parallel Neural TTS
