Sentences vs Phrases in Neural Speech Synthesis

Date issued

2024

Publisher

Springer International Publishing

Abstract

Neural network-based TTS models are usually trained on, and run inference over, whole sentences or, more generally, longer chunks of speech. However, long chunks may hurt the responsiveness of the TTS system when latency must be kept as low as possible. We present experiments with shorter chunks, namely phrases, and examine how speech quality is affected when various combinations of chunk lengths are used for training and inference in the VITS synthesizer.

Subject(s)

phrase, sentence, neural text-to-speech, VITS

Citation