Sentences vs Phrases in Neural Speech Synthesis
Date issued
2024
Publisher
Springer International Publishing
Abstract
Neural network-based TTS models are usually trained, and used for inference, on whole sentences or, more generally, on longer chunks of speech. However, this may negatively affect the responsiveness of the TTS system in cases where latency should be kept as low as possible. We present experiments with shorter chunks, namely phrases, and examine their impact on speech quality when various combinations of chunk lengths are used for training and inference in the VITS synthesizer.
Subject(s)
phrase, sentence, neural text-to-speech, VITS