Effects of Training Strategies and the Amount of Speech Data on the Quality of Speech Synthesis

Date issued

2024

Publisher

Springer International Publishing

Abstract

The development of a speech synthesizer is often constrained by a lack of training data. This paper investigates how the amount of data used to train a speech synthesizer affects the quality of the resulting synthetic speech. To this end, we trained multiple VITS synthesizers on different amounts of training data and compared them using listening tests and the objective mel-cepstral distortion (MCD) measure. We also compared three training strategies: training a speech synthesizer from scratch, fine-tuning a single-speaker model, and fine-tuning a multi-speaker model.
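The abstract does not state how the MCD measure was computed. As a rough illustration only, the sketch below uses one commonly cited definition of mel-cepstral distortion: the per-frame value (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. The function name `mel_cepstral_distortion` and the input layout are assumptions for this example, not the authors' implementation. The sketch also assumes the two sequences are already time-aligned (for instance by dynamic time warping) and that the energy coefficient has been dropped.

```python
import math


def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    ref, syn: lists of frames, each frame a list of mel-cepstral
    coefficients. Alignment and removal of the 0th (energy) coefficient
    are assumed to have been done by the caller (hypothetical setup).
    """
    if len(ref) == 0 or len(ref) != len(syn):
        raise ValueError("sequences must be non-empty and equally long")

    scale = 10.0 / math.log(10.0)  # converts the distance to decibels
    total = 0.0
    for ref_frame, syn_frame in zip(ref, syn):
        if len(ref_frame) != len(syn_frame):
            raise ValueError("frames must have the same number of coefficients")
        squared = sum((r - s) ** 2 for r, s in zip(ref_frame, syn_frame))
        total += math.sqrt(2.0 * squared)

    # Average the per-frame distortion over all frames.
    return scale * total / len(ref)
```

Under this definition, lower values mean the synthetic speech is spectrally closer to the reference recording, and identical sequences give 0 dB.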

Subject(s)

fine-tuning, speech synthesis, training data, VITS
