Effects of Large Multi-Speaker Models on the Quality of Neural Speech Synthesis
Date issued
2024
Publisher
Západočeská univerzita v Plzni
Abstract
These days, speech synthesis is usually performed by neural models (Tan et al., 2021). A neural speech synthesizer depends on a large number of parameters, whose values must be acquired during model training. In many situations, the result of training can be improved by fine-tuning a pre-trained model, i.e. using the parameter values of a model which has been trained on different training data to initialize the parameters of the target model before the training process begins (Zhang et al., 2023).
In the field of speech synthesis, a pre-trained model is a speech synthesizer which has been trained to synthesize the voice of another speaker. Furthermore, we can use a multi-speaker pre-trained model, which has been trained on speech recordings of multiple speakers, so it should contain general knowledge about human speech.
This paper describes how the number of speakers used to train a pre-trained model affects the quality of the final synthetic speech. We used a single-speaker model as well as two multi-speaker models for fine-tuning, and we compared the resulting models in a listening test.
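To illustrate the fine-tuning setup the abstract describes (initializing a target model's parameters from a pre-trained model before training continues), here is a minimal PyTorch sketch. The paper does not specify its training framework, and the model architecture, checkpoint filename, and hyperparameters below are all hypothetical stand-ins rather than the authors' actual system:

```python
import torch
import torch.nn as nn

class TinySynthesizer(nn.Module):
    """Stand-in for a neural speech synthesizer (hypothetical architecture)."""
    def __init__(self, in_dim=80, hidden=256, out_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# A pre-trained (e.g., multi-speaker) model's weights would normally come
# from an existing checkpoint; we simulate one here for a self-contained run.
pretrained = TinySynthesizer()
torch.save(pretrained.state_dict(), "pretrained_multispeaker.pt")

# Fine-tuning: initialize the target model from the pre-trained parameters
# instead of a random initialization, then keep training on data from the
# target speaker only.
target = TinySynthesizer()
target.load_state_dict(torch.load("pretrained_multispeaker.pt"))

optimizer = torch.optim.Adam(target.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# One illustrative training step on dummy target-speaker data.
features = torch.randn(8, 80)  # stand-in acoustic inputs
targets = torch.randn(8, 80)   # stand-in acoustic targets
optimizer.zero_grad()
loss = loss_fn(target(features), targets)
loss.backward()
optimizer.step()
```

The only difference from training from scratch is the `load_state_dict` call: the target model starts from parameter values learned on other speakers' data, which is what the paper compares across single-speaker and multi-speaker pre-trained models.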
Subject(s)
large multi-speaker models, neural speech synthesis