VITS, Tacotron or FastSpeech? Challenging some of the most popular synthesizers

Abstract

The paper presents a comparative study of three neural speech synthesizers, namely VITS, Tacotron 2 and FastSpeech 2, which are among the most popular TTS systems today. Because the systems differ in nature, they were tested from several points of view: the analysis covers not only the overall quality of the synthesized speech but also the capability of processing either orthographic or phonetic inputs. The analysis was carried out on two English voices and one Czech voice.

Description

Subject(s)

text-to-speech synthesis, VITS, FastSpeech2, Tacotron2

Citation