VITS: Quality vs. Speed Analysis

Date issued

2023

Publisher

Springer International Publishing

Abstract

In this paper, we analyze the performance of a modern end-to-end speech synthesis model called Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS). We build on the original VITS model and examine how different modifications to its architecture affect synthetic speech quality and computational complexity. Experiments were carried out with two Czech voices, one male and one female. To assess the quality of speech synthesized by the different modified models, MUSHRA listening tests were performed. Computational complexity was measured as the real-time factor, i.e., synthesis speed relative to real time. While the original VITS model is still preferred in terms of speech quality, we present a modification of the original structure with a significantly faster response that still provides acceptable output quality. Such a configuration can be used when system response latency is critical.
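The real-time factor mentioned above can be sketched as a simple ratio of audio duration to wall-clock synthesis time. This is a minimal illustration, not code from the paper; the synthesizer stand-in and function names are hypothetical, and note that some works define RTF inversely (compute time over audio time) — here, values above 1.0 mean faster-than-real-time synthesis.

```python
import time

def real_time_factor(synthesize, text, audio_duration_s):
    """Ratio of produced audio duration to wall-clock synthesis time.

    Values above 1.0 mean the model synthesizes faster than real time.
    (Hypothetical helper for illustration only.)
    """
    start = time.perf_counter()
    synthesize(text)
    elapsed = time.perf_counter() - start
    return audio_duration_s / elapsed

# Stand-in for a TTS forward pass (hypothetical); a real VITS call
# would run the text encoder, flow, and decoder to produce a waveform.
def dummy_synthesize(text):
    time.sleep(0.01)

rtf = real_time_factor(dummy_synthesize, "Dobrý den", audio_duration_s=1.0)
print(f"RTF: {rtf:.1f}")
```

In a real benchmark, `audio_duration_s` would be computed from the generated waveform length and sampling rate, and timings would be averaged over many utterances.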

Subject(s)

neural speech synthesis, end-to-end modeling, variational autoencoder, VITS, speed optimization