Data Alignment and Duration Modelling in VITS

Hanzlíček, Zdeněk

Files

TSD2024_978-3-031-70566-3.pdf (2.57 MB)

Date issued

2024

Authors

Hanzlíček, Zdeněk

Publisher

Springer International Publishing

Abstract

The paper analyzes data alignment and duration modelling in the modern end-to-end speech synthesis model VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech). The standard version of VITS uses the MAS (Monotonic Alignment Search) procedure to align the input text/phones with the corresponding speech during training; the alignment is also used to obtain the phoneme durations on which the stochastic duration predictor is trained. This study analyzes the resulting MAS alignment and compares it with a reference alignment obtained by an LSTM-based phonetic segmentation system. We also examine the performance of VITS when the reference phonetic segmentation replaces the default MAS alignment. The comparison shows that while the original VITS is still slightly preferred in terms of quality, it provides a less interpretable data alignment. Duration modelling is more transparent in the modified version, which allows better duration control and modification. The analysis was carried out on two Czech voices.
This paper analyzes training-data alignment and duration modelling in the modern speech synthesis system VITS. The standard version of VITS uses the MAS procedure to find the alignment between text and speech during training. The training of the stochastic duration predictor is also based on this alignment. This study examines the resulting alignment and compares it with a reference phonetic segmentation. It also compares the performance of VITS when MAS is replaced by this phonetic segmentation. The results show that the original version of VITS achieves slightly better quality, but at the cost of poorer interpretability and controllability of durations when generating synthetic speech.
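
As an illustration of the alignment step described in the abstract, the following is a minimal sketch of a monotonic alignment search by dynamic programming, with per-phone durations read off the best path. It is not taken from the paper or from the VITS code base. The function name, the `log_lik` matrix layout (phones in rows, speech frames in columns) and the toy scores are assumptions made for this example only.

```python
import numpy as np


def monotonic_alignment_search(log_lik):
    """Return the number of frames assigned to each phone.

    log_lik[i, j] is an assumed score (e.g. a log-likelihood) of speech
    frame j under phone i. The path is monotonic, never skips a phone,
    and gives every phone at least one frame.
    """
    n_phones, n_frames = log_lik.shape
    if n_phones > n_frames:
        raise ValueError("need at least one frame per phone")

    # q[i, j]: best cumulative score of a path that ends at (phone i, frame j).
    q = np.full((n_phones, n_frames), -np.inf)
    q[0, 0] = log_lik[0, 0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_phones)):
            stay = q[i, j - 1]  # frame j-1 belonged to the same phone
            advance = q[i - 1, j - 1] if i > 0 else -np.inf  # previous phone
            q[i, j] = max(stay, advance) + log_lik[i, j]

    # Backtrack from the last phone and last frame, counting frames per phone.
    durations = np.zeros(n_phones, dtype=int)
    i = n_phones - 1
    for j in range(n_frames - 1, -1, -1):
        durations[i] += 1
        if j > 0 and i > 0 and (i == j or q[i - 1, j - 1] > q[i, j - 1]):
            i -= 1
    return durations


# Toy scores: 3 phones, 6 frames, with an obvious best segmentation.
example = np.full((3, 6), -10.0)
example[0, 0:2] = 0.0
example[1, 2:5] = 0.0
example[2, 5:6] = 0.0
durations = monotonic_alignment_search(example)
```

In this sketch the returned frame counts play the role of the phoneme durations that the abstract says are passed on to the duration predictor. Replacing them with counts derived from an external phonetic segmentation corresponds to the modified setup that the paper compares against.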

Subject(s)

text-to-speech synthesis, VITS, MAS, duration

Item identifier

http://hdl.handle.net/11025/60352
https://doi.org/10.1007/978-3-031-70566-3_11

Collections

Conference papers (NTIS)
