Data Alignment and Duration Modelling in VITS

Hanzlíček, Zdeněk

dc.contributor.author	Hanzlíček, Zdeněk
dc.date.accessioned	2025-06-20T08:36:09Z
dc.date.available	2025-06-20T08:36:09Z
dc.date.issued	2024
dc.date.updated	2025-06-20T08:36:09Z
dc.description.abstract	The paper analyses data alignment and duration modelling in the modern end-to-end speech synthesis model VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech). The standard version of VITS uses the MAS (Monotonic Alignment Search) procedure to align the input text/phones with the corresponding speech during training; the alignment is also used to obtain the phoneme durations on which the stochastic duration predictor is trained. This study analyses the resulting MAS alignment and compares it with a reference alignment obtained by an LSTM-based phonetic segmentation system. We also examine the performance of VITS when the reference phonetic segmentation replaces the default MAS alignment. The comparison shows that while the original VITS is still slightly preferred in terms of quality, it provides a less interpretable data alignment. Duration modelling is more transparent in the modified version, allowing better duration control and modification. The analysis was carried out on two Czech voices.	en
dc.description.abstract	This paper analyses training data alignment and duration modelling in VITS, a modern speech synthesis system. The standard version of VITS uses the MAS procedure to find the alignment between text and speech during training. The training of the stochastic duration predictor is also based on this alignment. This study examines the resulting alignment and compares it with a reference phonetic segmentation. It also compares the performance of VITS when MAS is replaced by this phonetic segmentation. The results show that the original version of VITS achieves slightly better quality, but at the cost of poorer interpretability and controllability of duration when generating synthetic speech.	cz
dc.format	12
dc.identifier.document-number	001307848400011
dc.identifier.doi	10.1007/978-3-031-70566-3_11
dc.identifier.isbn	978-3-031-70565-6
dc.identifier.issn	0302-9743
dc.identifier.obd	43944185
dc.identifier.orcid	Hanzlíček, Zdeněk 0000-0002-4001-9289
dc.identifier.uri	http://hdl.handle.net/11025/60352
dc.language.iso	en
dc.project.ID	GA22-27800S
dc.publisher	Springer International Publishing
dc.relation.ispartofseries	27th International Conference on Text, Speech, and Dialogue, TSD 2024
dc.subject	text-to-speech synthesis	en
dc.subject	VITS	en
dc.subject	MAS	en
dc.subject	duration	en
dc.subject	syntéza řeči	cz
dc.subject	VITS	cz
dc.subject	MAS	cz
dc.subject	trvání	cz
dc.title	Data Alignment and Duration Modelling in VITS	en
dc.title	Zarovnání dat a modelování trvání v modelu VITS	cz
dc.type	Stať ve sborníku (D)
dc.type	STAŤ VE SBORNÍKU
dc.type.status	Published Version
local.files.count	1	*
local.files.size	2690386	*
local.has.files	yes	*
local.identifier.eid	2-s2.0-85204377525

Files

Original bundle

Name: TSD2024_978-3-031-70566-3.pdf
Size: 2.57 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission

Collections

Conference papers (NTIS)