Evaluating Phoneme-Level Pretraining in Czech Text-to-Speech Synthesis

Vladař, Lukáš

Files

paper.pdf (1.03 MB)

Date issued

2026

Publisher

Springer

Abstract

Pretrained phoneme-level models such as Phoneme-Level BERT (PL-BERT) and XPhoneBERT have shown promising results in enhancing prosody and expressiveness in English TTS systems. However, their effectiveness in less-studied languages with different prosodic characteristics, such as Czech, remains underexplored. This paper investigates their applicability in Czech text-to-speech synthesis by evaluating PL-BERT within the StyleTTS 2 framework and XPhoneBERT within the VITS architecture. We conduct experiments under both high- and low-resource conditions using professionally read Czech news-style speech to determine the benefits of these pretrained phoneme-level models in Czech speech synthesis and to compare them with each other.

Subject(s)

phoneme-level pretraining, PL-BERT, XPhoneBERT, VITS, StyleTTS 2

Item identifier

http://hdl.handle.net/11025/67717
https://doi.org/10.1007/978-3-032-02548-7_14

Collections

Conference Papers (KKY)
