Three Years of VoiceMOS Challenges: Lessons Learned by the UWB-NTIS-TTS Team

Kunešová, Marie

dc.contributor.author	Kunešová, Marie
dc.contributor.author	Matoušek, Jindřich
dc.contributor.author	Lehečka, Jan
dc.contributor.author	Švec, Jan
dc.contributor.author	Tihelka, Daniel
dc.contributor.author	Hanzlíček, Zdeněk
dc.date.accessioned	2026-04-02T18:05:36Z
dc.date.available	2026-04-02T18:05:36Z
dc.date.issued	2025
dc.date.updated	2026-04-02T18:05:36Z
dc.description.abstract	Automatic prediction of mean-opinion scores (MOS) promises a faster, cheaper alternative to listening tests, yet robust generalization across speakers, languages, and domains remains a significant challenge. This article presents our system designs and experimental results from three years of participation in the VoiceMOS Challenges (2022–2024), covering MOS prediction for synthesized or voice-converted speech and singing voice, including out-of-domain and cross-language conditions. We evaluate six neural architectures – wav2vec 2.0, QuartzNet, CNN-RNN, LDNet, RawNet3, and HiFi-GAN – and their ensembles. Across all tasks, we find that 1) self-supervised acoustic encoders are the most consistently reliable foundation, 2) ensembling yields rapidly diminishing returns once complementary representations are covered, and 3) the diversity and balance of training data outweigh architectural complexity. Notably, the indiscriminate fusion strategy that performed well in 2022 degrades under the mismatched French TTS conditions of 2023, emphasizing the importance of out-of-domain validation. Further experiments show that carefully pruned ensembles can modestly outperform the best single model while remaining within real-time constraints. We conclude with several observations to guide the development of computationally efficient, domain-robust MOS prediction systems.	en
dc.description.abstract	Automatická predikce mean opinion score (MOS) slibuje rychlejší a levnější alternativu k poslechovým testům, avšak robustní zobecnění napříč mluvčími, jazyky a doménami zůstává významnou výzvou. Tento článek představuje naše návrhy systémů a experimentální výsledky z tříleté účasti v soutěžích VoiceMOS Challenge (2022–2024), které se týkaly predikce MOS pro syntetizovanou nebo hlasově převedenou řeč a zpěv, včetně out-of-domain podmínek a mezi jazyky. Hodnotíme šest neurálních architektur - wav2vec 2.0, QuartzNet, CNN-RNN, LDNet, RawNet3 a HiFi-GAN - a jejich kombinace. Napříč všemi úkoly zjišťujeme, že 1) akustické enkodéry trénované samoučením jsou nejspolehlivějším základním přístupem, 2) kombinace více modelů přináší rychle klesající přínosy, jakmile jsou pokryty komplementární reprezentace, a 3) rozmanitost a vyváženost trénovacích dat převažuje nad architektonickou složitostí. Strategie nediskriminační fúze, která v roce 2022 fungovala dobře, za odlišných podmínek francouzského TTS z roku 2023 degraduje, což zdůrazňuje důležitost validace mimo doménu. Další experimenty ukazují, že pečlivě prořezané ensembly modelů mohou mírně překonat nejlepší jednotlivý model a zároveň se udržet v hranicích zpracování v reálném čase. Závěrem uvádíme několik pozorování, která by měla vést k vývoji výpočetně efektivních a doménově robustních predikčních systémů MOS.	cz
dc.format	23
dc.identifier.document-number	001550816100009
dc.identifier.doi	10.1109/ACCESS.2025.3596644
dc.identifier.issn	2169-3536
dc.identifier.obd	43947269
dc.identifier.orcid	Kunešová, Marie 0000-0002-7187-8481
dc.identifier.orcid	Matoušek, Jindřich 0000-0002-7408-7730
dc.identifier.orcid	Lehečka, Jan 0000-0002-3889-8069
dc.identifier.orcid	Švec, Jan 0000-0001-8362-5927
dc.identifier.orcid	Tihelka, Daniel 0000-0002-3149-2330
dc.identifier.orcid	Hanzlíček, Zdeněk 0000-0002-4001-9289
dc.identifier.uri	http://hdl.handle.net/11025/67493
dc.language.iso	en
dc.project.ID	GA22-27800S
dc.relation.ispartofseries	IEEE Access
dc.rights.access	A
dc.subject	mean opinion score	en
dc.subject	MOS prediction	en
dc.subject	speech quality assessment	en
dc.subject	speech synthesis	en
dc.subject	mean opinion score	cz
dc.subject	predikce MOS	cz
dc.subject	hodnocení kvality řeči	cz
dc.subject	syntéza řeči	cz
dc.title	Three Years of VoiceMOS Challenges: Lessons Learned by the UWB-NTIS-TTS Team	en
dc.title	Tři roky soutěží VoiceMOS: Poznatky získané týmem UWB-NTIS-TTS	cz
dc.type	Article in the WoS database (Jimp)
dc.type	ARTICLE
dc.type.status	Published Version
local.files.count	1	*
local.files.size	4917913	*
local.has.files	yes	*
local.identifier.eid	2-s2.0-105013092681

Files

Original bundle

Name: 10.1109_ACCESS.2025.3596644.pdf
Size: 4.69 MB
Format: Adobe Portable Document Format

Collections

Articles (NTIS)