Comparison of wav2vec 2.0 models on three speech processing tasks

Kunešová, Marie

dc.contributor.author	Kunešová, Marie
dc.contributor.author	Zajíc, Zbyněk
dc.contributor.author	Šmídl, Luboš
dc.contributor.author	Karafiát, Martin
dc.date.accessioned	2025-06-20T08:26:45Z
dc.date.available	2025-06-20T08:26:45Z
dc.date.issued	2024
dc.date.updated	2025-06-20T08:26:45Z
dc.description.abstract	The current state-of-the-art for various speech processing problems is a sequence-to-sequence model based on a self-attention mechanism known as transformer. The widely used wav2vec 2.0 is a self-supervised transformer model pre-trained on large amounts of unlabeled speech and then fine-tuned for a specific task. The data used for training and fine-tuning, along with the size of the transformer model, play a crucial role in both of these training steps. The most commonly used wav2vec 2.0 models are trained on relatively “clean” data from sources such as the LibriSpeech dataset, but we can expect there to be a benefit in using more realistic data gathered from a variety of acoustic conditions. However, it is not entirely clear how big the difference would be. Investigating this is the main goal of our article. To this end, we utilize wav2vec 2.0 models in three fundamental speech processing tasks: speaker change detection, voice activity detection, and overlapped speech detection, and test them on four real conversation datasets. We compare four wav2vec 2.0 models with different sizes and different data used for pre-training, and we fine-tune them either on in-domain data from the same dataset or on artificial training data created from the LibriSpeech corpus. Our results suggest that richer data that are more similar to the task domain bring better performance than a larger model.	en
dc.description.abstract	Současným nejmodernějším přístupem k řešení různých úloh zpracování řeči je "sequence-to-sequence" model založený na mechanismu self-attention, známý jako transformer. Široce používaný wav2vec 2.0 je samoučící se transformerový model, který je předtrénován na velkém množství neoznačených řečových dat a následně doladěn pro konkrétní úlohu. Data použitá pro trénování a doladění, spolu s velikostí transformerového modelu, hrají zásadní roli v obou těchto fázích trénování. Nejčastěji používané modely wav2vec 2.0 jsou trénovány na relativně „čistých“ datech, například z datasetu LibriSpeech, avšak lze očekávat, že použití realističtějších dat nahraných za různých akustických podmínek by mohlo přinést výhody. Není však zcela jasné, jak velký rozdíl toto přinese. Zkoumání této otázky je proto hlavním cílem našeho článku. Za tímto účelem využíváme modely wav2vec 2.0 ve třech základních úlohách zpracování řeči: detekce změny řečníka, detekce řečové aktivity a detekce překrývající se řeči, a testujeme je na čtyřech reálných datasetech konverzační řeči. Srovnáváme čtyři modely wav2vec 2.0 o různých velikostech a s různými daty použitými pro předtrénování a ladíme je buď na "in-domain" datech ze stejného datasetu, nebo na uměle vytvořených trénovacích datech z korpusu LibriSpeech. Naše výsledky naznačují, že bohatší data, která jsou více podobná doméně dané úlohy, přinášejí lepší výsledky než větší model.	cz
dc.format	13
dc.identifier.doi	10.1007/s10772-024-10140-6
dc.identifier.issn	1381-2416
dc.identifier.obd	43943869
dc.identifier.orcid	Kunešová, Marie 0000-0002-7187-8481
dc.identifier.orcid	Zajíc, Zbyněk 0000-0002-4153-6560
dc.identifier.orcid	Šmídl, Luboš 0000-0002-8169-2410
dc.identifier.orcid	Karafiát, Martin 0000-0001-6474-8366
dc.identifier.uri	http://hdl.handle.net/11025/59782
dc.language.iso	en
dc.project.ID	VJ01010108
dc.relation.ispartofseries	International Journal of Speech Technology
dc.rights.access	A
dc.subject	speaker change detection	en
dc.subject	voice activity detection	en
dc.subject	overlapped speech detection	en
dc.subject	wav2vec 2.0	en
dc.subject	detekce změny řečníka	cz
dc.subject	detekce řečové aktivity	cz
dc.subject	detekce překrývající se řeči	cz
dc.subject	wav2vec 2.0	cz
dc.title	Comparison of wav2vec 2.0 models on three speech processing tasks	en
dc.title	Srovnání modelů wav2vec 2.0 na třech úlohách zpracování řeči	cz
dc.type	Article in the Scopus database (Jsc)
dc.type	ARTICLE
dc.type.status	Published Version
local.files.count	1	*
local.files.size	1225018	*
local.has.files	yes	*
local.identifier.eid	2-s2.0-85206375991

Files

Original bundle

Name: s10772-024-10140-6.pdf
Size: 1.17 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission

Collections

Articles (KKY)