Is it Possible to Re-educate RoBERTa? Expert-driven Machine Learning for Punctuation Correction.

Machura, Jakub

dc.contributor.author	Machura, Jakub
dc.contributor.author	Žižková, Hana
dc.contributor.author	Frémund, Adam
dc.contributor.author	Švec, Jan
dc.date.accessioned	2025-06-20T08:48:39Z
dc.date.available	2025-06-20T08:48:39Z
dc.date.issued	2023
dc.date.updated	2025-06-20T08:48:39Z
dc.description.abstract	Although Czech rule-based tools for automatic punctuation insertion rely on extensive grammar and achieve respectable precision, pre-trained Transformers outperform rule-based systems in both precision and recall (Machura et al. 2022). The Czech pre-trained RoBERTa model achieves excellent results, yet it still ignores certain phenomena and makes some errors. This paper investigates whether it is possible to retrain the RoBERTa language model to increase the number of sentence commas the model correctly detects. We have chosen a very specific and narrow type of sentence comma, namely the comma delimiting vocative phrases, which is clearly defined in the grammar and is very often omitted by writers. The chosen approaches were further tested and evaluated on different types of texts.	en
dc.description.abstract	Přestože české nástroje pro automatické vkládání interpunkce založené na pravidlech se opírají o rozsáhlou gramatiku a dosahují úctyhodné přesnosti, předtrénované transformátory překonávají systémy založené na pravidlech v přesnosti a odvolání (Machura et al. 2022). Český předtrénovaný model RoBERTa dosahuje výborných výsledků, přesto je určitá úroveň jevů ignorována a model se částečně dopouští chyb. Cílem tohoto článku je prozkoumat, zda je možné přetrénovat jazykový model RoBERTa tak, aby se zvýšil počet vět s čárkami, které model správně detekuje. Vybrali jsme si velmi specifický a úzký typ čárky ve větě, a to čárku ve větě ohraničující vokativní fráze, která je v gramatice jasně definována a je pisateli velmi často opomíjena. Zvolené přístupy jsme dále testovali a vyhodnocovali na různých typech textů.	cz
dc.format	12
dc.identifier.doi	10.2478/jazcas-2023-0052
dc.identifier.issn	0021-5597
dc.identifier.obd	43940897
dc.identifier.orcid	Frémund, Adam 0000-0001-8780-6629
dc.identifier.orcid	Švec, Jan 0000-0001-8362-5927
dc.identifier.uri	http://hdl.handle.net/11025/61224
dc.language.iso	en
dc.project.ID	GA22-27800S
dc.relation.ispartofseries	Journal of Linguistics
dc.rights.access	A
dc.subject	comma	en
dc.subject	Czech	en
dc.subject	vocative	en
dc.subject	machine learning	en
dc.subject	RoBERTa	en
dc.subject	čárka	cz
dc.subject	čeština	cz
dc.subject	vokativ	cz
dc.subject	strojové učení	cz
dc.subject	RoBERTa	cz
dc.title	Is it Possible to Re-educate RoBERTa? Expert-driven Machine Learning for Punctuation Correction.	en
dc.title	Je možné přeučit RoBERTa? Expertně řízené strojové učení pro opravu interpunkce.	cz
dc.type	Article in the Scopus database (Jsc)
dc.type	ARTICLE
dc.type.status	Published Version
local.files.count	1	*
local.files.size	1859095	*
local.has.files	yes	*
local.identifier.eid	2-s2.0-85181757325

Files

Original bundle

Name: Is-it-Possible-to-ReEducate-Roberta-ExpertDriven-Machine-Learning-for-Punctuation-Correction.pdf
Size: 1.77 MB
Format: Adobe Portable Document Format

Collections

Articles (KKY)