Czech news dataset for semantic textual similarity

Sido, Jakub

dc.contributor.author	Sido, Jakub
dc.contributor.author	Seják, Michal
dc.contributor.author	Pražák, Ondřej
dc.contributor.author	Konopík, Miloslav
dc.contributor.author	Moravec, Václav
dc.date.accessioned	2026-03-24T19:05:41Z
dc.date.available	2026-03-24T19:05:41Z
dc.date.issued	2025
dc.date.updated	2026-03-24T19:05:41Z
dc.description.abstract	This paper describes a novel dataset consisting of sentences with two different semantic similarity annotations: with and without surrounding context. The data originate from the journalistic domain in the Czech language. The final dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute the final annotations as an average of 9 individual annotation scores. We evaluate the dataset quality by measuring inter- and intra-annotator agreement. Besides agreement numbers, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Due to the massive number of training annotations (116,956), the model significantly outperforms an average annotator (0.92 versus 0.86 of Pearson’s correlation coefficient).	en
dc.description.abstract	Tento článek popisuje nový soubor dat, který se skládá z vět se dvěma různými anotacemi sémantické podobnosti: s okolním kontextem a bez něj. Data pocházejí z publicistické oblasti v českém jazyce. Výsledná datová sada obsahuje 138 556 lidských anotací rozdělených do trénovací a testovací množiny. Na tvorbě se podílelo celkem 485 studentů žurnalistiky. Pro zvýšení spolehlivosti testovací sady jsme výsledné anotace vypočítali jako průměr 9 individuálních anotačních skóre. Kvalitu datové sady hodnotíme měřením shody mezi jednotlivými anotátory a mezi anotátory navzájem. Kromě čísel shody uvádíme podrobné statistiky shromážděné datové sady. V závěru našeho příspěvku uvádíme základní experiment sestavení systému pro předpovídání sémantické podobnosti vět. Díky obrovskému počtu tréninkových anotací (116 956) model výrazně překonává průměrného anotátora (0,92 oproti 0,86 Pearsonova korelačního koeficientu).	cz
dc.format	18
dc.identifier.document-number	001371498800001
dc.identifier.doi	10.1007/s10579-024-09795-z
dc.identifier.issn	1574-020X
dc.identifier.obd	43944861
dc.identifier.orcid	Sido, Jakub 0000-0002-7709-7512
dc.identifier.orcid	Seják, Michal 0009-0008-0365-898X
dc.identifier.orcid	Pražák, Ondřej 0000-0001-5445-7792
dc.identifier.orcid	Konopík, Miloslav 0000-0001-7397-1658
dc.identifier.orcid	Moravec, Václav 0000-0002-3349-0785
dc.identifier.uri	http://hdl.handle.net/11025/67366
dc.language.iso	en
dc.project.ID	SGS-2022-016
dc.relation.ispartofseries	Language Resources and Evaluation
dc.rights.access	C
dc.subject	semantics	en
dc.subject	context	en
dc.subject	dataset	en
dc.subject	human annotation	en
dc.subject	sémantika	cz
dc.subject	kontext	cz
dc.subject	dataset	cz
dc.subject	lidské anotace	cz
dc.title	Czech news dataset for semantic textual similarity	en
dc.title	Český dataset pro sémantickou podobnost textu	cz
dc.type	Article in the WoS database (Jimp)
dc.type	ARTICLE
dc.type.status	Published Version
local.files.count	1	*
local.files.size	1344376	*
local.has.files	yes	*
local.identifier.eid	2-s2.0-85211814751

Files

Original bundle

Name: s10579-024-09795-z.pdf
Size: 1.28 MB
Format: Adobe Portable Document Format

Collections

Articles (KIV)