Improving Handwritten Cyrillic OCR by Font-based Synthetic Text Generator

Gruber, Ivan

dc.contributor.author	Gruber, Ivan
dc.contributor.author	Picek, Lukáš
dc.contributor.author	Hlaváč, Miroslav
dc.contributor.author	Neduchal, Petr
dc.contributor.author	Hrúz, Marek
dc.date.accessioned	2025-06-20T08:55:11Z
dc.date.available	2025-06-20T08:55:11Z
dc.date.issued	2023
dc.date.updated	2025-06-20T08:55:11Z
dc.description.abstract	In this paper, we propose a straightforward and effective Font-based Synthetic Text Generator (FbSTG) to alleviate the need for annotated data in handwritten text recognition, not only for Cyrillic. Unlike standard GAN-based methods, the FbSTG does not have to be trained to learn new characters and styles; all it needs is the fonts, the text, and sampled page backgrounds. To show the benefits of the proposed method, we train and test two different OCR systems (Tesseract and TrOCR) on the Handwritten Kazakh and Russian dataset (HKR), both with and without synthetic data. In addition, we evaluate both systems' performance on a private NKVD dataset containing historical documents from Ukraine with a large number of out-of-vocabulary (OoV) words, which represents an extremely challenging task for current state-of-the-art methods. Adding the synthetic data significantly decreased the CER and WER of the TrOCR-Base-384 model on both datasets. More precisely, the relative reduction in CER / WER was (i) around 20% / 10% on HKR-Test1 with OoV samples, and (ii) 24% / 8% on the NKVD dataset. The FbSTG code is available at: https://github.com/mhlzcu/doc_gen.	en
dc.format	14
dc.identifier.doi	10.1007/978-3-031-50320-7_8
dc.identifier.isbn	978-3-031-50319-1
dc.identifier.issn	0302-9743
dc.identifier.obd	43940587
dc.identifier.orcid	Gruber, Ivan 0000-0003-2333-433X
dc.identifier.orcid	Picek, Lukáš 0000-0002-6041-9722
dc.identifier.orcid	Hlaváč, Miroslav 0000-0003-1172-930X
dc.identifier.orcid	Neduchal, Petr 0000-0001-5788-604X
dc.identifier.orcid	Hrúz, Marek 0000-0002-7851-9879
dc.identifier.uri	http://hdl.handle.net/11025/61557
dc.language.iso	en
dc.project.ID	DG20P02OVV018
dc.project.ID	LM2023062
dc.project.ID	90042
dc.publisher	Springer
dc.relation.ispartofseries	6th International Conference on the Dynamics of Information Systems (DIS 2023)
dc.subject	handwritten optical character recognition	en
dc.subject	Cyrillic	en
dc.subject	handwritten text generation	en
dc.subject	synthetic data	en
dc.subject	Tesseract	en
dc.subject	TrOCR	en
dc.subject	out-of-vocabulary	en
dc.title	Improving Handwritten Cyrillic OCR by Font-based Synthetic Text Generator	en
dc.type	Conference paper (D)
dc.type	CONFERENCE PAPER
dc.type.status	Published Version
local.files.count	1	*
local.files.size	2778038	*
local.has.files	yes	*
local.identifier.eid	2-s2.0-85181979766

Files

Original bundle

Name: Improving Handwritten Cyrillic OCR.pdf
Size: 2.65 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed to upon submission

Collections

Conference papers (NTIS)