Improving Handwritten Cyrillic OCR by Font-based Synthetic Text Generator

dc.contributor.author: Gruber, Ivan
dc.contributor.author: Picek, Lukáš
dc.contributor.author: Hlaváč, Miroslav
dc.contributor.author: Neduchal, Petr
dc.contributor.author: Hrúz, Marek
dc.date.accessioned: 2025-06-20T08:55:11Z
dc.date.available: 2025-06-20T08:55:11Z
dc.date.issued: 2023
dc.date.updated: 2025-06-20T08:55:11Z
dc.description.abstract: In this paper, we propose a straightforward and effective Font-based Synthetic Text Generator (FbSTG) that reduces the need for annotated data in handwritten text recognition, and not only for Cyrillic. Unlike standard GAN-based methods, the FbSTG does not have to be trained to learn new characters and styles; all it needs is the fonts, the text, and sampled page backgrounds. To show the benefits of the proposed method, we train and test two OCR systems (Tesseract and TrOCR) on the Handwritten Kazakh and Russian dataset (HKR), both with and without synthetic data. In addition, we evaluate both systems on a private NKVD dataset of historical documents from Ukraine with a large number of out-of-vocabulary (OoV) words, an extremely challenging task for current state-of-the-art methods. Adding the synthetic data significantly decreased the character error rate (CER) and word error rate (WER) of the TrOCR-Base-384 model on both datasets. More precisely, the relative error reduction in CER / WER was (i) around 20% / 10% on HKR-Test1 with OoV samples, and (ii) 24% / 8% on the NKVD dataset. The FbSTG code is available at: https://github.com/mhlzcu/doc_gen. (en)
dc.format: 14
dc.identifier.doi: 10.1007/978-3-031-50320-7_8
dc.identifier.isbn: 978-3-031-50319-1
dc.identifier.issn: 0302-9743
dc.identifier.obd: 43940587
dc.identifier.orcid: Gruber, Ivan 0000-0003-2333-433X
dc.identifier.orcid: Picek, Lukáš 0000-0002-6041-9722
dc.identifier.orcid: Hlaváč, Miroslav 0000-0003-1172-930X
dc.identifier.orcid: Neduchal, Petr 0000-0001-5788-604X
dc.identifier.orcid: Hrúz, Marek 0000-0002-7851-9879
dc.identifier.uri: http://hdl.handle.net/11025/61557
dc.language.iso: en
dc.project.ID: DG20P02OVV018
dc.project.ID: LM2023062
dc.project.ID: 90042
dc.publisher: Springer
dc.relation.ispartofseries: 6th International Conference on the Dynamics of Information Systems (DIS 2023)
dc.subject: handwritten optical character recognition (en)
dc.subject: Cyrillic (en)
dc.subject: handwritten text generation (en)
dc.subject: synthetic data (en)
dc.subject: Tesseract (en)
dc.subject: TrOCR (en)
dc.subject: out-of-vocabulary (en)
dc.title: Improving Handwritten Cyrillic OCR by Font-based Synthetic Text Generator (en)
dc.type: Paper in proceedings (D)
dc.type: PAPER IN PROCEEDINGS
dc.type.status: Published Version
local.files.count: 1 (*)
local.files.size: 2778038 (*)
local.has.files: yes (*)
local.identifier.eid: 2-s2.0-85181979766

Files

Original bundle
Name: Improving Handwritten Cyrillic OCR.pdf
Size: 2.65 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission