Multi-label Classification and Named Entity Recognition for Historical Documents

dc.contributor.author: Gruber, Ivan
dc.contributor.author: Hlaváč, Miroslav
dc.contributor.author: Neduchal, Petr
dc.contributor.author: Hrúz, Marek
dc.date.accessioned: 2026-03-30T18:05:35Z
dc.date.available: 2026-03-30T18:05:35Z
dc.date.issued: 2025
dc.date.updated: 2026-03-30T18:05:35Z
dc.description.abstract: In this paper, we present improvements to our processing pipeline for historical document digitization. The original pipeline is extended with two new functionalities: page labeling and named entity recognition. We handle page labeling as a multi-label classification task, for which we choose the Query2Label approach. Query2Label is tested on our internal NKVD dataset and reaches a mean average precision of 80.03% on the test set. For the named entity recognition task, we utilize pre-trained transformer-based DeepPavlov models and benchmark them on two entity types: person name and location. The best model reaches promising results despite not being trained on our data at all. (en)
dc.format: 11
dc.identifier.document-number: 001534826000002
dc.identifier.doi: 10.1007/978-3-031-81010-7_2
dc.identifier.isbn: 978-3-031-81009-1
dc.identifier.issn: 0302-9743
dc.identifier.obd: 43944197
dc.identifier.orcid: Gruber, Ivan 0000-0003-2333-433X
dc.identifier.orcid: Hlaváč, Miroslav 0000-0003-1172-930X
dc.identifier.orcid: Neduchal, Petr 0000-0001-5788-604X
dc.identifier.orcid: Hrúz, Marek 0000-0002-7851-9879
dc.identifier.uri: http://hdl.handle.net/11025/67464
dc.language.iso: en
dc.project.ID: DH23P03OVV073
dc.publisher: Springer
dc.relation.ispartofseries: 7th International Conference on the Dynamics of Information Systems, DIS 2024
dc.subject: multi-label classification (en)
dc.subject: named entity recognition (en)
dc.subject: historical documents (en)
dc.title: Multi-label Classification and Named Entity Recognition for Historical Documents (en)
dc.type: Conference proceedings paper (D)
dc.type: CONFERENCE PROCEEDINGS PAPER
dc.type.status: Published Version
local.files.count: 1 (*)
local.files.size: 964848 (*)
local.has.files: yes (*)
local.identifier.eid: 2-s2.0-105031157129

Files

Original bundle
Name: gruber_Multi-label Classification.pdf
Size: 942.23 KB
Format: Adobe Portable Document Format