Multi-label Classification and Named Entity Recognition for Historical Documents
Date issued
2025
Publisher
Springer
Abstract
In this paper, we present improvements to our processing pipeline for historical document digitization. We extend the original pipeline with two new functions: page labeling and named entity recognition. We treat page labeling as a multi-label classification task and adopt the Query2Label approach for it. Tested on our internal NKVD dataset, Query2Label reaches a mean average precision of 80.03% on the test set. For named entity recognition, we use pre-trained transformer-based models from DeepPavlov and benchmark them on two entity types: person names and locations. The best model achieves promising results despite not having been trained on our data at all.
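As an illustration of the metric reported in the abstract, the following minimal sketch computes mean average precision for a multi-label task by averaging the per-label average precision. It assumes the common class-wise definition of the metric; the record does not state which variant the authors used. The function names and the toy page labels and scores are hypothetical and are not taken from the NKVD dataset.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for one label.

    Pages are ranked by descending score; the result is the mean of
    precision@k over every rank k at which a positive page appears.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    # A label with no positive pages contributes 0.0 (an assumption of this sketch).
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    y_true: Sequence[Sequence[int]], y_score: Sequence[Sequence[float]]
) -> float:
    """Mean of the per-label average precision (class-wise mAP)."""
    n_labels = len(y_true[0])
    per_label = []
    for j in range(n_labels):
        column_true = [row[j] for row in y_true]
        column_score = [row[j] for row in y_score]
        per_label.append(average_precision(column_true, column_score))
    return sum(per_label) / n_labels


# Hypothetical toy data: 4 pages, 3 labels (rows are pages, columns are labels).
y_true = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
y_score = [
    [0.9, 0.2, 0.4],
    [0.8, 0.7, 0.1],
    [0.3, 0.6, 0.2],
    [0.1, 0.9, 0.8],
]

map_score = mean_average_precision(y_true, y_score)
print(f"mAP = {map_score:.4f}")
```

Each label is scored independently, so a page may carry any number of labels, which is what distinguishes this setting from single-label classification.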
Subject(s)
multi-label classification, named entity recognition, historical documents