Exploring the Relationship between Dataset Size and Image Captioning Model Performance

dc.contributor.author: Železný, Tomáš
dc.contributor.author: Hrúz, Marek
dc.date.accessioned: 2025-06-20T08:28:58Z
dc.date.available: 2025-06-20T08:28:58Z
dc.date.issued: 2023
dc.date.updated: 2025-06-20T08:28:58Z
dc.description.abstract: Image captioning is a deep learning task that combines computer vision methods, which extract visual information from the image, with natural language processing, which generates the resulting caption in natural language. Image captioning models, like other deep learning models, need a large amount of training data and take a long time to train. In this work, we investigate the impact of using a smaller amount of training data on the performance of the standard image captioning model Oscar. We train Oscar on different sizes of the training dataset and measure its performance in terms of accuracy and computational complexity. We observe that the computational time increases linearly with the amount of training data. However, the accuracy does not follow this linear trend, and the relative improvement diminishes as we add more data to the training. We also measure the consistency across individual training-set sizes and observe that the more data we use for training, the more consistent the metrics are. In addition to traditional evaluation metrics, we evaluate the performance using CLIP similarity. We investigate whether it can be used as a fully fledged metric providing a unique advantage over the traditional metrics: it does not need reference captions acquired from human annotators. Our results show a high correlation between CLIP and the other metrics. This work provides valuable insights into the requirements for training effective image captioning models. We believe our results can be transferred to other models, and even to other deep learning tasks. © 2023 Copyright for this paper by its authors.
dc.format: 8
dc.identifier.isbn: not stated
dc.identifier.issn: 1613-0073
dc.identifier.obd: 43939139
dc.identifier.orcid: Železný, Tomáš 0000-0002-0974-7069
dc.identifier.orcid: Hrúz, Marek 0000-0002-7851-9879
dc.identifier.uri: http://hdl.handle.net/11025/59946
dc.language.iso: en
dc.project.ID: SGS-2022-017
dc.project.ID: 90140
dc.project.ID: 90104
dc.publisher: CEUR-WS
dc.relation.ispartofseries: 26th Computer Vision Winter Workshop, CVWW 2023
dc.subject: computer vision
dc.subject: data size analysis
dc.subject: deep learning
dc.subject: image captioning
dc.subject: machine learning
dc.title: Exploring the Relationship between Dataset Size and Image Captioning Model Performance
dc.type: Conference paper (D)
dc.type: Conference paper
dc.type.status: Published Version
local.files.count: 1
local.files.size: 1742344
local.has.files: yes
local.identifier.eid: 2-s2.0-85149375642
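
The abstract evaluates generated captions with CLIP similarity as a reference-free alternative to traditional metrics. The snippet below is a minimal illustrative sketch of that idea, not the authors' implementation: it assumes the Hugging Face transformers CLIP model ("openai/clip-vit-base-patch32") and scores a caption by the cosine similarity between the CLIP embeddings of the image and the caption; the file name in the usage line is hypothetical.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Publicly available CLIP checkpoint; the paper's exact CLIP variant is an assumption here.
MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def clip_similarity(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP embeddings of an image and a candidate caption."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # Normalise both embeddings so their dot product equals cosine similarity.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

# Usage (hypothetical file name); note that no human-written reference captions are needed:
# print(clip_similarity("example.jpg", "a dog playing with a ball in a park"))

Because the score depends only on the image and the candidate caption, it avoids the human-annotated references required by traditional metrics, which is the advantage the abstract highlights.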

Files

Original bundle
Name: Zelezny_Hruz_CEUR_2023.pdf
Size: 1.66 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon at submission