Automatic data generation of incorrect image-text pairs for effective contrastive learning of CLIP model

dc.contributor.authorTagami, Rina
dc.contributor.authorKobayashi, Hiroki
dc.contributor.authorAkizuki, Shuichi
dc.contributor.authorHashimoto, Manabu
dc.contributor.editorSkala, Václav
dc.date.accessioned2024-07-28T18:41:49Z
dc.date.available2024-07-28T18:41:49Z
dc.date.issued2024
dc.description.abstract-translatedIn this study, we propose a method for automatically generating high-quality CLIP (Contrastive Language-Image Pre-training) training data to improve the performance of text-based image retrieval with CLIP. In general, two types of image-text pairs are used to train CLIP: correct pairs and incorrect pairs. Correct pairs, in which the image and the text match in content, are collected by scraping or similar methods. Incorrect pairs, in which the image and the text do not match, are created by recombining the correct pairs. CLIP is trained contrastively to increase the image-text similarity of correct pairs and to decrease that of incorrect pairs. However, when the training data contains multiple images that are similar to one another, the texts attached to them are also likely to be similar, so pairs produced by recombining them should preferably be treated as correct, yet they are treated as incorrect. In other words, incorrect pairs whose image and text are in fact highly related are learned as having low relevance, and this inconsistency degrades the CLIP model. Conversely, if two images taken from the training data are dissimilar, the texts assigned to them should also have low similarity, so a highly reliable incorrect pair can be created by exchanging their texts. We applied this idea to the results of clustering the images and the texts of the training data separately, used the similarity between clusters to generate incorrect pairs, and trained the model so that the negative effect grows as the similarity between images decreases (see the sketch after this record). In an experiment on the Amazon review dataset, which is commonly used in this field, the proposed method improved the Rank@1 score by 21.0% compared to vanilla CLIP.en
dc.format10 p.en
dc.format.mimetypeapplication/pdf
dc.identifier.citationWSCG 2024: Full Papers Proceedings: 32nd International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision, p. 187-196.en
dc.identifier.doihttps://doi.org/10.24132/CSRN.3401.20
dc.identifier.issn2464-4625 (online)
dc.identifier.issn2464-4617 (print)
dc.identifier.urihttp://hdl.handle.net/11025/57391
dc.language.isoenen
dc.publisherVáclav Skala - UNION Agencyen
dc.rights© Václav Skala - UNION Agencyen
dc.rights.accessopenAccessen
dc.subjectlarge language modelsen
dc.subjectimage retrievalen
dc.subjectimage-text dataseten
dc.subjectCLIPen
dc.subjectcontrastive learningen
dc.subjectk-means clusteringen
dc.subject.translatedlarge language modelsen
dc.subject.translatedimage retrievalen
dc.subject.translatedimage-text dataseten
dc.subject.translatedCLIPen
dc.subject.translatedcontrastive learningen
dc.subject.translatedk-Means Clusteringen
dc.titleAutomatic data generation of incorrect image-text pairs for effective contrastive learning of CLIP modelen
dc.typeconference paperen
dc.typeconferenceObjecten
dc.type.statusPeer revieweden
dc.type.versionpublishedVersionen
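
The pair-generation idea described in the abstract can be sketched in a few lines of Python. The code below is a hypothetical reconstruction, not the authors' implementation: the embedding dimension, the number of clusters, the dissimilarity threshold, and the weighting w = 1 - similarity are all assumptions, and random unit vectors stand in for real CLIP image and text embeddings.

# Minimal sketch of the abstract's idea (all parameters are assumptions).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Random vectors stand in for CLIP image/text embeddings of N training pairs.
N, D, K = 200, 64, 8
img_emb = rng.normal(size=(N, D))
txt_emb = rng.normal(size=(N, D))
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
txt_emb /= np.linalg.norm(txt_emb, axis=1, keepdims=True)

# Cluster images and texts separately with k-means, as in the abstract.
img_km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(img_emb)
txt_km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(txt_emb)

def centroid_sims(km):
    """Cosine similarity matrix between cluster centroids."""
    c = km.cluster_centers_
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return c @ c.T

img_sim = centroid_sims(img_km)  # K x K image-cluster similarities
txt_sim = centroid_sims(txt_km)  # K x K text-cluster similarities

def make_incorrect_pairs(threshold=0.2):
    """Swap texts between samples whose image AND text clusters are dissimilar.

    Returns tuples (i, j, w): sample i keeps its image and takes sample j's
    text; w = 1 - image-cluster similarity, so the negative term is weighted
    more strongly the less similar the two images are (hypothetical scheme).
    """
    pairs = []
    for i in range(N):
        li, ti = img_km.labels_[i], txt_km.labels_[i]
        # Candidate samples j whose clusters are dissimilar to i's on both sides.
        ok = (img_sim[li, img_km.labels_] < threshold) & \
             (txt_sim[ti, txt_km.labels_] < threshold)
        cand = np.flatnonzero(ok)
        if cand.size == 0:
            continue
        j = int(rng.choice(cand))
        w = 1.0 - img_sim[li, img_km.labels_[j]]
        pairs.append((i, j, w))
    return pairs

negatives = make_incorrect_pairs()
print(f"generated {len(negatives)} weighted incorrect pairs")

The design point the abstract argues for is visible here: texts are swapped only between samples whose image clusters and text clusters are both mutually dissimilar, so the generated incorrect pairs are unlikely to be accidentally compatible, and the weight w penalizes the most dissimilar swaps most strongly.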

Files

Original bundle
Name: C02-2024.pdf
Size: 7.26 MB
Format: Adobe Portable Document Format
Description: Full text
License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission