Improving Word meaning representations using Wikipedia categories

Svoboda, Lukáš

dc.contributor.author	Svoboda, Lukáš
dc.contributor.author	Brychcín, Tomáš
dc.date.accessioned	2019-06-10T10:00:09Z
dc.date.available	2019-06-10T10:00:09Z
dc.date.issued	2018
dc.description.abstract	V tomto článku prezentujeme metody Skip-gram a CBOW pro extrakci reprezentace významu slov rozšířené o globální informaci. Využíváme vlastní korpus, který včetně globální informace generujeme z Wikipedie, kde jsou články organizovány hierarchicky dle kategorií. Tyto kategorie poskytují dodatečné a velmi užitečné informace (popis) o každém článku. Představujeme čtyři nové modely, jak obohatit reprezentaci slovních významů s využitím globální informace. Experimentujeme s anglickou Wikipedií a testujeme naše modely na standardních datových souborech podobnosti slov a korpusu slovních analogií. Navržené modely výrazně překonávají standardní metody reprezentace slov, zejména při trénování na velikostně podobných korpusech a poskytují podobné výsledky ve srovnání s metodami trénovanými na mnohem větších souborech dat. Náš nový přístup ukazuje, že zvyšování množství trénovacích dat nemusí zvyšovat kvalitu reprezentace významu slov tolik, jako je trénování s využitím globální informace, nebo jak se ukazuje u nových přístupů, které pracují s vnitřní informací daného slova na bázi jednotlivých znaků (fastText).	cs
dc.description.abstract-translated	In this paper we extend the Skip-gram and Continuous Bag-of-Words distributional word representation models with global context information. We use a corpus extracted from Wikipedia, where articles are organized in a hierarchy of categories. These categories provide useful topical information about each article. We present four new approaches to enriching word meaning representations with such information. We experiment with the English Wikipedia and evaluate our models on standard word similarity and word analogy datasets. The proposed models significantly outperform other word representation methods when training data of similar size is used, and they perform comparably to methods trained on much larger datasets. Our approach shows that increasing the amount of unlabelled data does not necessarily improve the performance of word embeddings as much as introducing global or sub-word information, especially when training time is taken into consideration.	en
dc.format	12 s.	cs
dc.format.mimetype	application/pdf
dc.identifier.citation	SVOBODA, L., BRYCHCÍN, T. Improving Word meaning representations using Wikipedia categories. Neural Network World, 2018, roč. 28, č. 6, s. 523-534. ISSN 1210-0552.	en
dc.identifier.doi	10.14311/NNW.2018.28.029
dc.identifier.issn	1210-0552
dc.identifier.obd	43926048
dc.identifier.uri	2-s2.0-85061489302
dc.identifier.uri	http://hdl.handle.net/11025/34807
dc.language.iso	en	en
dc.project.ID	SGS-2016-018/Datové a softwarové inženýrství pro komplexní aplikace	cs
dc.publisher	Institute of Computer Science	en
dc.rights	© Institute of Computer Science	en
dc.rights.access	openAccess	en
dc.subject	distribuční sémantika	cs
dc.subject	vylepšení word2vec	cs
dc.subject	vnořená slova	cs
dc.subject	globální informace	cs
dc.subject	wikipedia	cs
dc.subject	CBOW	cs
dc.subject	Skip-gram	cs
dc.subject	číselná reprezentace slov	cs
dc.subject.translated	Word2vec	en
dc.subject.translated	Skip-gram	en
dc.subject.translated	CBOW	en
dc.subject.translated	improving distributional word representation	en
dc.subject.translated	using global information	en
dc.subject.translated	new approach	en
dc.title	Improving Word meaning representations using Wikipedia categories	en
dc.title.alternative	Vylepšení reprezentace slovních vektorů s využitím kategorií z Wikipedie	cs
dc.type	článek	cs
dc.type	article	en
dc.type.status	Peer-reviewed	en
dc.type.version	publishedVersion	en

Collections

OBD
Articles (KIV)
