Information extraction from the web by matching visual presentation patterns

Minárik, Matej

dc.contributor.author	Minárik, Matej
dc.contributor.author	Burget, Radek
dc.contributor.editor	Steinberger, Josef
dc.contributor.editor	Zíma, Martin
dc.contributor.editor	Fiala, Dalibor
dc.contributor.editor	Dostal, Martin
dc.contributor.editor	Nykl, Michal
dc.date.accessioned	2017-10-09T12:39:34Z
dc.date.available	2017-10-09T12:39:34Z
dc.date.issued	2017
dc.description.abstract-translated	There is a large amount of data available on the Web. Data are often represented as text, enriched with tables, lists, images or other visual structures. These data are usually coded in HTML without any additional semantics, which makes them nigh impossible to automatically process and extract. There are approaches based on top-down document segmentation according to visual information and layout. We present a bottom-up approach which starts with the smallest consistent elements and matches the visual relationships among these elements to a pre-defined ontological structure of extracted records. This method considers not only the visual attributes of a particular segment, but also its position amongst other segments.	en
dc.format	5 s.	cs
dc.format.mimetype	application/pdf
dc.identifier.citation	STEINBERGER, Josef ed.; ZÍMA, Martin ed.; FIALA, Dalibor ed.; DOSTAL, Martin ed.; NYKL, Michal ed. Data a znalosti 2017: sborník konference, Plzeň, Hotel Angelo 5. - 6. října 2017. 1. vyd. Plzeň: Západočeská univerzita v Plzni, 2017, s. 227-231. ISBN 978-80-261-0720-0.	cs
dc.identifier.isbn	978-80-261-0720-0
dc.identifier.uri	http://hdl.handle.net/11025/26368
dc.language.iso	en	en
dc.publisher	Západočeská univerzita v Plzni	cs
dc.rights	© Západočeská univerzita v Plzni	cs
dc.rights.access	openAccess	en
dc.subject	integrace webových dat	cs
dc.subject	extrakce informací	cs
dc.subject	strukturovaná extrakce záznamů	cs
dc.subject	segmentace stránek	cs
dc.subject	klasifikace obsahu	cs
dc.subject	mapování ontologií	cs
dc.subject.translated	web data integration	en
dc.subject.translated	information extraction	en
dc.subject.translated	structured record extraction	en
dc.subject.translated	page segmentation	en
dc.subject.translated	content classification	en
dc.subject.translated	ontology mapping	en
dc.title	Information extraction from the web by matching visual presentation patterns	en
dc.type	konferenční příspěvek	cs
dc.type	conferenceObject	en
dc.type.status	Peer-reviewed	en
dc.type.version	publishedVersion	en

Files

Original bundle

Name: Minarik.pdf
Size: 376.7 KB
Format: Adobe Portable Document Format
Description: Full text

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description:

Collections

Data a znalosti 2017