Information extraction from the web by matching visual presentation patterns

dc.contributor.author: Minárik, Matej
dc.contributor.author: Burget, Radek
dc.contributor.editor: Steinberger, Josef
dc.contributor.editor: Zíma, Martin
dc.contributor.editor: Fiala, Dalibor
dc.contributor.editor: Dostal, Martin
dc.contributor.editor: Nykl, Michal
dc.date.accessioned: 2017-10-09T12:39:34Z
dc.date.available: 2017-10-09T12:39:34Z
dc.date.issued: 2017
dc.description.abstract-translated: There is a large amount of data available on the Web. Data are often represented as text, enriched with tables, lists, images or other visual structures. These data are usually coded in HTML without any additional semantics, which makes them nigh impossible to automatically process and extract. There are approaches based on top-down document segmentation according to visual information and layout. We present a bottom-up approach which starts with the smallest consistent elements and matches the visual relationships among these elements to a pre-defined ontological structure of extracted records. This method considers not only the visual attributes of a particular segment, but also its position amongst other segments. [en]
dc.format: 5 s. [cs]
dc.format.mimetype: application/pdf
dc.identifier.citation: STEINBERGER, Josef ed.; ZÍMA, Martin ed.; FIALA, Dalibor ed.; DOSTAL, Martin ed.; NYKL, Michal ed. Data a znalosti 2017: sborník konference, Plzeň, Hotel Angelo 5. - 6. října 2017. 1. vyd. Plzeň: Západočeská univerzita v Plzni, 2017, s. 227-231. ISBN 978-80-261-0720-0. [cs]
dc.identifier.isbn: 978-80-261-0720-0
dc.identifier.uri: https://www.zcu.cz/export/sites/zcu/pracoviste/vyd/online/DataAZnalosti2017.pdf
dc.identifier.uri: http://hdl.handle.net/11025/26368
dc.language.iso: en [en]
dc.publisher: Západočeská univerzita v Plzni [cs]
dc.rights: © Západočeská univerzita v Plzni [cs]
dc.rights.access: openAccess [en]
dc.subject: integrace webových dat [cs]
dc.subject: extrakce informací [cs]
dc.subject: strukturovaná extrakce záznamů [cs]
dc.subject: segmentace stránek [cs]
dc.subject: klasifikace obsahu [cs]
dc.subject: mapování ontologií [cs]
dc.subject.translated: web data integration [en]
dc.subject.translated: information extraction [en]
dc.subject.translated: structured record extraction [en]
dc.subject.translated: page segmentation [en]
dc.subject.translated: content classification [en]
dc.subject.translated: ontology mapping [en]
dc.title: Information extraction from the web by matching visual presentation patterns [en]
dc.type: konferenční příspěvek [cs]
dc.type: conferenceObject [en]
dc.type.status: Peer-reviewed [en]
dc.type.version: publishedVersion [en]

Files

Original bundle
Name: Minarik.pdf
Size: 376.7 KB
Format: Adobe Portable Document Format
Description: Full text
License bundle
Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission