Information extraction from the web by matching visual presentation patterns
dc.contributor.author | Minárik, Matej | |
dc.contributor.author | Burget, Radek | |
dc.contributor.editor | Steinberger, Josef | |
dc.contributor.editor | Zíma, Martin | |
dc.contributor.editor | Fiala, Dalibor | |
dc.contributor.editor | Dostal, Martin | |
dc.contributor.editor | Nykl, Michal | |
dc.date.accessioned | 2017-10-09T12:39:34Z | |
dc.date.available | 2017-10-09T12:39:34Z | |
dc.date.issued | 2017 | |
dc.description.abstract-translated | There is a large amount of data available on the Web. Data are often represented as text, enriched with tables, lists, images or other visual structures. These data are usually coded in HTML without any additional semantics, which makes them nigh impossible to automatically process and extract. There are approaches based on top-down document segmentation according to visual information and layout. We present a bottom-up approach which starts with the smallest consistent elements and matches the visual relationships among these elements to a pre-defined ontological structure of extracted records. This method considers not only the visual attributes of a particular segment, but also its position amongst other segments. | en |
dc.format | 5 s. | cs |
dc.format.mimetype | application/pdf | |
dc.identifier.citation | STEINBERGER, Josef ed.; ZÍMA, Martin ed.; FIALA, Dalibor ed.; DOSTAL, Martin ed.; NYKL, Michal ed. Data a znalosti 2017: sborník konference, Plzeň, Hotel Angelo 5. - 6. října 2017. 1. vyd. Plzeň: Západočeská univerzita v Plzni, 2017, s. 227-231. ISBN 978-80-261-0720-0. | cs |
dc.identifier.isbn | 978-80-261-0720-0 | |
dc.identifier.uri | https://www.zcu.cz/export/sites/zcu/pracoviste/vyd/online/DataAZnalosti2017.pdf | |
dc.identifier.uri | http://hdl.handle.net/11025/26368 | |
dc.language.iso | en | en |
dc.publisher | Západočeská univerzita v Plzni | cs |
dc.rights | © Západočeská univerzita v Plzni | cs |
dc.rights.access | openAccess | en |
dc.subject | integrace webových dat | cs |
dc.subject | extrakce informací | cs |
dc.subject | strukturovaná extrakce záznamů | cs |
dc.subject | segmentace stránek | cs |
dc.subject | klasifikace obsahu | cs |
dc.subject | mapování ontologií | cs |
dc.subject.translated | web data integration | en |
dc.subject.translated | information extraction | en |
dc.subject.translated | structured record extraction | en |
dc.subject.translated | page segmentation | en |
dc.subject.translated | content classification | en |
dc.subject.translated | ontology mapping | en |
dc.title | Information extraction from the web by matching visual presentation patterns | en |
dc.type | konferenční příspěvek | cs |
dc.type | conferenceObject | en |
dc.type.status | Peer-reviewed | en |
dc.type.version | publishedVersion | en |
Files
Original bundle
- Name: Minarik.pdf
- Size: 376.7 KB
- Format: Adobe Portable Document Format
- Description: Full text