Structural metadata annotation of speech corpora: comparing broadcast news and broadcast conversations

dc.contributor.author: Kolář, Jáchym
dc.contributor.author: Švec, Jan
dc.date.accessioned: 2016-01-06T07:50:06Z
dc.date.available: 2016-01-06T07:50:06Z
dc.date.issued: 2008
dc.description.abstract-translated: Structural metadata extraction (MDE) research aims to develop techniques for automatically converting raw speech recognition output into forms that are more useful to humans and to downstream automatic processes. This may be achieved by inserting boundaries of syntactic/semantic units into the flow of speech, labeling non-content words such as filled pauses and discourse markers for optional removal, and identifying sections of disfluent speech. This paper compares two Czech MDE speech corpora: one in the domain of broadcast news and the other in the domain of broadcast conversations. A variety of statistics about fillers, edit disfluencies, and syntactic/semantic units are presented. Among other findings, we report statistics indicating that the distribution of parts of speech (POS) in disfluent portions of speech differs from the overall POS distribution. The two Czech corpora are compared not only with each other, but also with available statistics for English MDE corpora of broadcast news and telephone conversations. [en]
dc.format: 6 s. [cs]
dc.format.mimetype: application/pdf
dc.identifier.citation: KOLÁŘ, Jáchym; ŠVEC, Jan. Structural metadata annotation of speech corpora: comparing broadcast news and broadcast conversations. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08): 28-29-30 May 2008. Marrakech: ELRA, 2008, p. [1-6]. ISBN 2-9517408-4-0. [en]
dc.identifier.isbn: 2-9517408-4-0
dc.identifier.uri: http://www.kky.zcu.cz/cs/publications/KolarJ_2008_StructuralMetadata
dc.identifier.uri: http://hdl.handle.net/11025/17111
dc.language.iso: en [en]
dc.publisher: ELRA [en]
dc.rights: © Jáchym Kolář - Jan Švec [cs]
dc.rights.access: openAccess [en]
dc.subject: extrakce strukturálních metadat [cs]
dc.subject: automatická konverze řeči [cs]
dc.subject: řečový korpus [cs]
dc.subject.translated: structural metadata extraction [en]
dc.subject.translated: automatic conversion of speech [en]
dc.subject.translated: speech corpora [en]
dc.title: Structural metadata annotation of speech corpora: comparing broadcast news and broadcast conversations [en]
dc.type: článek [cs]
dc.type: article [en]
dc.type.status: Peer-reviewed [en]
dc.type.version: publishedVersion [en]

Files

Original bundle
Name: KolarJ_2008_StructuralMetadata.pdf
Size: 80.14 KB
Format: Adobe Portable Document Format
Description: Full text