Structural metadata annotation of speech corpora: comparing broadcast news and broadcast conversations

Kolář, Jáchym

dc.contributor.author	Kolář, Jáchym
dc.contributor.author	Švec, Jan
dc.date.accessioned	2016-01-06T07:50:06Z
dc.date.available	2016-01-06T07:50:06Z
dc.date.issued	2008
dc.description.abstract-translated	Structural metadata extraction (MDE) research aims to develop techniques for automatic conversion of raw speech recognition output into forms that are more useful to humans and to downstream automatic processes. It may be achieved by inserting boundaries of syntactic/semantic units into the flow of speech, labeling non-content words like filled pauses and discourse markers for optional removal, and identifying sections of disfluent speech. This paper compares two Czech MDE speech corpora – one in the domain of broadcast news and the other in the domain of broadcast conversations. A variety of statistics about fillers, edit disfluencies, and syntactic/semantic units are presented. Among others, we report statistics indicating that disfluent portions of speech differ from the overall corpus in the distribution of parts of speech (POS) of their word content. The two Czech corpora are compared not only with each other, but also with available statistics relating to English MDE corpora of broadcast news and telephone conversations.	en
dc.format	6 p.	cs
dc.format.mimetype	application/pdf
dc.identifier.citation	KOLÁŘ, Jáchym; ŠVEC, Jan. Structural metadata annotation of speech corpora: comparing broadcast news and broadcast conversations. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08): 28-29-30 May 2008. Marrakech: ELRA, 2008, p. [1-6]. ISBN 2-9517408-4-0.	en
dc.identifier.isbn	2-9517408-4-0
dc.identifier.uri	http://www.kky.zcu.cz/cs/publications/KolarJ_2008_StructuralMetadata
dc.identifier.uri	http://hdl.handle.net/11025/17111
dc.language.iso	en	en
dc.publisher	ELRA	en
dc.rights	© Jáchym Kolář - Jan Švec	cs
dc.rights.access	openAccess	en
dc.subject	extrakce strukturálních metadat	cs
dc.subject	automatická konverze řeči	cs
dc.subject	řečový korpus	cs
dc.subject.translated	structural metadata extraction	en
dc.subject.translated	automatic conversion of speech	en
dc.subject.translated	speech corpora	en
dc.title	Structural metadata annotation of speech corpora: comparing broadcast news and broadcast conversations	en
dc.type	článek	cs
dc.type	article	en
dc.type.status	Peer-reviewed	en
dc.type.version	publishedVersion	en

Files

Original bundle

Name: KolarJ_2008_StructuralMetadata.pdf
Size: 80.14 KB
Format: Adobe Portable Document Format
Description: Full text

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description:

Collections

Articles (KKY)