Utilizing Linguistic Resources: Theory and Practical Experience

Němčík, Václav

Authors	NĚMČÍK Václav
Year of publication	2010
Type	Article in Proceedings
Conference	Proceedings of Recent Advances in Slavonic Natural Language Processing 2010
MU Faculty or unit	Faculty of Informatics
Web	https://nlp.fi.muni.cz/raslan/2010/paper04.pdf
Field	Informatics
Keywords	linguistic resources; corpora; theory; practice
Description	The Prague Dependency Treebank (henceforth PDT) is a large collection of Czech texts. It contains several layers of rich annotation, ranging from morphology to deep syntax. It is unique in its size and theoretical background, especially for Czech, which can be considered a small language in terms of the number of its speakers. In this article, we use PDT 2.0 to demonstrate that within real NLP systems, complex annotations may cut both ways. We present several issues that might pose problems when extracting data from PDT, and from complex structures in general, and hint at possible solutions.
Related projects:	Center for Computational Linguistics; Tools for Building a Complex Knowledge Base for Communication with the Semantic Web in Natural Language