Building a 50M Corpus of Tajik Language

Authors

DOVUDOV Gulshan, POMIKÁLEK Jan, SUCHOMEL Vít, ŠMERK Pavel

Year of publication 2011
Type Article in Proceedings
Conference Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011
MU Faculty or unit

Faculty of Informatics

Citation
Web https://nlp.fi.muni.cz/raslan/2011/paper07.pdf
Field Linguistics
Keywords language corpora; corpus; corpus building; Tajik
Description The paper presents by far the largest available computer corpus of the Tajik language, comprising more than 50 million words. Two different approaches were used to obtain the texts for the corpus. The paper describes both approaches, discusses their advantages and disadvantages, and gives some statistics for the two resulting partial corpora. It then characterizes the joined corpus and finally discusses some possible future improvements.