Do we need very large corpora?

Authors PALA Karel, RYCHLÝ Pavel

Year of publication 2011
Type Article in Proceedings
MU Faculty or unit Faculty of Informatics

Field Informatics
Keywords corpora, corpus tools
Description This paper deals with building very large corpora from the Web. First, we discuss the motivation and need for this kind of resource among linguists, lexicographers, and NLP specialists. Second, we describe the techniques used for building large corpora (more than a billion tokens) and present the results obtained at the NLP Centre FI MU, namely both tools and corpora. Finally, we analyze the consequences of building large text data resources and the ways in which they are used in corpus linguistics and various NLP applications.