Scaling to Billion-plus Word Corpora

Authors

POMIKÁLEK Jan, RYCHLÝ Pavel, KILGARRIFF Adam

Year of publication 2009
Type Article in Periodical
Magazine / Source Advances in Computational Linguistics
MU Faculty or unit

Faculty of Informatics

Field Informatics
Keywords word corpora; web as corpus; duplicate detection
Description Most phenomena in natural languages are distributed in accordance with Zipf's law, so many words, phrases and other items occur rarely, and we need very large corpora to provide evidence about them. Previous work shows that it is possible to create very large (multi-billion-word) corpora from the web. The usability of such corpora is often limited by duplicate content and a lack of efficient query tools. This paper describes BiWeC, a Big Web Corpus of English texts currently comprising 5.5 billion words fully processed, with a target size of 20 billion. We present a method for detecting near-duplicate text documents in multi-billion-word text collections and describe how one corpus query tool, the Sketch Engine, has been re-engineered to efficiently encode, process and query such corpora on low-cost hardware.
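
The abstract does not spell out the near-duplicate detection algorithm, but a standard baseline for this task is shingling: hash each document's word n-grams and treat two documents as near-duplicates when the Jaccard similarity of their shingle sets exceeds a threshold. The sketch below is illustrative only; the n-gram size, hash function and threshold are assumptions, not the paper's parameters.

```python
# Minimal shingling sketch for near-duplicate detection (illustrative;
# not the method or parameters used in BiWeC).
import hashlib


def shingles(text, n=5):
    """Return the set of hashed word n-grams ("shingles") of a document."""
    words = text.lower().split()
    return {
        hashlib.md5(" ".join(words[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(max(len(words) - n + 1, 1))
    }


def resemblance(doc_a, doc_b, n=5):
    """Jaccard similarity of the two documents' shingle sets."""
    a, b = shingles(doc_a, n), shingles(doc_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog near the river bank"
    b = "the quick brown fox jumps over the lazy dog near the river"
    # Pairs scoring above a chosen threshold (e.g. 0.9) would be
    # treated as near-duplicates and deduplicated.
    print(round(resemblance(a, b), 3))
```

At multi-billion-word scale, comparing all document pairs directly is infeasible, so practical systems index shingles (or compact min-hash sketches of them) and only compare documents that share indexed values.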