Slavonic Corpus for Stylometry Research
Authors | |
---|---|
Year of publication | 2015 |
Type | Article in Proceedings |
Conference | Proceedings of Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2015. |
MU Faculty or unit | |
Citation | |
Web | |
Field | Informatics |
Keywords | stylometry; Slavonic corpus; web structure detection; corpora building |
Description | Stylometry techniques such as authorship recognition, machine translation detection and pedophile identification are used daily in applications for the most widely used languages. Under-represented languages, however, lack data sources usable for stylometry research. In this paper, we propose an algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt crawling and data-cleaning techniques to the needs of stylometry and add a heuristic layer that detects and extracts meta-information. The system was used on Czech and Slovak web domains to build a Slavonic corpus for stylometry research. The collected data have been published, and we plan to build collections for other languages and gradually extend the existing ones. |