Slavonic Corpus for Stylometry Research

Warning

This publication does not fall under the Faculty of Economics and Administration but under the Faculty of Informatics. The official publication page is on the muni.cz website.

Authors	ŠVEC Ján, RYGL Jan
Year of publication	2015
Type	Article in proceedings
Conference	Proceedings of Ninth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2015.
MU Faculty or unit	Faculty of Informatics
Field	Informatics
Keywords	stylometry; slavonic corpus; web structure detection; corpora building
Description	Stylometry techniques such as authorship recognition, machine translation detection, and pedophile identification are used daily in applications for the most widely used languages. Under-represented languages, however, lack data sources usable for stylometry research. In this paper, we propose an algorithm for building corpora that contain the meta-information required for stylometry experiments (author information, publication time, document heading, document borders) and introduce our tool, Authorship Corpora Builder (ACB). We adapt crawling and data-cleaning techniques to the needs of stylometry and add a heuristic layer to detect and extract meta-information. The system was used on Czech and Slovak web domains to build a Slavonic corpus for stylometry research. The collected data have been published, and we plan to build collections for other languages and to gradually extend the existing ones.
Related projects:	Project LINDAT-Clarin - Establishment and operation of the Czech node of the pan-European research infrastructure
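
The description above mentions a heuristic layer that detects author information, publication time, document heading, and document borders. The following is only a minimal illustrative sketch of that idea, not the ACB implementation: the names `DocumentRecord`, `MetaExtractor`, and `extract_record` are hypothetical, the sample page and author name are invented, and the choice of `<meta name="author">`, `article:published_time`, the first `<h1>`, and `<p>` elements as signals is an assumption made for this example.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Optional


@dataclass
class DocumentRecord:
    """Hypothetical corpus record with the meta-information listed in the abstract."""
    author: Optional[str] = None
    published: Optional[str] = None
    heading: Optional[str] = None
    text: str = ""


class MetaExtractor(HTMLParser):
    """Assumed heuristics: meta tags for author/time, first <h1> as heading, <p> as body."""

    def __init__(self):
        super().__init__()
        self.record = DocumentRecord()
        self._current = None      # tag whose text is being collected: "h1" or "p"
        self._buffer = []
        self._paragraphs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            key = (attr.get("name") or attr.get("property") or "").lower()
            value = (attr.get("content") or "").strip()
            if key == "author" and value:
                self.record.author = value
            elif key == "article:published_time" and value:
                self.record.published = value
        elif tag in ("h1", "p") and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        # Collapse runs of whitespace inside the collected element.
        text = " ".join("".join(self._buffer).split())
        if tag == "h1" and self.record.heading is None:
            self.record.heading = text
        elif tag == "p" and text:
            self._paragraphs.append(text)
        self._current = None

    def close(self):
        super().close()
        # The joined paragraphs stand in for the detected document borders.
        self.record.text = "\n".join(self._paragraphs)


def extract_record(html: str) -> DocumentRecord:
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return parser.record


# Invented sample page, used only to exercise the sketch.
SAMPLE = """<html><head>
<meta name="author" content="Jana Nováková">
<meta property="article:published_time" content="2015-12-04T10:00:00+01:00">
</head><body><h1>Ukázkový článek</h1>
<p>První odstavec.</p><p>Druhý odstavec.</p></body></html>"""

record = extract_record(SAMPLE)
print(record.author, record.published, record.heading, sep=" | ")
```

Real web pages rarely expose this information so cleanly, which is presumably why the paper describes a dedicated heuristic layer; a sketch like this would miss bylines and dates that appear only in the visible page text.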