Korpusy jako zdroje dat pro úpravy nástrojů automatické morfologické analýzy (Slovotvorné varianty adjektiv na [(ou)|í]cí z hlediska morfologického značkování)
Title in English | Corpus as Source of Amendments for Automatic Morphological Analysis |
---|---|
Authors | |
Year of publication | 2014 |
Type | Appeared in Conference without Proceedings |
MU Faculty or unit | |
Citation | |
Description | We present a corpus-driven study of Czech gerunds (verbal adjectives in -oucí/-ící). The link between inflectional and word-formation variants is demonstrated on material from the SYN corpus (2.6 billion tokens of written Czech) and the large web corpus czTenTen12 (5.2 billion tokens of Czech text from the internet, cleaned and deduplicated). Adjectives in -oucí/-ící are regularly derived from verbs and are therefore not usually listed in Czech monolingual dictionaries. In automatic morphological analysis of Czech, they should be generated from verbal roots and tagged as verbal adjectives (POS tag). The data from Czech corpora reveal (a) inconsistencies and (b) gaps in tagging. The main cause of both is the existence of variants among the verbal forms from which the verbal adjectives are potentially derived. Text corpora are consequently a significant source of knowledge about the formation and use of adjectives in -oucí/-ící, which matters both for (a) automatic morphological analysis of Czech and (b) the theoretical description of Czech grammar (derivational morphology). |