Segmentation from 97% to 100%: Is It Time for Some Linguistics?
Název česky | Segmentace z 97% na 100%: není čas pro trochu lingvistiky? |
---|---|
Autoři | |
Rok publikování | 2012 |
Druh | Článek ve sborníku |
Konference | Sixth Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2012 |
Fakulta / Pracoviště MU | |
Citace | |
www | |
Obor | Informatika |
Klíčová slova | competing patterns;segmentation;hyphenation;NP problems;pattern generation;patgen;context-sensitive patterns;machine learning;natural language engineering;EuDML |
Popis | Many tasks in natural language processing (NLP) require \emph{segmentation} algorithms: segmentation of paragraph into sentences, segmentation of sentences into words is needed in languages like Chinese or Thai, segmentation of words into syllables (\emph{hyphenation}) or into morphological parts (e.g.\ getting word stem for indexing), and many other tasks (e.g.\ tagging) could be formulated as segmentation problems. We evaluate methodology of using \emph{competing patterns} for these tasks and decide on the complexity of creation of space-optimal (minimal) patterns that completely (100\,\%) implement the segmentation task. We formally define this task and prove that it is in the class of \emph{non-polynomial} optimization problems. However, finding space-efficient competing patterns for real NLP tasks is feasible and gives efficient scalable solutions of segmentation task: segmentation is done in \emph{constant} time with respect to the size of segmented dictionary. Constant time of access to segmentations makes competing patterns attractive data structure for many NLP tasks. |
Související projekty: |