Filtering Very Similar Text Documents: A Case Study
Authors | |
---|---|
Year of publication | 2004 |
Type | Article in Proceedings |
Conference | Computational Linguistics and Intelligent Text Processing |
MU Faculty or unit | |
Citation | |
Field | Informatics |
Keywords | machine learning; text categorization; text filtration; text similarity |
Description | This paper describes problems with the classification and filtering of similar relevant and irrelevant real medical documents from one very specific domain, obtained from Internet resources. Besides being similar, the document sets are often unbalanced: there is a lack of irrelevant documents for training. A definition of similarity is suggested. For the classification, six algorithms are tested from the document-similarity point of view. The best results are provided by the back-propagation-based neural network and by the radial-basis-function-based support vector machine. |