Who is Selling to Whom – Feature Evaluation for Multi-block Classification in Invoice Information Extraction

Varování

Publikace nespadá pod Ekonomicko-správní fakultu, ale pod Fakultu informatiky. Oficiální stránka publikace je na webu muni.cz.
Autoři

HA Hien Thi HORÁK Aleš

Rok publikování 2021
Druh Článek ve sborníku
Konference SPECOM 2021: 23rd International Conference on Speech and Computer
Fakulta / Pracoviště MU

Fakulta informatiky

Citace
www https://link.springer.com/chapter/10.1007/978-3-030-87802-3_23
Doi http://dx.doi.org/10.1007/978-3-030-87802-3_23
Klíčová slova OCR; Invoice; Block type classification; Seller; Buyer; Delivery address
Popis The invoice information extraction task aims at unifying the automatized processing of invoices in structured forms and in the form of a scanned image. Recognizing the pieces of information where a specific value is identified with a keyword (such as the invoice date) is a relatively well-managed task. On the other hand, identification of multi-block information on the invoice, such as distinguishing the seller, buyer, and the delivery address, is much more challenging due to versatile invoice layouts. In this work, we present a new technique of feature extraction and classification to recognize the seller, buyer, and delivery address text blocks in scanned invoices based on a combination of complex layout and annotated text features. The method does not only consider the block positional features but also the relation between blocks and block contents at a higher level. The technique is implemented as a module of the OCRMiner system. We offer its detailed evaluation and error analysis with a dataset of more than five hundred Czech invoices reaching the overall macro average F1-score of 94%.
Související projekty:

Používáte starou verzi internetového prohlížeče. Doporučujeme aktualizovat Váš prohlížeč na nejnovější verzi.