Blooming Onion: Efficient Deduplication through Approximate Membership Testing
Authors | |
---|---|
Year of publication | 2022 |
Type | Article in proceedings |
Conference | Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022 |
MU Faculty or unit | |
Citation | |
www | |
Keywords | deduplication; text corpora; Bloom filter |
Description | Deduplication of source text is an important step in corpus building. Maximum corpus sizes have grown significantly, and so have the computing resources needed to process them. This article explores reducing the cost of deduplication by applying approximate membership testing with Bloom filters. |
Related projects: |