datatrove
Exact deduplication
First of all, thank you for providing such an excellent repository. I would like to inquire if the repository supports exact deduplication. Thank you in advance.
Do you mean exact "document" deduplication? As in, remove documents that have their entire content exactly repeated?
Indeed, that is precisely the point I was intending to convey.
We currently don't support it out of the box. MinHash will also find those documents, but that might be overkill if you only want exact matching. We'll add it to our to-do list, but feel free to open a PR if you'd like to work on it.
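In the meantime, for anyone who only needs exact matching: a minimal sketch of the general idea in plain Python, assuming the documents fit in a single process and that "exact" means byte-identical text after UTF-8 encoding. This is not datatrove's API; `exact_dedup` is a hypothetical helper name, and the choice of SHA-256 is just one reasonable option.

```python
import hashlib
from typing import Iterable, Iterator


def exact_dedup(texts: Iterable[str]) -> Iterator[str]:
    """Yield each distinct text once, in first-seen order.

    Hypothetical helper, not part of datatrove. Stores one 32-byte
    digest per unique document instead of the full text.
    """
    seen = set()
    for text in texts:
        # Hash the full content; identical documents produce identical digests.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            yield text


docs = ["hello world", "foo", "hello world", "Hello world"]
unique_docs = list(exact_dedup(docs))
print(unique_docs)
```

Note that this treats "hello world" and "Hello world" as different documents, since no normalization is applied. For datasets too large for one machine, the same idea would probably need the digests to be sorted or partitioned across workers, which is beyond this sketch.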