datatrove
Flexibility in minhash dedup by index
Could we add a new argument to specify whether we want to dedup against an index? In some cases, we only want to dedup a dataset against itself and build its index (say we want to run 10 such tasks in parallel), then run the dedup by index in later tasks.
It seems that the hash indexes of all datasets must be stored in one folder, so every subsequent dataset is deduplicated against all the indexes already in that folder. We also cannot specify which indexes the current dataset should be deduplicated against.
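To make the request concrete, here is a minimal, library-independent sketch of the behavior being asked for. It is not datatrove's API: the function `dedup_and_build_index` and its `against` parameter are hypothetical, and each document is reduced to a single exact-match signature rather than real minhash buckets.

```python
from typing import Hashable, Iterable, List, Set, Tuple


def dedup_and_build_index(
    signatures: Iterable[Hashable],
    against: Iterable[Set[Hashable]] = (),
) -> Tuple[List[int], Set[Hashable]]:
    """Hypothetical helper, not part of datatrove.

    Deduplicates one dataset against itself and, optionally, against an
    explicitly chosen list of existing indexes. Returns the positions of
    the documents that are kept and the new index built from them.
    """
    # Union of only the indexes the caller selected (empty by default).
    seen_elsewhere: Set[Hashable] = set().union(*against)
    index: Set[Hashable] = set()
    kept: List[int] = []
    for pos, sig in enumerate(signatures):
        if sig in index or sig in seen_elsewhere:
            continue  # duplicate within this dataset or in a chosen index
        index.add(sig)
        kept.append(pos)
    return kept, index


# Stage 1: independent tasks, each dedups only against itself.
kept_a, index_a = dedup_and_build_index(["h1", "h2", "h1"])
kept_b, index_b = dedup_and_build_index(["h2", "h3"])

# Stage 2: a later task dedups against index_a only, ignoring index_b.
kept_c, index_c = dedup_and_build_index(["h1", "h3", "h4"], against=[index_a])
```

With an empty `against`, each dataset can be processed in parallel without reading any shared folder; passing a list of indexes later gives control over which datasets are compared.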