
Flexibility in minhash dedup by index

Open jordane95 opened this issue 11 months ago • 8 comments

Could we add a new argument to specify whether we want to dedup by index? In some cases, we only want to dedup a dataset against itself and build its index (say, when running 10 tasks in parallel), and then run the dedup by index in later tasks.

It seems that the hash indexes of all datasets must be stored in one folder, so each subsequent dataset being processed must be deduped against every index in that folder. Also, we cannot specify which indexes we want to dedup the current dataset against.
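To make the requested workflow concrete, here is a minimal, library-independent sketch. It does not use the datatrove API: the function names (`signature`, `self_dedup`, `dedup_against`) are hypothetical, and a plain hash of the normalized text stands in for a real minhash signature. The point is only to show the two separate phases: each dataset is first deduped against itself and produces its own index, and a later step dedups against an explicitly chosen subset of indexes.

```python
import hashlib


def signature(text: str) -> str:
    # Stand-in for a minhash signature (assumption): hash of the
    # lowercased, whitespace-normalized text. Only exact duplicates
    # after normalization collide, unlike real minhash.
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def self_dedup(docs: list[str]) -> tuple[list[str], set[str]]:
    # Phase 1: dedup a dataset against itself only, and build its index.
    # Can run independently (e.g. in parallel) for every dataset.
    index: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        sig = signature(doc)
        if sig not in index:
            index.add(sig)
            kept.append(doc)
    return kept, index


def dedup_against(docs: list[str], indexes: list[set[str]]) -> list[str]:
    # Phase 2: drop documents found in any of the *selected* indexes.
    # The caller decides which indexes to pass; an empty list skips
    # dedup by index entirely.
    return [d for d in docs if not any(signature(d) in idx for idx in indexes)]


# Phase 1 for two datasets, each producing its own index.
kept_a, index_a = self_dedup(["hello world", "Hello  world", "foo"])
kept_b, index_b = self_dedup(["foo", "bar"])

# Phase 2: dedup dataset B against dataset A's index only.
final_b = dedup_against(kept_b, [index_a])
```

In this sketch, the choice of which indexes to pass to `dedup_against` plays the role of the requested argument: keeping one index per dataset rather than one shared folder is what lets a later task pick the indexes it dedups against.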

jordane95 • Feb 28 '24 09:02