datatrove

Minhash Deduplication Between Two Datasets

Open • yjha9649 opened this issue 6 months ago • 1 comment

Hello,

I am trying to perform Minhash-based deduplication between two datasets: an existing dataset and a new dataset. The goal is to remove documents from the new dataset if they are similar to those in the existing dataset.
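To make the goal concrete, here is a toy sketch of cross-dataset MinHash matching in plain Python. It is not datatrove's implementation: the names `shingles`, `minhash_signature`, and `find_matches`, the 64 hash functions, the 3-word shingles, and the 0.7 threshold are all assumptions chosen for illustration, and it compares every pair directly instead of using buckets.

```python
import hashlib

NUM_HASHES = 64      # assumed signature length, for illustration only
SHINGLE_SIZE = 3     # assumed word n-gram size


def shingles(text, n=SHINGLE_SIZE):
    """Return the set of word n-grams of a document."""
    words = text.lower().split()
    if len(words) <= n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text):
    """For each salted hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in doc_shingles
        ))
    return signature


def find_matches(existing, new, threshold=0.7):
    """Map each new doc ID to the first existing doc ID it resembles."""
    existing_sigs = {doc_id: minhash_signature(t) for doc_id, t in existing.items()}
    matches = {}
    for new_id, text in new.items():
        sig = minhash_signature(text)
        for old_id, old_sig in existing_sigs.items():
            # Fraction of equal positions estimates the Jaccard similarity.
            similarity = sum(a == b for a, b in zip(sig, old_sig)) / NUM_HASHES
            if similarity >= threshold:
                matches[new_id] = old_id
                break
    return matches


existing = {"old-0": "the quick brown fox jumps over the lazy dog"}
new = {
    "new-0": "the quick brown fox jumps over the lazy dog",
    "new-1": "completely unrelated sentence about minhash signatures and buckets",
}
matches = find_matches(existing, new)
to_keep = [doc_id for doc_id in new if doc_id not in matches]
```

In this toy version the matched ID from the existing dataset is kept in `matches`, which is the information I would like to recover from the real pipeline.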

Currently, I’m following the steps below to perform deduplication: https://colab.research.google.com/drive/1_nNRm8lc7KjGfj5K4UWemkis8uKfjQcz?usp=sharing

Does this approach make sense for cross-dataset deduplication?

Additionally, when I examine the generated .dups files after running the pipeline, I can identify document IDs from the new dataset. However, the corresponding document IDs from the existing dataset always appear as 4294967295, which I believe corresponds to a sentinel value (0xFFFFFFFF). Because of this, I cannot trace which document in the existing dataset matched.
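As a minimal sketch of the observation above, the snippet below decodes duplicate records and flags the ones whose second half is the sentinel. The record layout (four little-endian unsigned 32-bit integers: file and document ID of one side, then file and document ID of the other) is an assumption that should be checked against the datatrove source for the version in use, and `read_dup_pairs` and `is_index_match` are hypothetical helper names.

```python
import struct

SENTINEL = 0xFFFFFFFF  # 4294967295, i.e. 2**32 - 1

# Assumed record layout: (file_a, doc_a, file_b, doc_b) as little-endian uint32.
RECORD = struct.Struct("<4I")


def read_dup_pairs(raw: bytes):
    """Yield one 4-tuple per complete record found in the raw bytes."""
    usable = len(raw) - len(raw) % RECORD.size  # ignore a trailing partial record
    for offset in range(0, usable, RECORD.size):
        yield RECORD.unpack_from(raw, offset)


def is_index_match(pair):
    """True when the second side of the pair carries only sentinel values."""
    return pair[2] == SENTINEL and pair[3] == SENTINEL


# Synthetic bytes standing in for file contents; no real .dups data is used here.
sample = RECORD.pack(0, 7, SENTINEL, SENTINEL) + RECORD.pack(0, 3, 1, 9)
pairs = list(read_dup_pairs(sample))
flags = [is_index_match(p) for p in pairs]
```

With real data, `raw` would be the bytes of one .dups file read in binary mode. Records flagged by `is_index_match` are the ones where the matching document cannot be traced.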

Is there a way to retrieve or output the actual document IDs from the existing dataset in the .dups file or elsewhere?

Any help or guidance would be greatly appreciated. Thank you!

yjha9649 • Jun 12 '25 08:06

Hi.

Did you find a way to deal with this?

fabiancpl • Jul 07 '25 14:07