Minhash Deduplication Between Two Datasets
Hello,
I am trying to perform Minhash-based deduplication between two datasets: an existing dataset and a new dataset. The goal is to remove documents from the new dataset if they are similar to those in the existing dataset.
Currently, I’m following the steps in this Colab notebook to perform the deduplication: https://colab.research.google.com/drive/1_nNRm8lc7KjGfj5K4UWemkis8uKfjQcz?usp=sharing
Does this approach make sense for cross-dataset deduplication?
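To make the intended behavior concrete, here is a minimal pure-Python sketch of cross-dataset MinHash deduplication. It is not the pipeline from the notebook: the function names, `NUM_PERM`, `SHINGLE`, the `threshold` value, and the brute-force comparison against every existing document are all assumptions chosen for illustration. The point is only that each dropped document from the new dataset should be traceable to the existing document it matched.

```python
import hashlib
import re

NUM_PERM = 64  # hash functions per signature (illustrative value)
SHINGLE = 3    # words per shingle (illustrative value)


def shingles(text, n=SHINGLE):
    """Split text into a set of overlapping word n-grams."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def signature(text, num_perm=NUM_PERM):
    """MinHash signature: the minimum salted hash of the shingles, per seed."""
    grams = shingles(text)
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return sig


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def cross_dedup(existing, new, threshold=0.7):
    """Return (kept_new_ids, matches); matches maps new ID -> existing ID."""
    existing_sigs = {doc_id: signature(text) for doc_id, text in existing.items()}
    kept, matches = [], {}
    for new_id, text in new.items():
        sig = signature(text)
        best_id, best_score = None, 0.0
        for old_id, old_sig in existing_sigs.items():
            score = estimated_jaccard(sig, old_sig)
            if score > best_score:
                best_id, best_score = old_id, score
        if best_score >= threshold:
            matches[new_id] = best_id  # remember which existing doc matched
        else:
            kept.append(new_id)
    return kept, matches
```

Calling `cross_dedup(existing, new)` with two dictionaries of `{doc_id: text}` returns the new IDs to keep and a mapping from each removed new ID to its matching existing ID. That mapping is the information I would like to recover from the real pipeline. A real implementation would use LSH buckets instead of comparing every pair.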
Additionally, when I examine the generated .dups files after running the pipeline, I can identify document IDs from the new dataset. However, the corresponding document IDs from the existing dataset always appear as 4294967295, which I believe corresponds to a sentinel value (0xFFFFFFFF). Because of this, I cannot trace which document in the existing dataset matched.
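As a quick sanity check on the number itself, the short snippet below confirms that 4294967295 is the largest unsigned 32-bit integer, i.e. all 32 bits set. The `SENTINEL` name and the `is_sentinel` helper are hypothetical and only used here for illustration; I am assuming, not asserting, that the pipeline uses this value as a placeholder.

```python
import struct

# Assumed placeholder value seen in the .dups output (hypothetical name).
SENTINEL = 0xFFFFFFFF


def is_sentinel(doc_id: int) -> bool:
    """True if the ID is the all-ones 32-bit value rather than a real ID."""
    return doc_id == SENTINEL


# 4294967295 is 2**32 - 1, which packs to four 0xFF bytes as a uint32.
packed = struct.pack("<I", 4294967295)
```

So the value carries no information about which existing document matched; it only marks that the match came from somewhere other than a normal document ID.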
Is there a way to retrieve or output the actual document IDs from the existing dataset in the .dups file or elsewhere?
Any help or guidance would be greatly appreciated. Thank you!
Hi.
Did you find a way to deal with this?