
Consistently seeing more rows being dropped in minhash_spark.py compared to minhash.py

Open prikmm opened this issue 8 months ago • 8 comments

Hi @ChenghaoMou ,

I have been using minhash_spark.py via GCP Dataproc (with all the BigCode-specific code removed) to deduplicate my multilingual dataset. To get a sense of the reproducibility of the results, I also deduplicated the same multilingual dataset using minhash.py.

Currently, deduplication is performed on one language at a time.

When I ran this for the first language, minhash.py retained around 15-20% more documents than minhash_spark.py: the minhash_spark.py output had ~12M documents, while the minhash.py output had ~14.5M.

In #28, you mentioned that for the same algorithm, although which documents get removed is random, the number of documents removed stays the same. However, I am seeing different behaviour.

To validate this, I ran the deduplication over the rest of the language subsets and again found that more documents were dropped by minhash_spark.py.
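
For what it's worth, this is roughly how I compare the two outputs (a sketch under the assumption that both runs write parquet and that every document carries a unique `id` column I attach before deduplication; the paths and the column name are placeholders):

```python
import pyarrow.parquet as pq

# Hypothetical output paths and a hypothetical unique "id" column added
# to every document before deduplication; adjust to your schema.
py_ids = set(pq.read_table("out/minhash/", columns=["id"])["id"].to_pylist())
spark_ids = set(pq.read_table("out/minhash_spark/", columns=["id"])["id"].to_pylist())

print(f"minhash.py kept {len(py_ids):,} documents")
print(f"minhash_spark.py kept {len(spark_ids):,} documents")
print(f"kept only by minhash.py: {len(py_ids - spark_ids):,}")
print(f"kept only by minhash_spark.py: {len(spark_ids - py_ids):,}")
```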

It would be great if you could help me understand this better by answering a few questions:

  1. Does the connected-components algorithm used in minhash_spark.py create different clusters than the union-find used in minhash.py? (See the sketch after this list.)
  2. If the number of clusters is the same, shouldn't the number of samples in the outputs of both scripts also be the same?
  3. Could running the scripts on different machines be responsible for this behaviour? If yes, what is the reason?
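
For context on questions 1 and 2, here is a minimal self-contained sketch I wrote (my own illustration, not code from either script) of my understanding: given the same set of duplicate pairs, union-find and connected components should yield the identical partition, so any difference in cluster counts would have to come from the edge sets themselves (e.g. different hash seeds, band/row settings, or tokenization):

```python
from collections import defaultdict

edges = [(0, 1), (1, 2), (3, 4)]  # hypothetical duplicate pairs from MinHash-LSH
num_docs = 6

# --- Union-find, the single-machine approach ---
parent = list(range(num_docs))

def find(x: int) -> int:
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a: int, b: int) -> None:
    parent[find(a)] = find(b)

for a, b in edges:
    union(a, b)

uf_clusters = defaultdict(set)
for doc in range(num_docs):
    uf_clusters[find(doc)].add(doc)

# --- Connected components over the same edges, here via a simple DFS ---
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

seen, cc_clusters = set(), []
for doc in range(num_docs):
    if doc in seen:
        continue
    component, stack = set(), [doc]
    while stack:
        node = stack.pop()
        if node in component:
            continue
        component.add(node)
        stack.extend(adj[node] - component)
    seen |= component
    cc_clusters.append(component)

# Both views of the same edge set give identical partitions:
# {0, 1, 2}, {3, 4}, {5}
assert sorted(map(sorted, uf_clusters.values())) == sorted(map(sorted, cc_clusters))
```

If that reasoning is right, the discrepancy should lie upstream of the clustering step, which is what I am hoping you can help me pin down.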

I would also be grateful for any information beyond these questions that could help me troubleshoot this behaviour!

Thanks.

prikmm · Oct 27 '23 20:10