text-dedup
Consistently seeing more rows being dropped in minhash_spark.py compared to minhash.py
Hi @ChenghaoMou ,

I have been using `minhash_spark.py` via GCP Dataproc (with the BigCode-specific code removed) to deduplicate my multi-lingual dataset. To get an understanding of the reproducibility of the results, I also deduplicated the same multi-lingual dataset using `minhash.py`.
Currently, deduplication is performed on one language at a time.
When I ran this for the first language, `minhash.py` retained around 15-20% more documents than `minhash_spark.py`: the `minhash_spark.py` output had ~12M documents, while the `minhash.py` output had ~14.5M documents.
In #28, you mentioned that for the same algorithm, although which documents are removed is random, the number of documents removed should be the same. However, I am seeing different behaviour.
To validate this observation, I ran deduplication over the rest of the language subsets and again found that more documents were dropped by `minhash_spark.py`.
It would be great if you could help me understand this better by answering a few questions:

- Does the connected-components algorithm used in `minhash_spark.py` create different clusters than the union-find algorithm used in `minhash.py`?
- If the clusters are the same, shouldn't the number of samples in the outputs of both scripts also be the same?
- Could running the scripts on different machines be responsible for this behaviour? If yes, what is the reason?
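For context on the first two questions, here is a minimal, self-contained sketch (not the repository's code) of what I would expect: given the *same* set of duplicate pairs, union-find and connected components should partition the documents into identical clusters, so keeping one document per cluster should retain the same number of documents either way. The edge list and document count below are made-up toy values.

```python
# Toy check: union-find vs. BFS connected components on the same edge set.
# Assumption: both dedup scripts see the same duplicate pairs (edges).
from collections import defaultdict, deque

edges = [(0, 1), (1, 2), (3, 4), (5, 5), (6, 7), (7, 8), (8, 6)]
num_docs = 10  # documents 0..9; doc 9 has no duplicates

# --- Union-find with path compression ---
parent = list(range(num_docs))

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra

for a, b in edges:
    union(a, b)

uf_clusters = defaultdict(set)
for doc in range(num_docs):
    uf_clusters[find(doc)].add(doc)

# --- Connected components via BFS over the same edges ---
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

seen, cc_clusters = set(), []
for doc in range(num_docs):
    if doc in seen:
        continue
    comp, queue = set(), deque([doc])
    seen.add(doc)
    while queue:
        u = queue.popleft()
        comp.add(u)
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    cc_clusters.append(comp)

# The two partitions are identical, so retaining one document per cluster
# yields the same output size with either algorithm.
assert sorted(map(sorted, uf_clusters.values())) == sorted(map(sorted, cc_clusters))
```

If this equivalence holds, the count difference I am seeing would have to come from somewhere upstream of the clustering step (e.g. different duplicate pairs being generated), which is what I am hoping to pin down.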
Beyond these questions, I would be grateful for any other information that could help me troubleshoot this behaviour!
Thanks.