
The max_iteration for small-star and large-star in minhash_spark.py

Open Jason3900 opened this issue 1 year ago • 3 comments

How do I set a proper value for max_iteration to get a good trade-off between efficiency and accuracy? When we run on Spark clusters, we often process TBs of data, so each extra iteration is time-consuming.

Jason3900 avatar Feb 07 '24 09:02 Jason3900

Good question. Honestly, it really depends on the data skewness. That said, I rarely see the iteration count go beyond 10, even with large datasets. You are welcome to try the code in ./bigcode-v2/intra_dedup.py, which uses GraphFrames and has converged faster in my tests.
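For readers unfamiliar with why max_iteration matters here: the small-star/large-star rounds repeatedly re-point each node at the smallest node in its neighborhood until the edge set stops changing, so the iteration count is bounded by how quickly components collapse to their minimum node. Below is a minimal pure-Python sketch of the alternating rounds with an early-exit convergence check (not the Spark implementation; function names and the edge representation are illustrative):

```python
from collections import defaultdict

def large_star(edges):
    """For each node u, connect every larger neighbor to the minimum
    of u's neighborhood (including u itself)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    out = set()
    for u, ns in nbrs.items():
        m = min(ns | {u})
        for v in ns:
            if v > u:
                out.add((v, m))  # edges are (child, parent), child > parent
    return out

def small_star(edges):
    """For each node u, connect u and all of its smaller neighbors
    to the minimum of that smaller neighborhood."""
    nbrs = defaultdict(set)
    for u, v in edges:
        hi, lo = max(u, v), min(u, v)
        nbrs[hi].add(lo)
    out = set()
    for u, ns in nbrs.items():
        m = min(ns | {u})
        for v in ns | {u}:
            if v != m:
                out.add((v, m))
    return out

def connected_components(edges, max_iterations=10):
    """Alternate large-star and small-star until the edge set is stable
    or max_iterations is reached. Returns (node, component_min) pairs."""
    edges = {(max(u, v), min(u, v)) for u, v in edges}
    prev = None
    for _ in range(max_iterations):
        edges = small_star(large_star(edges))
        if edges == prev:  # converged: a full round changed nothing
            break
        prev = edges
    return edges
```

A chain 1-2-3-4 plus a separate pair 5-6 converges in a few rounds to `{(2, 1), (3, 1), (4, 1), (6, 5)}`, i.e. every node points at its component's minimum. On skewed data (long chains, a few huge near-duplicate clusters), more rounds are needed, which is why a fixed max_iteration trades accuracy for runtime: stopping early can leave a large component split into several pieces.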

ChenghaoMou avatar Feb 08 '24 08:02 ChenghaoMou

Thanks! I'll try it.

Jason3900 avatar Feb 09 '24 06:02 Jason3900

Stale issue message

github-actions[bot] avatar Apr 09 '24 17:04 github-actions[bot]