
The max_iteration for small-star and large-star in minhash_spark.py

Open Jason3900 opened this issue 1 year ago • 3 comments

How do I set a proper value for max_iteration to get a good trade-off between efficiency and accuracy? When we run on Spark clusters, we often process TBs of data, so each extra iteration is time-consuming.

Jason3900 avatar Feb 07 '24 09:02 Jason3900

Good question. Honestly, it really depends on the data skewness. That said, I rarely see the iteration count go beyond 10, even with large datasets. You are welcome to try the code in ./bigcode-v2/intra_dedup.py, which uses GraphFrames and has converged faster in my tests.
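For readers unfamiliar with why max_iteration matters here: the small-star/large-star rounds repeatedly re-point each node at the smallest node in its neighborhood until the edge set stops changing, so the iteration count is bounded by how quickly components collapse to their minimum node. Below is a minimal pure-Python sketch of the alternating rounds with an early-exit convergence check (not the Spark implementation; function names and the edge representation are illustrative):

```python
from collections import defaultdict

def large_star(edges):
    """For each node u, connect every larger neighbor to the minimum
    of u's neighborhood (including u itself)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    out = set()
    for u, ns in nbrs.items():
        m = min(ns | {u})
        for v in ns:
            if v > u:
                out.add((v, m))  # edges are (child, parent), child > parent
    return out

def small_star(edges):
    """For each node u, connect u and all of its smaller neighbors
    to the minimum of that smaller neighborhood."""
    nbrs = defaultdict(set)
    for u, v in edges:
        hi, lo = max(u, v), min(u, v)
        nbrs[hi].add(lo)
    out = set()
    for u, ns in nbrs.items():
        m = min(ns | {u})
        for v in ns | {u}:
            if v != m:
                out.add((v, m))
    return out

def connected_components(edges, max_iterations=10):
    """Alternate large-star and small-star until the edge set is stable
    or max_iterations is reached. Returns (node, component_min) pairs."""
    edges = {(max(u, v), min(u, v)) for u, v in edges}
    prev = None
    for _ in range(max_iterations):
        edges = small_star(large_star(edges))
        if edges == prev:  # converged: a full round changed nothing
            break
        prev = edges
    return edges
```

A chain 1-2-3-4 plus a separate pair 5-6 converges in a few rounds to `{(2, 1), (3, 1), (4, 1), (6, 5)}`, i.e. every node points at its component's minimum. On skewed data (long chains, a few huge near-duplicate clusters), more rounds are needed, which is why a fixed max_iteration trades accuracy for runtime: stopping early can leave a large component split into several pieces.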

ChenghaoMou avatar Feb 08 '24 08:02 ChenghaoMou

Thanks! I'll try it.

Jason3900 avatar Feb 09 '24 06:02 Jason3900

Stale issue message

github-actions[bot] avatar Apr 09 '24 17:04 github-actions[bot]