text-dedup
The max_iteration for small star and large star in minhash_spark.py
How should one set a proper value for max_iteration to get a good trade-off between efficiency and accuracy? When running on Spark clusters we often deal with TBs of data, so every extra iteration is time-consuming.
Good question. Honestly, it really depends on the data skewness. That said, I rarely see the iteration count go beyond 10, even on large datasets. You are also welcome to try the code in ./bigcode-v2/intra_dedup.py, which uses GraphFrames and has converged faster in my tests.
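For context, the small-star/large-star steps implement an alternating connected-components loop over the duplicate-pair edges, and the loop normally exits as soon as the edge set stops changing, so max_iteration acts only as a safety cap against badly skewed data. Below is a minimal, self-contained sketch of that pattern, not the repository's exact code: the `large_star`/`small_star` helper names, the toy edge list, and the subtract-based convergence test are all illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cc_sketch").getOrCreate()
sc = spark.sparkContext


def large_star(edges):
    # For each node u, connect every neighbor larger than u to the
    # smallest node in u's neighborhood (including u itself).
    neighborhoods = edges.flatMap(lambda e: [(e[0], e[1]), (e[1], e[0])]).groupByKey()

    def emit(pair):
        u, nbrs = pair
        m = min(min(nbrs), u)
        return [(v, m) for v in nbrs if v > u]

    return neighborhoods.flatMap(emit).distinct()


def small_star(edges):
    # Orient edges from the larger to the smaller endpoint, then connect
    # u and all of its smaller neighbors to the smallest one among them.
    neighborhoods = edges.map(lambda e: (max(e), min(e))).groupByKey()

    def emit(pair):
        u, nbrs = pair
        m = min(nbrs)
        return [(v, m) for v in nbrs if v != m] + [(u, m)]

    return neighborhoods.flatMap(emit).distinct()


# Toy edge list standing in for the MinHash duplicate pairs,
# normalized so each edge points from the larger to the smaller id.
edges = sc.parallelize([(1, 2), (2, 3), (4, 5), (6, 4)]).map(
    lambda e: (max(e), min(e))
).distinct()

max_iteration = 10  # safety cap only; the loop normally exits earlier
for _ in range(max_iteration):
    new_edges = small_star(large_star(edges)).cache()
    # Converged when the edge sets are identical: same size and no
    # edge in new_edges that is missing from the previous round.
    converged = (new_edges.count() == edges.count()
                 and new_edges.subtract(edges).isEmpty())
    edges = new_edges
    if converged:
        break  # edge set stabilized into (node -> component minimum) pairs

print(sorted(edges.collect()))  # [(2, 1), (3, 1), (5, 4), (6, 4)]
```

With a convergence check like this in place, max_iteration only bounds the worst case, so a small cap such as 10 (per the answer above) is low-risk: well-behaved data exits early, and only heavily skewed graphs would ever hit the limit.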
Thanks! I'll try it.