Chenghao Mou comments

Results: 77 comments of Chenghao Mou

no module named numpy._typing

Can you try upgrading numpy to `>=1.26.4`? With 1.22.3:

```
>>> from numpy._typing import *
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named...
```
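A quick way to check whether an environment is affected is sketched below. It only reports the installed numpy version and whether `numpy._typing` can be located; the helper names (`has_module`, `version_tuple`, `installed_version`) are made up for this illustration, and the `1.26.4` threshold is simply the version suggested in the comment above.

```python
import importlib.util
from importlib import metadata


def has_module(name: str) -> bool:
    """Return True if `name` can be located (parent packages are imported)."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (for example numpy itself) is missing.
        return False


def version_tuple(version: str) -> tuple:
    """Turn '1.26.4' into (1, 26, 4); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_version(dist: str):
    """Return the installed version string of `dist`, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("numpy")
    print("numpy version:", found)
    print("numpy._typing importable:", has_module("numpy._typing"))
    if found is not None and version_tuple(found) < (1, 26, 4):
        print("Consider upgrading: pip install 'numpy>=1.26.4'")
```

The version parsing is deliberately naive to avoid extra dependencies; a real project would more likely compare versions with `packaging.version`.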

The max_iteration for small star and large star in minhash_spark.py

Good question. Honestly, it really depends on the data skewness. That said, I rarely see the number of iterations go beyond 10 with large datasets. You are welcome to try the...
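To get a feel for how many rounds a given graph needs, here is a small pure-Python sketch of alternating large-star and small-star steps until the edge set stops changing. It is an illustration of the general technique only, not the code in minhash_spark.py: the function names, the convergence check, and the default `max_iterations=10` are all assumptions made for this example.

```python
from collections import defaultdict


def large_star(edges):
    """Link every larger neighbour of u to the smallest node around u."""
    nbrs = defaultdict(set)
    for u, v in edges:
        if u != v:
            nbrs[u].add(v)
            nbrs[v].add(u)
    out = set()
    for u, ns in nbrs.items():
        m = min(ns | {u})
        out.update((v, m) for v in ns if v > u)
    return out


def small_star(edges):
    """Link u and its smaller neighbours to the smallest of them."""
    smaller = defaultdict(set)
    for u, v in edges:
        if u != v:
            smaller[max(u, v)].add(min(u, v))
    out = set()
    for u, ns in smaller.items():
        m = min(ns)
        out.update((v, m) for v in ns | {u} if v != m)
    return out


def connected_components(edges, max_iterations=10):
    """Return (labels, iterations); labels may be partial if the cap is hit."""
    edges = set(edges)
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        new_edges = small_star(large_star(edges))
        if new_edges == edges:  # nothing changed, so we have converged
            break
        edges = new_edges
    labels = {}
    for v, m in edges:
        labels[v] = min(m, labels.get(v, m))
        labels.setdefault(m, m)
    return labels, iterations


if __name__ == "__main__":
    chain = [(1, 2), (2, 3), (3, 4), (4, 5)]
    print(connected_components(chain))
```

Long chains are the slow case for this scheme, which is one way skew in the duplicate graph can push the iteration count up; running the sketch on a sample of your own edges gives a rough idea of whether a cap like 10 is comfortable.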

Consistently seeing more rows being dropped in minhash_spark.py compared to minhash.py

Hi @prikmm, thanks for creating this PR. One thing that might explain the disparity: the `num_perm` is slightly different in the two scripts (256 vs 250), though only `b*r`...
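For context on `b*r`: under the standard MinHash LSH banding scheme (assumed here, not confirmed for either script), a signature is split into `b` bands of `r` rows, so `b*r` hash values feed the bands, and two documents with Jaccard similarity `s` become candidates with probability `1 - (1 - s^r)^b`. The sketch below just evaluates that formula; the values `b=25, r=10` are hypothetical and not taken from the scripts.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two items share at least one identical band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def hashes_used(bands: int, rows: int) -> int:
    """Number of signature entries that actually feed the bands."""
    return bands * rows


def unused_hashes(num_perm: int, bands: int, rows: int) -> int:
    """Signature entries left over once b*r of them are assigned to bands."""
    return num_perm - hashes_used(bands, rows)


if __name__ == "__main__":
    b, r = 25, 10  # hypothetical values, for illustration only
    for num_perm in (256, 250):
        print(num_perm, "permutations ->", unused_hashes(num_perm, b, r), "unused")
    for s in (0.5, 0.7, 0.9):
        print(f"s={s}: P(candidate)={candidate_probability(s, b, r):.4f}")
```

If both scripts end up with the same `b` and `r`, the banding step behaves the same under this model regardless of the leftover permutations, so a difference in dropped rows would have to come from elsewhere.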

Consistently seeing more rows being dropped in minhash_spark.py compared to minhash.py

I see. Could you provide any example data for me to reproduce the issue? Could you also share the exact command you use to run the scripts?

Consistently seeing more rows being dropped in minhash_spark.py compared to minhash.py

@jordane95 Can you share more details? Like the command or the log output? I took a look at the dataset you shared. The immediate observation is that that particular dataset...

Consistently seeing more rows being dropped in minhash_spark.py compared to minhash.py

We moved away from union-find to a Spark implementation and then to GraphFrames. GraphFrames is used in the latest V2 (to be released): https://github.com/bigcode-project/bigcode-dataset/blob/main/near_deduplication/bigcode-v2/intra_dedup.py Without details, I can't offer much help.

Little refactor to allow imports from python instead of cli/subprocess

how about make a ray executor to deduplication

Has the official version of pqrnn been released?