Chenghao Mou
Can you try upgrading numpy to `>=1.26.4`? With 1.22.3: ``` >>> from numpy._typing import * Traceback (most recent call last): File "<stdin>", line 1, in <module> ModuleNotFoundError: No module named...
Good question. Honestly, it really depends on how skewed the data is. That said, I rarely see the iteration count go beyond 10 with large datasets. You are welcome to try the...
Hi @prikmm, thanks for creating this PR. One thing that might explain the disparity: `num_perm` is slightly different in the two scripts (256 vs 250), though only `b*r`...
I see. Could you provide any example data for me to reproduce the issue? Could you also share the exact command you use to run the scripts?
@jordane95 Can you share more details? Like the command or the log output? I took a look at the dataset you shared. The immediate observation is that that particular dataset...
We moved away from union find to a Spark implementation and then to GraphFrames. GraphFrames is used in the latest V2 (to be released): https://github.com/bigcode-project/bigcode-dataset/blob/main/near_deduplication/bigcode-v2/intra_dedup.py Without details, I can't offer much help.
It is definitely possible. But I will need some time to test them, as some scripts depend on global variables and a specific multiprocessing setup.
Now it is possible to call each script's main function like this: ```python import click from text_dedup.bloom_filter import main as bf_main from text_dedup.utils import BloomFilterArgs from text_dedup.utils import IOArgs from...
Thanks for the suggestion! AFAIK, there is no comparable graph processing library for Ray, which makes it less ideal for clustering large-scale datasets, which can be a bottleneck...
To answer the question in your title: no, there is no official release of pqrnn. However, you might find an open-source version of PRADO and some related code in https://github.com/tensorflow/models/tree/master/research/seq_flow_lite....