text-dedup issues

ModuleNotFoundError: No module named 'text_dedup.embedders'

6

ModuleNotFoundError: No module named 'text_dedup.embedders' is raised when running "from text_dedup.embedders.minhash import MinHashEmbedder"

done520

The max_iteration for small star and large star in minhash_spark.py

3

How should max_iteration be set to get a better trade-off between efficiency and accuracy? Normally, when we're using Spark clusters, we may deal with TBs of...

Jason3900

no-issue-activity

no module named numpy._typing

1

When I run minhash_spark.py with spark-submit, the program exits with this error. I searched for the keyword in the whole project and did not find any related code and also...

Leoooooo123

Little refactor to allow imports from python instead of cli/subprocess

2

Currently, there is no real way to import a deduplication algorithm and use it as a dependency in my Python code without almost totally rewriting the content of main (the code...

wuodar

How about adding a Ray executor for deduplication?

1

- https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py
- Reference: https://github.com/alibaba/data-juicer/blob/main/data_juicer/core/ray_executor.py
- Ray is simpler and faster than Spark

simplew2011

Consistently seeing more rows being dropped in minhash_spark.py compared to minhash.py

8

Hi @ChenghaoMou, I have been using `minhash_spark.py` via GCP Dataproc (with all the `bigcode`-specific code removed) to deduplicate my multilingual dataset. To get an understanding of the reproducibility...

prikmm

no-issue-activity

Data reading failed

3

![image](https://github.com/ChenghaoMou/text-dedup/assets/31037013/616761bd-18ed-4028-ac9b-a2bec2297841)

programmerLY

minhash_spark.py [UNABLE_TO_INFER_SCHEMA]

2

When I use a Spark cluster to run minhash_spark.py, I occasionally encounter [UNABLE_TO_INFER_SCHEMA] errors, as shown in the following figure. I don't know if it's a problem with the data....

Yang-QW
