
All-in-one text de-duplication

Results 8 text-dedup issues

ModuleNotFoundError: No module named 'text_dedup.embedders' when running `from text_dedup.embedders.minhash import MinHashEmbedder`

How should max_iteration be set to get a better trade-off between efficiency and accuracy? Normally, when we use Spark clusters, we may deal with TBs of...

no-issue-activity

When I run minhash_spark.py with spark-submit, the program exits with this bug. I searched for the keyword in the whole project and did not find any related code and also...

Currently, there is no real way to import a deduplication algorithm and use it as a dependency in my Python code without almost totally rewriting the content of main (the code...
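
To illustrate the kind of importable entry point this request has in mind, here is a minimal, self-contained MinHash sketch. It is not text-dedup's API: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`), the word-trigram shingling, and the salted BLAKE2b hashing are all assumptions chosen for the example.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of word n-grams in `text` (hypothetical helper)."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) <= n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items, num_perm=64):
    """Keep the minimum 64-bit hash of `items` under `num_perm` salted hashes."""
    if not items:
        raise ValueError("cannot build a signature from an empty set")
    signature = []
    for seed in range(num_perm):
        # Assumption: a distinct salt stands in for each random permutation.
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumps over the lazy cat"))
print(estimated_jaccard(a, b))
```

The two example sentences share 6 of their 8 distinct trigrams, so the exact Jaccard similarity is 0.75; the printed estimate should land roughly in that neighborhood, with the spread shrinking as `num_perm` grows.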

- https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py
- reference: https://github.com/alibaba/data-juicer/blob/main/data_juicer/core/ray_executor.py
- Ray is simpler and faster than Spark

Hi @ChenghaoMou, I have been using `minhash_spark.py` via GCP Dataproc (removed all the code present for `bigcode`) for deduplicating my multilingual dataset. To get an understanding of the reproducibility...

no-issue-activity

![image](https://github.com/ChenghaoMou/text-dedup/assets/31037013/616761bd-18ed-4028-ac9b-a2bec2297841)

When I run minhash_spark.py on a Spark cluster, I occasionally encounter [UNABLE_TO_INFER_SCHEMA] errors, as shown in the following figure. I don't know if it's a problem with the data....
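
Spark typically raises this schema-inference error when an input path contains no readable data files, for example an empty directory or one holding only marker files. The snippet does not say whether that is the cause here, so the following is only a sketch of a pre-flight check one might run before submitting the job. The helper name `has_readable_data` and the `.parquet` suffix are assumptions, and the check only works for local filesystem paths, not `gs://` or `hdfs://` URIs.

```python
from pathlib import Path


def has_readable_data(path, suffix=".parquet"):
    """Return True if `path` holds at least one non-empty, non-hidden data file."""
    root = Path(path)
    if not root.exists():
        return False
    candidates = [root] if root.is_file() else root.rglob("*" + suffix)
    for f in candidates:
        # Spark ignores names starting with "_" or "." (e.g. _SUCCESS, .crc files),
        # so they should not count as input data.
        if f.is_file() and not f.name.startswith(("_", ".")) and f.stat().st_size > 0:
            return True
    return False
```

If the check returns False for an input directory, the data (or an upstream step that writes it) is the more likely culprit than the deduplication script itself.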