text-dedup
All-in-one text de-duplication
ModuleNotFoundError: No module named 'text_dedup.embedders' when running `from text_dedup.embedders.minhash import MinHashEmbedder`
How should max_iteration be set to get a better trade-off between efficiency and accuracy? Normally, when we're using Spark clusters, we may deal with TBs of...
When I run minhash_spark.py with spark-submit, the program exits with this bug. I searched for the keyword in the whole project and did not find any related code and also...
Currently, there is no real way to import a deduplication algorithm and use it as a dependency in my Python code without almost totally rewriting the content of main (the code...
- https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py
- Reference: https://github.com/alibaba/data-juicer/blob/main/data_juicer/core/ray_executor.py
- Ray is simpler and faster than Spark
Hi @ChenghaoMou, I have been using `minhash_spark.py` via GCP Dataproc (with all the code for `bigcode` removed) to deduplicate my multilingual dataset. To get an understanding of the reproducibility...
Data reading failed
![image](https://github.com/ChenghaoMou/text-dedup/assets/31037013/616761bd-18ed-4028-ac9b-a2bec2297841)
When I use a Spark cluster to run minhash_spark.py, I occasionally encounter [UNABLE-TO-INFER-SCHEMA] errors, as shown in the following figure. I don't know if it's a problem with the data...