text-dedup
How about adding a Ray executor for deduplication, alongside the existing Spark one?
- https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py
- reference: https://github.com/alibaba/data-juicer/blob/main/data_juicer/core/ray_executor.py
- Ray is simpler and faster than Spark
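
To make the request more concrete, here is a rough sketch of what a Ray-based MinHash dedup could look like. It is not taken from `minhash_spark.py` or from data-juicer's `ray_executor.py`; every name in it (`signature`, `band_keys`, `dedup`, `NUM_PERM`, `BANDS`, `NGRAM`) is hypothetical, and the parameter values are placeholders. The only Ray calls used are `ray.init`, `ray.remote`, and `ray.get`, and the sketch falls back to a plain loop when Ray is not installed.

```python
import hashlib
from collections import defaultdict

try:
    import ray  # optional: without it the sketch runs in a local loop
except ImportError:
    ray = None

# Hypothetical parameters, chosen only for illustration.
NUM_PERM = 64              # MinHash signature length
BANDS = 16                 # number of LSH bands
ROWS = NUM_PERM // BANDS   # signature rows per band
NGRAM = 3                  # word n-gram (shingle) size


def signature(text):
    """MinHash signature: for each seed, the minimum hash over the word n-grams."""
    tokens = text.lower().split()
    count = max(1, len(tokens) - NGRAM + 1)
    shingles = {" ".join(tokens[i:i + NGRAM]) for i in range(count)}
    sig = []
    for seed in range(NUM_PERM):
        salt = seed.to_bytes(8, "little")  # one salted hash function per seed
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ))
    return sig


def band_keys(doc_id, text):
    """Split the signature into bands; documents sharing a band key are candidates."""
    sig = signature(text)
    return [(doc_id, (b, tuple(sig[b * ROWS:(b + 1) * ROWS]))) for b in range(BANDS)]


def dedup(docs, use_ray=False):
    """Keep the first document of every near-duplicate cluster."""
    if use_ray and ray is not None:
        ray.init(ignore_reinit_error=True)
        remote_band_keys = ray.remote(band_keys)  # one Ray task per document
        results = ray.get([remote_band_keys.remote(i, d) for i, d in enumerate(docs)])
    else:
        results = [band_keys(i, d) for i, d in enumerate(docs)]

    # Group document ids by band key.
    buckets = defaultdict(list)
    for keys in results:
        for doc_id, key in keys:
            buckets[key].append(doc_id)

    # Union-find on the driver; the smallest id in a cluster becomes its root.
    parent = list(range(len(docs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for ids in buckets.values():
        for other in ids[1:]:
            a, b = find(ids[0]), find(other)
            parent[max(a, b)] = min(a, b)

    return [d for i, d in enumerate(docs) if find(i) == i]
```

Calling `dedup(docs, use_ray=True)` would fan the signature step out as Ray tasks. The clustering step still runs on the driver here, so a real executor would probably need to distribute that part too, for example with Ray Data or actors.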