gaoya
Distributed near-duplicate detection and deduplication using Dask.
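The sketch below illustrates roughly how MinHash-based deduplication can be spread over Dask workers, in the spirit of the implementations referenced in this README. It is not this package's actual API: the function names (`shingles`, `minhash`, `band_keys`, `keep_representatives`, `dedup_with_dask`) and the parameters `NUM_PERM` and `BANDS` are assumptions chosen for illustration. It only relies on `dask.bag.from_sequence`, `Bag.map`, `Bag.flatten`, `Bag.foldby`, and `Bag.compute`.

```python
import hashlib

NUM_PERM = 64  # signature length (illustrative value, not a package default)
BANDS = 16     # number of LSH bands; each band covers NUM_PERM // BANDS rows


def shingles(text, n=3):
    """Return the set of lowercase word n-grams of a text."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: the minimum hash over all shingles, per seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def band_keys(doc):
    """Map (doc_id, text) to one ((band, signature slice), doc_id) pair per band."""
    doc_id, text = doc
    sig = minhash(text)
    rows = NUM_PERM // BANDS
    return [((b, sig[b * rows:(b + 1) * rows]), doc_id) for b in range(BANDS)]


def keep_representatives(doc_ids, buckets):
    """Union-find over LSH buckets; return the smallest id of each cluster."""
    parent = {i: i for i in doc_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for ids in buckets:
        root = min(find(i) for i in ids)
        for i in ids:
            parent[find(i)] = root
    return sorted(i for i in parent if find(i) == i)


def dedup_with_dask(docs, npartitions=2):
    """Hypothetical driver: signatures and banding run on Dask workers,
    clustering of the (much smaller) bucket lists runs on the client."""
    import dask.bag as db  # imported here so the helpers above work without Dask

    bag = db.from_sequence(docs, npartitions=npartitions)
    grouped = (
        bag.map(band_keys)
        .flatten()
        .foldby(
            lambda kv: kv[0],                # group by (band, signature slice)
            lambda acc, kv: acc + [kv[1]],   # collect doc ids within a partition
            [],
            lambda a, b: a + b,              # merge id lists across partitions
        )
        .compute()
    )
    buckets = [ids for _, ids in grouped if len(ids) > 1]
    keep = set(keep_representatives([i for i, _ in docs], buckets))
    return [(i, t) for i, t in docs if i in keep]
```

Calling `dedup_with_dask` on a list of `(doc_id, text)` pairs keeps one document per cluster, namely the one with the smallest id. Exact duplicates always collide in every band; whether near-duplicates collide depends on their similarity and on the choice of `NUM_PERM` and `BANDS`.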
References:
- https://github.com/xorbitsai/xorbits/blob/main/python/xorbits/experimental/dedup.py
- https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py
- https://github.com/FlagOpen/FlagData/blob/main/flagdata/deduplication/minhash.py