Daft
Can you provide an example of large-scale text deduplication, along the lines of the following?
- https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py
- https://github.com/phdinds-aim/alis/blob/68c7f56a08fa5cfe10638ea45292914620c9f5cf/notebooks/lsh-for-minhash/05_demo_minhash_lsh.ipynb
- https://github.com/NVIDIA/NeMo-Curator/blob/main/nemo_curator/scripts/fuzzy_deduplication/README.md
- https://xorbits.io/blogs/text-deduplicate
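
For context, the links above appear to share one underlying technique: MinHash signatures combined with locality-sensitive hashing (LSH). Here is a minimal single-machine sketch of that idea in plain Python. It does not use Daft's API, and the function names, `NUM_PERM`, `BANDS`, and the `threshold` default are all illustrative assumptions rather than values taken from any of the linked projects.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64             # assumed signature length
BANDS = 16                # assumed number of LSH bands
ROWS = NUM_PERM // BANDS  # rows (hash values) per band


def shingles(text, n=3):
    """Return the set of word n-grams for a document."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(NUM_PERM))


def candidate_pairs(signatures):
    """Documents that share at least one identical band become candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            band = sig[b * ROWS:(b + 1) * ROWS]
            buckets[(b, band)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.7):
    """Return the ids to keep: the smallest id of each near-duplicate cluster."""
    signatures = {i: minhash(d) for i, d in docs.items()}
    parent = {i: i for i in docs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in candidate_pairs(signatures):
        # Verify each candidate pair with exact Jaccard before merging.
        if jaccard(shingles(docs[a]), shingles(docs[b])) >= threshold:
            ra, rb = find(a), find(b)
            parent[max(ra, rb)] = min(ra, rb)
    return sorted(i for i in docs if find(i) == i)
```

Calling `deduplicate({0: text_a, 1: text_b, ...})` returns the ids of the documents to keep. A distributed version would presumably spread the signature and bucketing steps across partitions, which is the part I would like to see done in Daft.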
Great idea! Let me work on something :)