simplew2011
Can you release your pretrained weights? Thanks.
https://github.com/huggingface/cosmopedia/blob/main/deduplication/deduplicate_dataset.py ``` 2024-02-22 14:17:57.759 | INFO | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh3" 2024-02-22 14:17:57.759 | INFO | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh2" 2024-02-22 14:17:57.759 | INFO | datatrove.executor.slurm:launch_job:216...
### Describe the bug - When reading a dataset, a cache is generated in the ~/.cache/huggingface/datasets directory - When using .map and .filter operations, a runtime cache is generated...
Can you provide an example of distributed text deduplication based on dask, such as: - https://github.com/xorbitsai/xorbits/blob/main/python/xorbits/experimental/dedup.py - https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py - https://github.com/FlagOpen/FlagData/blob/main/flagdata/deduplication/minhash.py
- https://github.com/NVIDIA/NeMo-Curator/tree/main/nemo_curator/scripts/fuzzy_deduplication
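For context on what such an example would need to cover, here is a minimal single-process sketch of MinHash + LSH deduplication using only the Python standard library. It is not dask-based and is not taken from any of the linked projects; the names `shingles`, `signature`, `find_duplicates` and the parameters `NUM_PERM`, `BANDS` are all hypothetical choices for illustration. In a distributed version, computing `signature` per document would presumably be the map step, and the bucket grouping would be the shuffle step.

```python
import hashlib
from collections import defaultdict

# Illustrative parameters (assumptions, not tuned values).
NUM_PERM = 64              # number of hash functions per signature
BANDS = 16                 # number of LSH bands
ROWS = NUM_PERM // BANDS   # rows per band


def shingles(text, n=3):
    """Split text into a set of word n-grams."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash; each seed acts as one 'permutation'."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def signature(text):
    """MinHash signature: the minimum hash over all shingles, per seed."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(NUM_PERM))


def find_duplicates(docs):
    """Return indices to drop, keeping the lowest index of each cluster."""
    buckets = defaultdict(list)
    for idx, doc in enumerate(docs):
        sig = signature(doc)
        for b in range(BANDS):
            band = sig[b * ROWS:(b + 1) * ROWS]
            buckets[(b, band)].append(idx)

    # Union-find over documents that share at least one band bucket.
    parent = list(range(len(docs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for members in buckets.values():
        for other in members[1:]:
            ra, rb = find(members[0]), find(other)
            if ra != rb:
                # The smaller index always becomes the root.
                parent[max(ra, rb)] = min(ra, rb)

    return {i for i in range(len(docs)) if find(i) != i}
```

Documents that share any band bucket are treated as duplicates without a second Jaccard check, so a real pipeline would likely verify candidate pairs before dropping them.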
- https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/models/GroundingDINO/fuse_modules.py#L184 - Could this be the reason for the accuracy drop with TensorRT FP16?
I haven't seen any code submissions recently.
- pip install gaoya - PyPI only has release 0.2.0 - the GitHub code has __version__ = "0.1.3" - https://pypi.org/project/gaoya/
reference: - https://github.com/xorbitsai/xorbits/blob/main/python/xorbits/experimental/dedup.py - https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py - https://github.com/FlagOpen/FlagData/blob/main/flagdata/deduplication/minhash.py
- SFT dataset support.