simplew2011 issues

Results: 67 issues by simplew2011

pretrained weight

Can you release your pretrained weights? Thanks.

https://github.com/huggingface/cosmopedia/blob/main/deduplication/deduplicate_dataset.py

```
2024-02-22 14:17:57.759 | INFO | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh3"
2024-02-22 14:17:57.759 | INFO | datatrove.executor.slurm:launch_job:216 - Launching dependency job "mh2"
2024-02-22 14:17:57.759 | INFO | datatrove.executor.slurm:launch_job:216...
```

cache in nfs error

### Describe the bug

- When reading a dataset, a cache is generated in the ~/.cache/huggingface/datasets directory.
- When using .map and .filter operations, a runtime cache is generated...

a tutorial for distributed text deduplication

Can you provide an example of distributed text deduplication based on dask, such as:

- https://github.com/xorbitsai/xorbits/blob/main/python/xorbits/experimental/dedup.py
- https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py
- https://github.com/FlagOpen/FlagData/blob/main/flagdata/deduplication/minhash.py
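For orientation, the general MinHash technique behind the scripts linked above can be sketched in plain Python. This is a single-process illustration, not a dask implementation and not code from any of the linked projects. The names `shingles`, `minhash`, `band_keys` and `deduplicate`, and the constants `NUM_PERM`, `BANDS` and `NGRAM`, are hypothetical choices made for this sketch.

```python
import hashlib
from collections import defaultdict

# All three values are assumptions chosen for illustration.
NUM_PERM = 64  # number of hash functions per signature
BANDS = 16     # LSH bands; each band covers NUM_PERM // BANDS rows
NGRAM = 3      # word n-gram size used for shingling


def shingles(text, n=NGRAM):
    """Return the set of word n-grams of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, token):
    """Seeded 64-bit hash of a token."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: the minimum hash of all shingles, per seed."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(NUM_PERM))


def band_keys(signature):
    """Split a signature into (band index, band values) bucket keys."""
    rows = NUM_PERM // BANDS
    return [(b, signature[b * rows:(b + 1) * rows]) for b in range(BANDS)]


def deduplicate(docs):
    """Return the indices of the documents kept (first of each cluster)."""
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Documents that share at least one band land in the same bucket.
    buckets = defaultdict(list)
    for idx, doc in enumerate(docs):
        for key in band_keys(minhash(doc)):
            buckets[key].append(idx)

    # Merge every bucket into one cluster, rooted at its smallest index.
    for members in buckets.values():
        for other in members[1:]:
            a, b = find(members[0]), find(other)
            if a != b:
                parent[max(a, b)] = min(a, b)

    return sorted({find(i) for i in range(len(docs))})
```

In a distributed setting, the signature step is typically mapped over data partitions and the bucket grouping becomes a shuffle or group-by on the band keys. How that maps onto dask specifically is left open here.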

documentation

Can we consider using dask for distributed deduplication?

- https://github.com/NVIDIA/NeMo-Curator/tree/main/nemo_curator/scripts/fuzzy_deduplication

why clamp attn_weights with min-max

- https://github.com/IDEA-Research/GroundingDINO/blob/main/groundingdino/models/GroundingDINO/fuse_modules.py#L184
- Is it possible that this is the reason for the accuracy drop with TensorRT FP16?
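As background to the question, a min-max clamp like this is commonly used to keep values inside the finite range of half precision, whose largest finite value is 65504. The sketch below illustrates that range limit with Python's standard `struct` module rather than PyTorch or TensorRT. The bound `CLAMP = 50000.0` is an assumption made for this example and should be checked against the linked line, and `fits_fp16` and `clamp` are hypothetical helper names. It does not establish whether the clamp causes the accuracy drop.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value
CLAMP = 50000.0     # assumed clamp bound; verify against the linked source


def fits_fp16(x):
    """Return True if x can be packed as a finite half-precision float."""
    try:
        struct.pack("e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False


def clamp(x, lo=-CLAMP, hi=CLAMP):
    """Scalar equivalent of clamping a value into [lo, hi]."""
    return max(lo, min(hi, x))


raw_weight = 70000.0  # hypothetical attention logit above the FP16 range
print(fits_fp16(raw_weight))         # False
print(fits_fp16(clamp(raw_weight)))  # True
```

A clamp of this kind changes every value beyond the bound, so comparing FP32 and FP16 outputs with and without it is one way to test the hypothesis raised in the issue.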

Is the code no longer updated?

I haven't seen any code submissions recently.

pip install gaoya

- pip install gaoya
- PyPI only has the 0.2.0 release.
- The GitHub code has `__version__ = "0.1.3"`.
- https://pypi.org/project/gaoya/

using dask for distributed deduplication.

References:

- https://github.com/xorbitsai/xorbits/blob/main/python/xorbits/experimental/dedup.py
- https://github.com/ChenghaoMou/text-dedup/blob/main/text_dedup/minhash_spark.py
- https://github.com/FlagOpen/FlagData/blob/main/flagdata/deduplication/minhash.py

Is there a plan to process data in SFT format?

- SFT dataset support.