Mehran Maghoumi
Besides the toy examples listed in the docs and tests, are there actual examples of this library available anywhere? I'm interested in using this library for a sequence labeling project,...
## Description
This PR ensures that users can run the PEFT SDG tutorial using arbitrary API endpoints by exposing the URL that is used for synthetic data generation.
## Checklist...
**Describe the bug**
When attempting to run fuzzy deduplication on a dataset that has no duplicates, the code errors out.
**Steps/Code to reproduce bug**
1) Clone the repo
2) Run...
I've been running some large-scale benchmarking with minhash deduplication on SLURM clusters, loosely following [this example](https://github.com/huggingface/datatrove/blob/main/examples/minhash_deduplication.py). The benchmarks consist of running stages 1 and 2 with the following configurations: *...