datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
If I have more than 1k tasks, datatrove splits them into multiple job arrays of 1k each. The first job array of 1k runs fine, but the subsequent ones all fail ``` 0:...
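For context, the 1k-per-array split described above can be sketched with a small stdlib-only helper. This is an illustration of the chunking arithmetic, not datatrove's actual submission code; the function name and the 1000-task cap (Slurm clusters commonly limit array size via `MaxArraySize`) are assumptions:

```python
import math

def split_into_job_arrays(n_tasks: int, max_array_size: int = 1000):
    """Split n_tasks into consecutive chunks of at most max_array_size,
    mimicking how a launcher might submit several Slurm job arrays."""
    n_arrays = math.ceil(n_tasks / max_array_size)
    return [
        range(i * max_array_size, min((i + 1) * max_array_size, n_tasks))
        for i in range(n_arrays)
    ]

arrays = split_into_job_arrays(2500)
print([len(a) for a in arrays])  # → [1000, 1000, 500]
```

If only the first array succeeds, the later arrays (covering task ids 1000 and up) are the ones to inspect.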
When using a local executor, the running logs appear right away in the console it was launched from. But when using Slurm, one has to fish for the log files....
I am working on a pipeline similar to FineWeb, and the time has come to start scaling up. I am curious: what were the specs for the Slurm cluster used...
This is a very cool library! Kudos to the authors! The Filter API seems to be only working with a single item at a time. Is there a way to...
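One common workaround for a per-item filter API is to buffer documents yourself and score them in batches (useful when the predicate is, say, a GPU model that benefits from batched inference). The sketch below uses only the stdlib; the names and the batching pattern are illustrative assumptions, not datatrove's Filter API:

```python
from typing import Callable, Iterable, Iterator, List

def batched_filter(
    docs: Iterable[str],
    predicate: Callable[[List[str]], List[bool]],
    batch_size: int = 32,
) -> Iterator[str]:
    """Buffer documents, call the batch predicate once per full batch,
    and yield only the documents it marks as keepable."""
    buffer: List[str] = []
    for doc in docs:
        buffer.append(doc)
        if len(buffer) == batch_size:
            for item, keep in zip(buffer, predicate(buffer)):
                if keep:
                    yield item
            buffer = []
    if buffer:  # flush the final partial batch
        for item, keep in zip(buffer, predicate(buffer)):
            if keep:
                yield item

# Toy usage: keep docs longer than 3 characters, scored two at a time.
kept = list(
    batched_filter(
        ["a", "abcd", "hello", "hi"],
        lambda batch: [len(d) > 3 for d in batch],
        batch_size=2,
    )
)
print(kept)  # → ['abcd', 'hello']
```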
So it appears that I currently can't instantiate a model on a GPU, because the filter object is created by the launcher, which either doesn't have a GPU, or it...
I am trying to process a CC dump using the LocalPipelineExecutor. My setup includes 6 files in the dump and a VM with 48 CPU cores. I run the code...
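A frequent source of under-utilization in this setup is that parallelism is bounded by the number of input shards: if files are assigned to tasks round-robin, only as many tasks as there are files get any work, regardless of how many CPU cores are available. The sketch below illustrates that assignment pattern with stdlib Python only; it is an assumed model of the sharding, not datatrove's actual implementation:

```python
def files_for_task(files, task_id: int, n_tasks: int):
    """Round-robin shard assignment: task k processes every n_tasks-th
    file starting at index k, so with 6 files and 48 tasks,
    tasks 6..47 receive no work at all."""
    return files[task_id::n_tasks]

files = [f"dump_{i}.warc.gz" for i in range(6)]
busy_tasks = [t for t in range(48) if files_for_task(files, t, 48)]
print(busy_tasks)  # → [0, 1, 2, 3, 4, 5]
```

Splitting the input into more (smaller) shards is one way to let the remaining cores participate.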
Added a shuffle option to the Hugging Face reader, along with test code for shuffling. Before merging this commit, please check the fixed seed value and the buffer size.
https://github.com/huggingface/datatrove/blob/1e27cc8819465d5246d89cd929423b76eb0bc5dd/src/datatrove/pipeline/dedup/minhash.py#L196
I am using the local executor. My machine has 48 CPUs with 348 GB of RAM. Any idea how to speed this up? Currently a single task (task=1, running for 1 warc.gz...
My current goal is to deduplicate **~750GB of text (around 750 jsonl files, each 1GB)**. My machine has **1TB RAM, 256 CPU cores**. I used the following config to run...