datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
If I have more than 1k tasks, datatrove splits them into multiple job arrays of 1k each. The first job array of 1k runs fine, but the subsequent ones all fail ``` 0:...
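For context, the 1k-per-array split described above can be sketched with a small stdlib-only helper. This is an illustration of the chunking arithmetic, not datatrove's actual submission code; the function name and the 1000-task cap (Slurm clusters commonly limit array size via `MaxArraySize`) are assumptions:

```python
import math

def split_into_job_arrays(n_tasks: int, max_array_size: int = 1000):
    """Split n_tasks into consecutive chunks of at most max_array_size,
    mimicking how a launcher might submit several Slurm job arrays."""
    n_arrays = math.ceil(n_tasks / max_array_size)
    return [
        range(i * max_array_size, min((i + 1) * max_array_size, n_tasks))
        for i in range(n_arrays)
    ]

arrays = split_into_job_arrays(2500)
print([len(a) for a in arrays])  # → [1000, 1000, 500]
```

If only the first array succeeds, the later arrays (covering task ids 1000 and up) are the ones to inspect.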
When using a local executor, the running logs appear right away in the console it was launched from. But when using Slurm, one has to fish for the log files....
I am working on a pipeline similar to FineWeb, and the time has come to start scaling up. I am curious: what were the specs for the Slurm cluster used...
This is a very cool library! Kudos to the authors! The Filter API seems to be only working with a single item at a time. Is there a way to...
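One common workaround for a per-item filter API is to buffer documents yourself and score them in batches (useful when the predicate is, say, a GPU model that benefits from batched inference). The sketch below uses only the stdlib; the names and the batching pattern are illustrative assumptions, not datatrove's Filter API:

```python
from typing import Callable, Iterable, Iterator, List

def batched_filter(
    docs: Iterable[str],
    predicate: Callable[[List[str]], List[bool]],
    batch_size: int = 32,
) -> Iterator[str]:
    """Buffer documents, call the batch predicate once per full batch,
    and yield only the documents it marks as keepable."""
    buffer: List[str] = []
    for doc in docs:
        buffer.append(doc)
        if len(buffer) == batch_size:
            for item, keep in zip(buffer, predicate(buffer)):
                if keep:
                    yield item
            buffer = []
    if buffer:  # flush the final partial batch
        for item, keep in zip(buffer, predicate(buffer)):
            if keep:
                yield item

# Toy usage: keep docs longer than 3 characters, scored two at a time.
kept = list(
    batched_filter(
        ["a", "abcd", "hello", "hi"],
        lambda batch: [len(d) > 3 for d in batch],
        batch_size=2,
    )
)
print(kept)  # → ['abcd', 'hello']
```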
So it appears that I currently can't instantiate a model on a GPU, because the filter object is created by the launcher, which either doesn't have a GPU, or it...
I am trying to process a CC dump using the LocalPipelineExecutor. My setup includes 6 files in the dump and a VM with 48 CPU cores. I run the code...
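A frequent source of under-utilization in this setup is that parallelism is bounded by the number of input shards: if files are assigned to tasks round-robin, only as many tasks as there are files get any work, regardless of how many CPU cores are available. The sketch below illustrates that assignment pattern with stdlib Python only; it is an assumed model of the sharding, not datatrove's actual implementation:

```python
def files_for_task(files, task_id: int, n_tasks: int):
    """Round-robin shard assignment: task k processes every n_tasks-th
    file starting at index k, so with 6 files and 48 tasks,
    tasks 6..47 receive no work at all."""
    return files[task_id::n_tasks]

files = [f"dump_{i}.warc.gz" for i in range(6)]
busy_tasks = [t for t in range(48) if files_for_task(files, t, 48)]
print(busy_tasks)  # → [0, 1, 2, 3, 4, 5]
```

Splitting the input into more (smaller) shards is one way to let the remaining cores participate.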
Added a shuffle option to the Hugging Face reader, along with test code for shuffling. Before merging this commit, please check the fixed seed value and the buffer size.
https://github.com/huggingface/datatrove/blob/1e27cc8819465d5246d89cd929423b76eb0bc5dd/src/datatrove/pipeline/dedup/minhash.py#L196
I am using the local executor. My machine has 48 CPUs with 348 GB of RAM. Any idea how to speed this up? Currently a single task (task=1, running for 1 warc.gz...
My current goal is to deduplicate **~750GB of text (around 750 jsonl files, each 1GB)**. My machine has **1TB RAM, 256 CPU cores**. I used the following config to run...