Stas Bekman

Results: 664 comments of Stas Bekman

`failed_logs` is different from `jobs_status` - it seems to want the low-level subdir, i.e. `logs/slurm_processing/` instead of `logs`, and it has the same issue with coloring. May I suggest to add...

@guipenedo, `max_array_launch_parallel=True` isn't doing what you proposed - it actually ends up running many more workers than configured. So if I submit 2k tasks and set 128 workers, it'd...

And another thing I can't figure out. The 250-file job was fine, but then I ran the same on the `CC-MAIN-2023-06` shard, which proved to be much bigger. So...

It just keeps on going: `rank=3129`

```
cat /data/stas/classify/data/logs/slurm_processing/slurm_logs/118829_129.out
Starting data processing job fin_class-CC-MAIN-2023-06
+ export PYTHONUNBUFFERED=TRUE
+ PYTHONUNBUFFERED=TRUE
+ srun -l launch_pickled_pipeline /data/stas/classify/data/logs/slurm_processing/executor.pik
0: 2024-07-12 00:56:16.392 | INFO |...
```

This seems to do the trick:

```
dist_executor.run()
print(f"*** Find the slurm logs under: {root_dir}/logs/slurm_processing/slurm_logs/ ")
if dist_executor.job_id != -1:
    print(f"tail -F {root_dir}/logs/slurm_processing/slurm_logs/{dist_executor.job_id}_0.out")
```

Hmm, possibly a test is needed? If I replace `filter` with `filter_batch`, it now fails: `TypeError: Can't instantiate abstract class ClassifierFilter with abstract method filter`, so I think it still...
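For anyone hitting the same error, Python's `abc` machinery reproduces it in isolation (the `Base`/`Derived` names below are illustrative stand-ins, not datatrove's actual classes): a subclass that only provides `filter_batch` leaves the abstract `filter` unimplemented, so instantiation fails.

```python
from abc import ABC, abstractmethod

class Base(ABC):
    # stand-in for a base class that declares `filter` as abstract
    @abstractmethod
    def filter(self, doc):
        ...

class Derived(Base):
    # only overrides filter_batch; the abstract `filter` is left undefined
    def filter_batch(self, batch):
        return [True] * len(batch)

try:
    Derived()
except TypeError as e:
    # message varies slightly across Python versions, but it's always
    # "Can't instantiate abstract class Derived ..."
    print(e)
```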

OK, so I sorted out how to get the classifier to work on the GPUs under pickle (https://github.com/huggingface/datatrove/issues/242#issuecomment-2219303285) - and I got a 10x speedup, but the GPUs are...

OK, I found a workaround, adding:

```
def filter(self, doc) -> bool | tuple[bool, str]:
    pass
```

I suppose that if `filter` is required then it should be defined in...
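A minimal sketch of why the no-op stub works (again with an illustrative stand-in base class, not datatrove's real one): defining `filter`, even with an empty body, marks the abstract method as implemented, so the class becomes concrete and instantiable.

```python
from __future__ import annotations  # lets `bool | tuple[bool, str]` parse pre-3.10
from abc import ABC, abstractmethod

class Base(ABC):
    # stand-in base class with `filter` declared abstract
    @abstractmethod
    def filter(self, doc):
        ...

class ClassifierFilter(Base):
    # the no-op stub satisfies the abstract-method check
    def filter(self, doc) -> bool | tuple[bool, str]:
        pass

    # the actual work happens in the batched method
    def filter_batch(self, batch):
        return [True] * len(batch)

f = ClassifierFilter()             # no TypeError now
print(f.filter_batch(["a", "b"]))  # [True, True]
```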

And now with the batched filter, what does the reported it/s mean? Is it batched it/s or something else? Is there a way to customize it to print the...

I'm trying:

```
@functools.cached_property
def device(self):
    return torch.device('cuda:0')

@functools.cached_property
def model(self):
    return ClassifierHead.from_pretrained(mname).to(self.device)

@functools.cached_property
def tokenizer(self):
    return AutoTokenizer.from_pretrained(mname)

@functools.cached_property
def config(self):
    return AutoConfig.from_pretrained(mname)
```

and then inside the filter only...
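This pattern also explains why it plays well with the pickled-pipeline setup: `functools.cached_property` stores nothing in the instance until first access, so the object pickles cleanly before the worker ever touches the model, and the expensive load happens lazily inside the worker process. A minimal sketch with a non-picklable stand-in "model" (the `Classifier` name is illustrative, not datatrove's API):

```python
import functools
import pickle
import threading

class Classifier:
    @functools.cached_property
    def model(self):
        # stand-in for the expensive, non-picklable GPU model:
        # a threading.Lock cannot be pickled, just like a CUDA model
        return threading.Lock()

clf = Classifier()
pickle.dumps(clf)               # fine: `model` not in clf.__dict__ yet
m = clf.model                   # first access creates and caches the value
assert "model" in clf.__dict__  # now cached on the instance
assert clf.model is m           # subsequent accesses reuse the cache
```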