Stas Bekman

664 comments by Stas Bekman

Is there a way to tell `datatrove` to pass `tasks=-1` and have it figure out how many tasks it needs? Assuming that dataset has `__len__` - but then we use...

Thank you for working with me on this, @guipenedo - I'm currently testing your fix. The other weird thing about this splitting into 1k job arrays is that it won't...

> for this particular config you are using, since there are only 250 files available, tasks>250 will not do anything

I'm not sure what you mean by only 250 files...
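To illustrate the quoted point, here is a minimal sketch. It assumes files are assigned to tasks by striding (`files[rank::world_size]`), which is an assumption about the sharding scheme, not something stated in this thread. The helper `shard_files` and the file names are hypothetical. Under that assumption, with 250 files, every rank from 250 upward receives an empty list.

```python
def shard_files(files, rank, world_size):
    """Return the files a given task (rank) would process under strided sharding."""
    return files[rank::world_size]


# Hypothetical file list standing in for the 250 available input files.
files = [f"shard_{i:05d}.parquet" for i in range(250)]

world_size = 1000  # more tasks than files

# Ranks that actually receive at least one file.
busy = [r for r in range(world_size) if shard_files(files, r, world_size)]
idle = world_size - len(busy)

print(len(busy), idle)
```

In this sketch the extra tasks still get scheduled, they just have nothing to read.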

> The dependencies for these arrays are "afterany:" and not "afterok:", so if one crashes the next job array should still launch. If you want them to run concurrently you...
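The difference between the two dependency types can be sketched as follows. `afterany` and `afterok` are standard Slurm `--dependency` types. The helper `dependency_flag` and the second job ID are hypothetical, and this is not a description of how datatrove builds its submission command.

```python
def dependency_flag(job_ids, require_success=False):
    """Build an sbatch --dependency flag for a chain of job arrays.

    afterany: start once the listed jobs have ended, whatever their exit state.
    afterok:  start only if the listed jobs completed successfully.
    """
    kind = "afterok" if require_success else "afterany"
    return f"--dependency={kind}:" + ":".join(str(j) for j in job_ids)


# With afterany, a crashed array does not block the next one.
print(dependency_flag([98217]))
# With afterok, the next array waits for a clean exit of all listed jobs.
print(dependency_flag([98217, 98218], require_success=True))
```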

To examples of datatrove+fineweb - it'd be very useful to have more of those. Incidentally, any idea why datatrove wasn't used for fineweb-edu and HF Trainer was used instead? https://github.com/huggingface/cosmopedia/tree/main/classification It...

oh, and the other thing I noticed: it creates one job array too many that never gets satisfied

```
JOBID PARTITION NAME STATE TIME TIME_LIM NODES START_TIME NODELIST(REASON)
98217_[0%128] a3mixed...
```

> I think the slurm file you linked is to just train the classifier, which only takes a relatively small number of annotated samples. Actually classifying all of FineWeb took...

ok, so I launched again with no `limit` on that single shard. How do I interpret this part of the log?

```
4711it [12:06, 5.89it/s]
4768it [12:14, 7.48it/s]/s]
4812it [12:22,...
```

Also, this might be of interest to you:

```
0: 2024-07-08 23:22:45.541 | INFO | datatrove.utils.logging:add_task_logger:58 - Launching pipeline for rank=0
0: 2024-07-08 23:22:45.541 | INFO | datatrove.utils.logging:log_pipeline:90 - 0:...
```

re: `jobs_status` - good to know

1. can it be documented please?
2. the output coloring is an issue again, the output is not readable as it assumes the dark...