datatrove fails to handle tasks >1k with slurm job arrays
If I have more than 1k tasks, datatrove splits them into multiple job arrays of 1k each. The first 1k-task array runs fine; every subsequent array fails.
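For context, the launch is shaped roughly like this (a sketch, not my exact script: the dataset name, the filter, the paths, and the Slurm options are placeholders; `streaming=True` is the part that needs main):

```python
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.filters import LanguageFilter
from datatrove.pipeline.readers import HuggingFaceDatasetReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

executor = SlurmPipelineExecutor(
    pipeline=[
        # streaming a dataset is only supported on datatrove@main
        HuggingFaceDatasetReader(
            "HuggingFaceFW/fineweb",            # placeholder dataset
            dataset_options={"split": "train"},
            streaming=True,
        ),
        LanguageFilter(),  # stand-in for my slightly modified FineWeb filter
        JsonlWriter("/path/to/output"),         # placeholder path
    ],
    tasks=100_000,  # > 1k, so datatrove submits several job arrays of 1k each
    time="24:00:00",
    partition="partition-name",                 # placeholder
    logging_dir="/path/to/logs",                # placeholder
)
executor.run()
```

Every task in the arrays after the first one dies the same way: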
```
0: 2024-07-04 23:59:34.496 | ERROR | datatrove.executor.base:_run_for_rank:108 - list index out of range
0: Traceback (most recent call last):
0:
0: File "/env/lib/conda/ctx-shared/bin/launch_pickled_pipeline", line 8, in <module>
0: sys.exit(main())
0: │ │ └ <function main at 0x7f4b5f9fd6c0>
0: │ └ <built-in function exit>
0: └ <module 'sys' (built-in)>
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/tools/launch_pickled_pipeline.py", line 18, in main
0: executor.run()
0: │ └ <function SlurmPipelineExecutor.run at 0x7f4b5ccee4d0>
0: └ <datatrove.executor.slurm.SlurmPipelineExecutor object at 0x7f4b5ccb7c10>
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 180, in run
0: self._run_for_rank(rank)
0: │ │ └ 7113
0: │ └ <function PipelineExecutor._run_for_rank at 0x7f4b5ccedb40>
0: └ <datatrove.executor.slurm.SlurmPipelineExecutor object at 0x7f4b5ccb7c10>
0: > File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/executor/base.py", line 96, in _run_for_rank
0: deque(pipelined_data, maxlen=0)
0: │ └ <generator object DiskWriter.run at 0x7f4a9570eab0>
0: └ <class 'collections.deque'>
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/writers/disk_base.py", line 176, in run
0: for document in data:
0: └ <generator object BaseFilter.run at 0x7f4a9570dcb0>
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/filters/base_filter.py", line 47, in run
0: for doc in data:
0: └ <generator object HuggingFaceDatasetReader.run at 0x7f4a9ad4edc0>
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/readers/huggingface.py", line 96, in run
0: shard = self._get_dataset_shard(ds, rank, world_size)
0: │ │ │ │ └ 100000
0: │ │ │ └ 7113
0: │ │ └ IterableDataset({
0: │ │ features: ['text', 'id', 'dump', 'url', 'date', 'file_path', 'language', 'language_score', 'token_count...
0: │ └ <function HuggingFaceDatasetReader._get_dataset_shard at 0x7f4b5ccef5b0>
0: └ 📖 - READER: 🤗 HuggingFace
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/readers/huggingface.py", line 69, in _get_dataset_shard
0: ex_iterable = dst._ex_iterable.shard_data_sources(rank, world_size)
0: │ │ │ │ └ 100000
0: │ │ │ └ 7113
0: │ │ └ <function ArrowExamplesIterable.shard_data_sources at 0x7f48750bcaf0>
0: │ └ <datasets.iterable_dataset.ArrowExamplesIterable object at 0x7f47a012ea10>
0: └ IterableDataset({
0: features: ['text', 'id', 'dump', 'url', 'date', 'file_path', 'language', 'language_score', 'token_count...
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 298, in shard_data_sources
0: requested_gen_kwargs = _merge_gen_kwargs([gen_kwargs_list[i] for i in shard_indices])
0: │ │ └ []
0: │ └ [{'files': [<datasets.download.streaming_download_manager.FilesIterable object at 0x7f4b4dd99fc0>]}, {'files': [<datasets.dow...
0: └ <function _merge_gen_kwargs at 0x7f487508f880>
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datasets/utils/sharding.py", line 76, in _merge_gen_kwargs
0: for key in gen_kwargs_list[0]
0: └ []
0:
0: IndexError: list index out of range
0: Traceback (most recent call last):
0: File "/env/lib/conda/ctx-shared/bin/launch_pickled_pipeline", line 8, in <module>
0: sys.exit(main())
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/tools/launch_pickled_pipeline.py", line 18, in main
0: executor.run()
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 180, in run
0: self._run_for_rank(rank)
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/executor/base.py", line 109, in _run_for_rank
0: raise e
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/executor/base.py", line 96, in _run_for_rank
0: deque(pipelined_data, maxlen=0)
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/writers/disk_base.py", line 176, in run
0: for document in data:
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/filters/base_filter.py", line 47, in run
0: for doc in data:
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/readers/huggingface.py", line 96, in run
0: shard = self._get_dataset_shard(ds, rank, world_size)
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/readers/huggingface.py", line 69, in _get_dataset_shard
0: ex_iterable = dst._ex_iterable.shard_data_sources(rank, world_size)
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 298, in shard_data_sources
0: requested_gen_kwargs = _merge_gen_kwargs([gen_kwargs_list[i] for i in shard_indices])
0: File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datasets/utils/sharding.py", line 76, in _merge_gen_kwargs
0: for key in gen_kwargs_list[0]
0: IndexError: list index out of range
srun: error: dojo-a3-ghpc-41: task 0: Exited with exit code 1
```
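Reading the locals in the traceback: `rank` is 7113, `world_size` is 100000, and `shard_indices` comes back empty, so `_merge_gen_kwargs` indexes into an empty list. Below is a self-contained sketch of that failure mode as I understand it; the merge mirrors what the traceback shows in `datasets/utils/sharding.py`, while the round-robin shard assignment and the shard count are my assumptions:

```python
# Sketch: a streaming dataset backed by fewer files than world_size leaves
# high ranks with an empty shard list, and merging that empty list blows up.

def merge_gen_kwargs(gen_kwargs_list):
    # Mirrors datasets.utils.sharding._merge_gen_kwargs from the traceback:
    # it reads its keys from gen_kwargs_list[0], hence IndexError when empty.
    return {key: [v for kwargs in gen_kwargs_list for v in kwargs[key]] for key in gen_kwargs_list[0]}

n_shards = 2_000      # assumption: number of files backing the dataset
world_size = 100_000  # from the locals above
rank = 7_113          # from the locals above

gen_kwargs_list = [{"files": [f"file-{i}.parquet"]} for i in range(n_shards)]
# Round-robin assignment of shards to ranks: empty once rank >= n_shards
shard_indices = list(range(rank, n_shards, world_size))
assert shard_indices == []
merge_gen_kwargs([gen_kwargs_list[i] for i in shard_indices])  # IndexError: list index out of range
```

If that's right, any rank at or above the dataset's underlying file count gets zero data sources, which would match the first 1k-task array succeeding while the later arrays crash.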
This failing behavior is consistent.
This is with datatrove@main:
- I can't use the official release, as it doesn't support datasets streaming.
The pipeline itself is just a slightly modified version of the FineWeb filter from the example (sketched at the top of this report).
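If my read above is right, the mismatch should be visible before launching anything, since a streaming dataset exposes its shard count up front (quick check; the dataset name is again a placeholder):

```python
from datasets import load_dataset

ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
# n_shards is the number of underlying data sources; under the assignment
# sketched above, any rank >= this number would get an empty shard list.
print(ds.n_shards)
```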