
datatrove fails to handle tasks >1k with slurm job arrays

Open • stas00 opened this issue 7 months ago • 25 comments

If I have more than 1k tasks, datatrove splits them into multiple job arrays of 1k each.
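For illustration, here is a hypothetical sketch of how a global rank like the 7113 seen in the traceback below could be derived once the work is split into chained 1k-wide arrays. The variable names beyond SLURM_ARRAY_TASK_ID are invented for this sketch and are not datatrove's actual implementation:

```python
import os

# Hypothetical sketch only: shows how a global rank such as 7113 can arise
# when >1k tasks are split into chained 1k-wide Slurm job arrays.
# RUN_OFFSET is an invented name, not datatrove's actual mechanism.
ARRAY_SIZE = 1000  # common per-array chunk width under Slurm's MaxArraySize limit
array_index = int(os.environ.get("SLURM_ARRAY_TASK_ID", "113"))
chunk = int(os.environ.get("RUN_OFFSET", "7"))  # which 1k-chunk this array covers
rank = chunk * ARRAY_SIZE + array_index  # 7 * 1000 + 113 = 7113
print(rank)
```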

The first job array of 1k runs fine; the subsequent ones all fail:

0: 2024-07-04 23:59:34.496 | ERROR    | datatrove.executor.base:_run_for_rank:108 - list index out of range
0: Traceback (most recent call last):
0: 
0:   File "/env/lib/conda/ctx-shared/bin/launch_pickled_pipeline", line 8, in <module>
0:     sys.exit(main())
0:     │   │    └ <function main at 0x7f4b5f9fd6c0>
0:     │   └ <built-in function exit>
0:     └ <module 'sys' (built-in)>
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/tools/launch_pickled_pipeline.py", line 18, in main
0:     executor.run()
0:     │        └ <function SlurmPipelineExecutor.run at 0x7f4b5ccee4d0>
0:     └ <datatrove.executor.slurm.SlurmPipelineExecutor object at 0x7f4b5ccb7c10>
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 180, in run
0:     self._run_for_rank(rank)
0:     │    │             └ 7113
0:     │    └ <function PipelineExecutor._run_for_rank at 0x7f4b5ccedb40>
0:     └ <datatrove.executor.slurm.SlurmPipelineExecutor object at 0x7f4b5ccb7c10>
0: > File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/executor/base.py", line 96, in _run_for_rank
0:     deque(pipelined_data, maxlen=0)
0:     │     └ <generator object DiskWriter.run at 0x7f4a9570eab0>
0:     └ <class 'collections.deque'>
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/writers/disk_base.py", line 176, in run
0:     for document in data:
0:                     └ <generator object BaseFilter.run at 0x7f4a9570dcb0>
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/filters/base_filter.py", line 47, in run
0:     for doc in data:
0:                └ <generator object HuggingFaceDatasetReader.run at 0x7f4a9ad4edc0>
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/readers/huggingface.py", line 96, in run
0:     shard = self._get_dataset_shard(ds, rank, world_size)
0:             │    │                  │   │     └ 100000
0:             │    │                  │   └ 7113
0:             │    │                  └ IterableDataset({
0:             │    │                        features: ['text', 'id', 'dump', 'url', 'date', 'file_path', 'language', 'language_score', 'token_count...
0:             │    └ <function HuggingFaceDatasetReader._get_dataset_shard at 0x7f4b5ccef5b0>
0:             └ 📖 - READER: 🤗 HuggingFace
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/readers/huggingface.py", line 69, in _get_dataset_shard
0:     ex_iterable = dst._ex_iterable.shard_data_sources(rank, world_size)
0:                   │   │            │                  │     └ 100000
0:                   │   │            │                  └ 7113
0:                   │   │            └ <function ArrowExamplesIterable.shard_data_sources at 0x7f48750bcaf0>
0:                   │   └ <datasets.iterable_dataset.ArrowExamplesIterable object at 0x7f47a012ea10>
0:                   └ IterableDataset({
0:                         features: ['text', 'id', 'dump', 'url', 'date', 'file_path', 'language', 'language_score', 'token_count...
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 298, in shard_data_sources
0:     requested_gen_kwargs = _merge_gen_kwargs([gen_kwargs_list[i] for i in shard_indices])
0:                            │                  │                           └ []
0:                            │                  └ [{'files': [<datasets.download.streaming_download_manager.FilesIterable object at 0x7f4b4dd99fc0>]}, {'files': [<datasets.dow...
0:                            └ <function _merge_gen_kwargs at 0x7f487508f880>
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datasets/utils/sharding.py", line 76, in _merge_gen_kwargs
0:     for key in gen_kwargs_list[0]
0:                └ []
0:
0: IndexError: list index out of range
0: Traceback (most recent call last):
0:   File "/env/lib/conda/ctx-shared/bin/launch_pickled_pipeline", line 8, in <module>
0:     sys.exit(main())
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/tools/launch_pickled_pipeline.py", line 18, in main
0:     executor.run()
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/executor/slurm.py", line 180, in run
0:     self._run_for_rank(rank)
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/executor/base.py", line 109, in _run_for_rank
0:     raise e
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/executor/base.py", line 96, in _run_for_rank
0:     deque(pipelined_data, maxlen=0)
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/writers/disk_base.py", line 176, in run
0:     for document in data:
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/filters/base_filter.py", line 47, in run
0:     for doc in data:
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/readers/huggingface.py", line 96, in run
0:     shard = self._get_dataset_shard(ds, rank, world_size)
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datatrove/pipeline/readers/huggingface.py", line 69, in _get_dataset_shard
0:     ex_iterable = dst._ex_iterable.shard_data_sources(rank, world_size)
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datasets/iterable_dataset.py", line 298, in shard_data_sources
0:     requested_gen_kwargs = _merge_gen_kwargs([gen_kwargs_list[i] for i in shard_indices])
0:   File "/env/lib/conda/ctx-shared/lib/python3.10/site-packages/datasets/utils/sharding.py", line 76, in _merge_gen_kwargs
0:     for key in gen_kwargs_list[0]
0: IndexError: list index out of range
srun: error: dojo-a3-ghpc-41: task 0: Exited with exit code 1
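Reading the annotated frames: with world_size=100000 and rank=7113, datasets assigns this rank an empty shard_indices list, so the list comprehension passed to _merge_gen_kwargs is [] and gen_kwargs_list[0] raises. A minimal self-contained sketch of that mechanism (the round-robin assignment below is a simplifying assumption for illustration, not the exact datasets logic):

```python
# Sketch of the failure mode: when world_size exceeds the number of file
# shards, high ranks are assigned no shards at all, and indexing [0] on the
# empty selection raises the IndexError seen above.
def shard_indices_for(rank: int, world_size: int, n_shards: int) -> list[int]:
    # round-robin assignment, a simplifying assumption for this sketch
    return list(range(rank, n_shards, world_size))

gen_kwargs_list = [{"files": [f"data_{i}.parquet"]} for i in range(4096)]
indices = shard_indices_for(rank=7113, world_size=100_000, n_shards=len(gen_kwargs_list))
selected = [gen_kwargs_list[i] for i in indices]  # [] -> this rank got zero shards
keys = {key for key in selected[0]}               # IndexError: list index out of range
```

Under this reading, the pattern in the report would fit a shard count somewhere between 1k and 7114: the first array's ranks all land below the shard count, while later ranks get empty assignments and crash.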

This failing behavior is consistent: it reproduces on every run.

This is with datatrove@main; I can't use the official release, as it doesn't support datasets streaming.

The pipeline is just a slightly modified version of the FineWeb filtering example.
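For reference, a minimal sketch of that kind of pipeline; the dataset name, filter threshold, partition, and paths below are placeholders, not the exact pipeline from this report:

```python
from datatrove.executor.slurm import SlurmPipelineExecutor
from datatrove.pipeline.filters import LambdaFilter
from datatrove.pipeline.readers import HuggingFaceDatasetReader
from datatrove.pipeline.writers import JsonlWriter

executor = SlurmPipelineExecutor(
    job_name="filter-example",
    pipeline=[
        HuggingFaceDatasetReader(
            "HuggingFaceFW/fineweb",          # placeholder dataset name
            dataset_options={"split": "train"},
            streaming=True,                   # needs datatrove@main, per the report
        ),
        LambdaFilter(lambda doc: doc.metadata["language_score"] > 0.5),
        JsonlWriter("/path/to/output"),       # placeholder path
    ],
    tasks=100_000,   # matches the world_size in the traceback; >1k spans arrays
    time="24:00:00",
    partition="some-partition",               # placeholder
    logging_dir="/path/to/logs",              # placeholder
)
executor.run()
```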

stas00 • Jul 05 '24 00:07