
Cannot use num_workers and prefetch_factor when using StatefulDataLoader (use_stateful_dataloader=True)

hkproj opened this issue 1 year ago

System Info

- `Accelerate` version: 0.34.2
- Platform: Linux-5.15.0-1057-aws-x86_64-with-glibc2.31
- `accelerate` bash location: /fsx/umar/miniconda3/envs/memory-efficient-transformers/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1999.99 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
        Not found

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction


from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

# Not shown in the original snippet: stateful dataloaders are enabled through
# the Accelerator config, per the issue title (use_stateful_dataloader=True).
accelerator = Accelerator(
    dataloader_config=DataLoaderConfiguration(use_stateful_dataloader=True)
)

dataset_streaming = True
ds_train = ...  # Dataset loaded with streaming=True (an IterableDataset)
train_batch_size = 12
collator = DataCollatorForLanguageModeling(...)
dataloader_num_workers = 4
dataloader_prefetch_factor = 10

dl_trainer = DataLoader(
    ds_train,
    batch_size=train_batch_size,
    collate_fn=collator,
    shuffle=not dataset_streaming,  # shuffling is disabled for streaming datasets
    drop_last=True,
    num_workers=dataloader_num_workers,
    prefetch_factor=dataloader_prefetch_factor,
    pin_memory=True,
)

# model, optimizer, scheduler and dl_eval are created elsewhere
model, optimizer, scheduler, dl_eval, dl_trainer = accelerator.prepare(
    model, optimizer, scheduler, dl_eval, dl_trainer
)

for _, batch in enumerate(dl_trainer):
    training_loop()

A DataLoader initialized with num_workers > 0 results in the following error when iterating over the DataLoader wrapper returned by accelerator.prepare:

[rank0]:     for _, batch in batch_enumerator:
[rank0]:   File "/fsx/umar/miniconda3/envs/memory-efficient-transformers/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
[rank0]:     for obj in iterable:
[rank0]:   File "/fsx/umar/miniconda3/envs/memory-efficient-transformers/lib/python3.10/site-packages/accelerate/data_loader.py", line 798, in __iter__
[rank0]:     next_batch, next_batch_info = self._fetch_batches(main_iterator)
[rank0]:   File "/fsx/umar/miniconda3/envs/memory-efficient-transformers/lib/python3.10/site-packages/accelerate/data_loader.py", line 751, in _fetch_batches
[rank0]:     self._update_state_dict()
[rank0]:   File "/fsx/umar/miniconda3/envs/memory-efficient-transformers/lib/python3.10/site-packages/accelerate/data_loader.py", line 479, in _update_state_dict
[rank0]:     self.adjust_state_dict_for_prefetch()
[rank0]:   File "/fsx/umar/miniconda3/envs/memory-efficient-transformers/lib/python3.10/site-packages/accelerate/data_loader.py", line 459, in adjust_state_dict_for_prefetch
[rank0]:     if self.dl_state_dict["_sampler_iter_yielded"] > 0:
[rank0]: KeyError: '_sampler_iter_yielded'

I also tried with the latest development version of accelerate (https://github.com/huggingface/accelerate@9f9951325c69f0a6c7c8ab00df2ab8af23b3c1fa) but I still get the same error.

@muellerzr is aware of this issue.
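
For what it's worth, the KeyError suggests that the state dict reported by torchdata's StatefulDataLoader has a different layout once worker processes are involved, so the top-level "_sampler_iter_yielded" key that accelerate's adjust_state_dict_for_prefetch expects is no longer there. Below is a minimal diagnostic sketch of my own (not part of the traceback above; it assumes torchdata is installed and only inspects which keys are present, without assuming anything about the multi-worker layout):

# Compare StatefulDataLoader.state_dict() with and without worker processes.
from torchdata.stateful_dataloader import StatefulDataLoader

if __name__ == "__main__":
    data = list(range(32))  # toy map-style dataset

    for workers in (0, 4):
        dl = StatefulDataLoader(data, batch_size=4, num_workers=workers)
        it = iter(dl)
        next(it)  # consume one batch so the state is non-trivial
        sd = dl.state_dict()
        print(f"num_workers={workers}: keys={sorted(sd.keys())}")
        print(f"  '_sampler_iter_yielded' present: {'_sampler_iter_yielded' in sd}")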

Expected behavior

I'd like to be able to prefetch multiple batches in advance, which is only possible by setting num_workers to a value greater than 0.
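
Until that works, one possible interim pattern (a sketch only, not an officially recommended workaround) is to leave use_stateful_dataloader at its default of False and resume by skipping already-consumed batches with Accelerator.skip_first_batches. The sketch below reuses the names from the reproduction above; resuming_from_checkpoint and completed_steps are hypothetical bookkeeping variables the training loop would persist alongside its checkpoints:

# Interim workaround sketch: no stateful dataloader, resume by skipping batches.
accelerator = Accelerator()  # use_stateful_dataloader left at its default (False)
model, optimizer, scheduler, dl_eval, dl_trainer = accelerator.prepare(
    model, optimizer, scheduler, dl_eval, dl_trainer
)

if resuming_from_checkpoint:
    # Skips (re-plays) the first `completed_steps` batches instead of restoring the
    # dataloader's internal state, so it can be slow for large streaming datasets
    # and only approximates true resumption, but num_workers/prefetch_factor stay usable.
    active_dl = accelerator.skip_first_batches(dl_trainer, completed_steps)
else:
    active_dl = dl_trainer

for _, batch in enumerate(active_dl):
    training_loop()  # persist `completed_steps` with each checkpoint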

hkproj avatar Sep 13 '24 14:09 hkproj

@muellerzr Hi, wondering if there has been any progress on this bug. 👀 I also ran into it when trying the latest accelerate.

yzhangcs avatar Dec 25 '24 17:12 yzhangcs

Same here. Any progress on this one? Setting "dataloader_num_workers" to anything other than "0" while this flag is enabled triggers this error! It would be nice to be able to save the state of the data loader AND not be hobbled by serial data loading.

jdinalt avatar Sep 09 '25 06:09 jdinalt