
[BUG] ZeRO 3 error: expected the next 4 parameters in the parameter fetch queue to be ... but got ()

Open dcaffo98 opened this issue 2 years ago • 5 comments

Describe the bug (previously posted here) I'm using the Hugging Face Trainer to train a custom model on 2 NVIDIA RTX A5000 GPUs with ZeRO stage 3 and parameter offloading to CPU. Everything works fine when training from scratch, but when resuming from a checkpoint (resume_from_checkpoint=/path/to/checkpoint in the Trainer), after a while I get the following error (complete log in error.txt):

  [2023-05-23 14:02:25,781] [INFO] [logging.py:96:log_dist] [Rank 0] step=14290, skipped=17, lr=[0.00014992267618019753], mom=[(0.9, 0.999)]
[2023-05-23 14:02:25,783] [INFO] [timer.py:199:stop] epoch=0/micro_step=2070/global_step=2070, RunningAvgSamplesPerSec=8.340844178398823, CurrSamplesPerSec=8.091999012978865, MemAllocated=0.4GB, MaxMemAllocated=19.03GB
{'loss': 1.0438, 'learning_rate': 0.00014992267618019753, 'epoch': 3.68}
[2023-05-23 14:02:36,757] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, but hysteresis is 2. Reducing hysteresis to 1
  5%|▍         | 14287/305600 [3:34:27<454:15:14,  5.61s/it]
  5%|▍         | 14288/305600 [3:34:33<467:44:45,  5.78s/it]
  5%|▍         | 14289/305600 [3:34:38<455:08:12,  5.62s/it]
  5%|▍         | 14290/305600 [3:34:43<443:40:08,  5.48s/it]
  5%|▍         | 14291/305600 [3:34:49<448:35:16,  5.54s/it]
  5%|▍         | 14292/305600 [3:34:54<442:30:06,  5.47s/it]Traceback (most recent call last):
  File "/mnt/beegfs/scratch/dcaffagni/runs/clpt_gpu_2_lr_154_cos_10k_wu/maticad_side/train.py", line 96, in <module>
    train_out = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/transformers/trainer.py", line 2661, in training_step
    loss = self.deepspeed.backward(loss)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1796, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1923, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 62, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 169, in backward
    ctx.pre_backward_function(ctx.module)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 419, in _run_before_backward_function
    self.pre_sub_module_backward_function(sub_module)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 500, in pre_sub_module_backward_function
    param_coordinator.fetch_sub_module(sub_module)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 288, in fetch_sub_module
    raise RuntimeError(
RuntimeError: tracing error at step 999: 
module id: 921, training: True
expected the next 4 parameters in the parameter fetch queue to be ({'id': 'name=attn_pool.k_proj.bias id=915', 'status': 'AVAILABLE', 'numel': 512, 'ds_numel': 512, 'shape': (512,), 'ds_shape': (512,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {921}}, {'id': 'name=attn_pool.v_proj.bias id=919', 'status': 'AVAILABLE', 'numel': 512, 'ds_numel': 512, 'shape': (512,), 'ds_shape': (512,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {921}}, {'id': 'name=attn_pool.c_proj.bias id=921', 'status': 'AVAILABLE', 'numel': 512, 'ds_numel': 512, 'shape': (512,), 'ds_shape': (512,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {921}}, {'id': 'name=attn_pool.q_proj.bias id=917', 'status': 'AVAILABLE', 'numel': 512, 'ds_numel': 512, 'shape': (512,), 'ds_shape': (512,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {921}}) 
but got 
 ().

I'm also attaching the DeepSpeed config file (config_adam_zero3.txt) and the model implementation (modeling_custom_apr.txt). Curiously, the 4 parameters causing trouble are the biases of my custom attention pooling layer (included in modules.txt). I have used this very same module before, both with and without ZeRO stage 2, and everything worked fine.
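
For readers who don't want to open the attachment, below is a rough sketch of the kind of ZeRO-3 + CPU offload configuration I'm using (illustrative only; the exact values are in config_adam_zero3.txt). With the Hugging Face integration, the dict can be passed directly as TrainingArguments(deepspeed=ds_config).

```python
# Illustrative sketch only; the actual settings are in the attached config_adam_zero3.txt.
ds_config = {
    "fp16": {"enabled": True, "initial_scale_power": 16, "hysteresis": 2},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},      # params offloaded to CPU
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "optimizer": {"type": "Adam", "params": {"lr": "auto"}},
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# With the Hugging Face Trainer:
#   TrainingArguments(..., fp16=True, deepspeed=ds_config)
```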

To Reproduce Unfortunately, I'm struggling to put together a reproducible script, as the error happens suddenly during training with ZeRO stage 3 enabled and I'm using a custom dataset.

Expected behavior I should be able to resume training safely.

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-11.7'
DeepSpeed general environment info:
torch install path ............... ['/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7


System info (please complete the following information):

  • transformers version: 4.27.4
  • Platform: Linux-5.4.0-107-generic-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • Huggingface_hub version: 0.13.3
  • PyTorch version (GPU?): 2.0.0+cu117
  • 2 NVIDIA RTX A5000 GPUs on the same host

Launcher context I'm launching with SLURM: srun --exclusive torchrun --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --nproc_per_node=${WORLD_SIZE} train.py ...

Docker context N/A

Additional context N/A

Attachments: modeling_custom_apr.txt (https://github.com/huggingface/transformers/files/11545331/modeling_custom_apr.txt), config_adam_zero3.txt, error.txt, modules.txt

dcaffo98 avatar May 23 '23 21:05 dcaffo98

It may be worth noting that the error happens right after the first detected OVERFLOW in the run. However, multiple overflows occurred during the previous 24h of training (before resuming from the checkpoint).
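
For context, this is roughly what the dynamic loss scaler does when an overflow is detected (a simplified sketch, not DeepSpeed's actual code; the real logic lives in deepspeed/runtime/fp16/loss_scaler.py, which also appears in the traceback above). It matches the log line "Attempted loss scale: 32768, but hysteresis is 2. Reducing hysteresis to 1".

```python
# Simplified sketch of dynamic loss scaling, not DeepSpeed's actual implementation
# (see deepspeed/runtime/fp16/loss_scaler.py for the real one).
def update_scale(scale, hysteresis, overflow, steps_since_growth, growth_interval=1000):
    if overflow:
        if hysteresis > 1:
            hysteresis -= 1          # tolerate a couple of consecutive overflows first
        else:
            scale /= 2               # then back off the loss scale
        return scale, hysteresis, 0  # the optimizer step is skipped either way
    steps_since_growth += 1
    if steps_since_growth >= growth_interval:
        scale *= 2                   # slowly grow the scale back after a clean stretch
        steps_since_growth = 0
    return scale, hysteresis, steps_since_growth
```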

dcaffo98 avatar May 24 '23 11:05 dcaffo98

I'm able to reproduce the error by resuming from a checkpoint with the Hugging Face Trainer API (resume_from_checkpoint) and simulating an overflow with

# self.test_of is True for the first forward pass 
if self.test_of:
    # loss will be inf in fp16
    loss = loss * (2**16 - 1)
    self.test_of = False

at the end of my model's forward function. The update step is skipped, and at the next step the error occurs as soon as deepspeed.backward is called.
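
To see why this tiny change is enough to trip the overflow detection, here is a minimal standalone check (not part of the training code): the inflated loss simply exceeds the fp16 range.

```python
import torch

# fp16 tops out at ~65504, so inflating the loss by 2**16 - 1 produces inf,
# which is exactly what the overflow check looks for.
print(torch.finfo(torch.float16).max)        # 65504.0

loss = torch.tensor(2.0, dtype=torch.float16)
print(loss * (2 ** 16 - 1))                  # tensor(inf, dtype=torch.float16)
```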

On the other hand, if I repeat the same experiment starting from scratch rather than from a checkpoint, everything works fine with the usual dynamic loss scaling.

dcaffo98 avatar May 25 '23 14:05 dcaffo98

Thanks for the exploration! So the root cause is the overflow. Is it an inherent problem with fp16 precision? Training from scratch is too expensive for me. I've also tried resuming training from a checkpoint, and it collapses at the same step. I also tried changing the training dynamics by switching the batch size, but it doesn't help.

xjtupanda avatar Feb 01 '24 03:02 xjtupanda

I'm on a different project now, and I'm experiencing the same error even when training from scratch rather than resuming from a checkpoint. Again, the error pops up after an overflow. Switching from fp16 to bf16 fixes it.
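
For anyone else hitting this, the change is just a precision switch in the config (a sketch assuming the usual Hugging Face + DeepSpeed setup and the ds_config dict sketched above; bf16 needs an Ampere-or-newer GPU, which the RTX A5000 is):

```python
# Sketch of the fp16 -> bf16 switch; adapt to your own DeepSpeed config dict.
ds_config["fp16"] = {"enabled": False}
ds_config["bf16"] = {"enabled": True}   # bf16 has fp32's exponent range, so no loss scaling or skipped steps

# With the Hugging Face Trainer, set the matching flags as well:
#   TrainingArguments(..., bf16=True, fp16=False)
```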

dcaffo98 avatar May 09 '24 10:05 dcaffo98