[BUG] ZeRO 3 error: expected the next 4 parameters in the parameter fetch queue to be ... but got ()
Describe the bug
(previously posted here)
I'm using Hugging Face to train a custom model on 2 NVIDIA RTX A5000 GPUs with ZeRO stage 3 and parameter offloading to CPU. Everything works fine when training for the first time, but when resuming from a checkpoint (resume_from_checkpoint=/path/to/checkpoint in Hugging Face), after a while I get the following error (complete log in error.txt):
[2023-05-23 14:02:25,781] [INFO] [logging.py:96:log_dist] [Rank 0] step=14290, skipped=17, lr=[0.00014992267618019753], mom=[(0.9, 0.999)]
[2023-05-23 14:02:25,783] [INFO] [timer.py:199:stop] epoch=0/micro_step=2070/global_step=2070, RunningAvgSamplesPerSec=8.340844178398823, CurrSamplesPerSec=8.091999012978865, MemAllocated=0.4GB, MaxMemAllocated=19.03GB
{'loss': 1.0438, 'learning_rate': 0.00014992267618019753, 'epoch': 3.68}
[2023-05-23 14:02:36,757] [INFO] [loss_scaler.py:188:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, but hysteresis is 2. Reducing hysteresis to 1
5%|▍ | 14287/305600 [3:34:27<454:15:14, 5.61s/it]
5%|▍ | 14288/305600 [3:34:33<467:44:45, 5.78s/it]
5%|▍ | 14289/305600 [3:34:38<455:08:12, 5.62s/it]
5%|▍ | 14290/305600 [3:34:43<443:40:08, 5.48s/it]
5%|▍ | 14290/305600 [3:34:43<443:40:08, 5.48s/it]
5%|▍ | 14291/305600 [3:34:49<448:35:16, 5.54s/it]
5%|▍ | 14292/305600 [3:34:54<442:30:06, 5.47s/it]
Traceback (most recent call last):
File "/mnt/beegfs/scratch/dcaffagni/runs/clpt_gpu_2_lr_154_cos_10k_wu/maticad_side/train.py", line 96, in <module>
train_out = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/transformers/trainer.py", line 1633, in train
return inner_training_loop(
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/transformers/trainer.py", line 2661, in training_step
loss = self.deepspeed.backward(loss)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1796, in backward
self.optimizer.backward(loss, retain_graph=retain_graph)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 1923, in backward
self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 62, in backward
scaled_loss.backward(retain_graph=retain_graph)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 169, in backward
ctx.pre_backward_function(ctx.module)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 419, in _run_before_backward_function
self.pre_sub_module_backward_function(sub_module)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 500, in pre_sub_module_backward_function
param_coordinator.fetch_sub_module(sub_module)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 288, in fetch_sub_module
raise RuntimeError(
RuntimeError: tracing error at step 999:
module id: 921, training: True
expected the next 4 parameters in the parameter fetch queue to be ({'id': 'name=attn_pool.k_proj.bias id=915', 'status': 'AVAILABLE', 'numel': 512, 'ds_numel': 512, 'shape': (512,), 'ds_shape': (512,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {921}}, {'id': 'name=attn_pool.v_proj.bias id=919', 'status': 'AVAILABLE', 'numel': 512, 'ds_numel': 512, 'shape': (512,), 'ds_shape': (512,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {921}}, {'id': 'name=attn_pool.c_proj.bias id=921', 'status': 'AVAILABLE', 'numel': 512, 'ds_numel': 512, 'shape': (512,), 'ds_shape': (512,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {921}}, {'id': 'name=attn_pool.q_proj.bias id=917', 'status': 'AVAILABLE', 'numel': 512, 'ds_numel': 512, 'shape': (512,), 'ds_shape': (512,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {921}})
but got
().
I'm also attaching the DeepSpeed config file (config_adam_zero3.txt) and the model implementation (modeling_custom_apr.txt). Curiously, the 4 parameters causing trouble are the biases of my custom attention pooling layer (included in modules.txt). I used the very same module before, with and without ZeRO stage 2, and everything worked fine.
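For context, the module those parameters belong to is shaped roughly like the sketch below (reconstructed from the parameter names and shapes in the error message; the actual implementation is in the attached modules.txt):

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    # Simplified stand-in: the projection names and bias shapes match the ones
    # reported by the tracing error (q_proj/k_proj/v_proj/c_proj, bias of size 512).
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.c_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim); attend from the mean token to pool the sequence
        q = self.q_proj(x.mean(dim=1, keepdim=True))
        k, v = self.k_proj(x), self.v_proj(x)
        pooled = F.scaled_dot_product_attention(q, k, v)  # single-head pooling for brevity
        return self.c_proj(pooled).squeeze(1)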
To Reproduce
Unfortunately, I'm struggling to put together a reproducible script, as the error happens suddenly during training with ZeRO stage 3 enabled, and I'm using a custom dataset.
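The relevant wiring in train.py looks roughly like the skeleton below (heavily simplified; ToyModel and ToyDataset are placeholders for my custom model and dataset, and the config filename is illustrative):

import torch
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments

class ToyDataset(Dataset):
    # placeholder for my custom dataset
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        return {"x": torch.randn(4), "labels": torch.tensor(0)}

class ToyModel(torch.nn.Module):
    # placeholder for the custom model (the real one is in modeling_custom_apr.txt)
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(4, 2)
    def forward(self, x, labels=None):
        logits = self.proj(x)
        loss = torch.nn.functional.cross_entropy(logits, labels)
        return {"loss": loss, "logits": logits}

training_args = TrainingArguments(
    output_dir="out",
    fp16=True,
    deepspeed="config_adam_zero3.json",            # the attached ZeRO-3 + CPU offload config (filename illustrative)
    resume_from_checkpoint="/path/to/checkpoint",  # set only when resuming
)
trainer = Trainer(model=ToyModel(), args=training_args, train_dataset=ToyDataset())
train_out = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)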
Expected behavior
I should be able to resume training safely.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda-11.7'
DeepSpeed general environment info:
torch install path ............... ['/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/homes/dcaffagni/.conda/envs/glpn_hf/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
Screenshots N/A
System info:
- transformers version: 4.27.4
- Platform: Linux-5.4.0-107-generic-x86_64-with-glibc2.31
- Python version: 3.9.16
- Huggingface_hub version: 0.13.3
- PyTorch version (GPU?): 2.0.0+cu117
- 2 NVIDIA RTX A5000 GPUs on the same host
Launcher context
I'm launching using SLURM
srun --exclusive torchrun --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} --nproc_per_node=${WORLD_SIZE} train.py ...
Docker context N/A
Additional context N/A
Attachments: modeling_custom_apr.txt, config_adam_zero3.txt, error.txt, modules.txt
It may be worth noting that the error happens right after the first detected OVERFLOW in the run. However, multiple overflows occurred during the previous 24h of training (before resuming from the checkpoint).
I'm able to reproduce the error by resuming from a checkpoint with Hugging Face's Trainer API (resume_from_checkpoint) and simulating an overflow with
# self.test_of is True for the first forward pass
if self.test_of:
    # loss will be inf in fp16
    loss = loss * (2**16 - 1)
    self.test_of = False
at the end of my model's forward function. The update step is skipped, and then at the next step the error occurs as soon as deepspeed.backward is called.
On the other hand, if I repeat the same experiment starting from scratch rather than from a checkpoint, everything works fine with the usual dynamic loss scaling.
Thanks for the exploration! So the core cause is the overflow. Is this an inherent problem with fp16 precision? Training from scratch is too expensive for me. I've also tried resuming training from a checkpoint, and it collapses at the same step. I also tried changing the training dynamics by switching the batch size, but it doesn't help.
I'm on a different project now, and I'm experiencing the same error even when training from scratch rather than resuming from a checkpoint. Again, the error shows up right after an overflow. Switching from fp16 to bf16 fixes it.
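For anyone hitting the same problem, the change amounted to roughly the following (a sketch, not my exact config; the precise keys depend on your setup):

from transformers import TrainingArguments

# DeepSpeed side: swap the fp16 block for a bf16 block; the rest of the
# ZeRO-3 config stays as before (keys below are illustrative).
ds_config = {
    "bf16": {"enabled": True},  # replaces "fp16": {"enabled": True, ...}
    "zero_optimization": {"stage": 3, "offload_param": {"device": "cpu"}},
}

# Hugging Face side: flip the matching TrainingArguments flags.
training_args = TrainingArguments(output_dir="out", bf16=True, fp16=False, deepspeed=ds_config)

Since bf16 has the same exponent range as fp32, no dynamic loss scaling is involved, which fits the observation that the failures always show up right after an overflow/skipped step.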