DeepSpeed
[BUG] RuntimeError: inflight params error when using DeepSpeed for Reinforcement Learning
Description
I am encountering a persistent error when using DeepSpeed for reinforcement learning, specifically with the GLM model as the actor and critic. The error persists even after upgrading to the latest versions of DeepSpeed and PyTorch Lightning. It occurs when I enable ZeRO-3 in DeepSpeed and attempt to generate experiences with the actor model. The exact error message is: "RuntimeError: still have inflight params". Here is the relevant portion of the stack trace; a rough sketch of the setup follows it.
Log output:
File "/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward
loss = self.module(*inputs, **kwargs)
File "/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1151, in _call_impl
hook_result = hook(self, input, result)
File "/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 329, in _end_of_forward_hook
self.get_param_coordinator(training=False).reset_step()
File "/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 185, in reset_step
raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:
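Roughly, the setup that triggers this looks like the following. This is only a minimal sketch with placeholder model names and config values (gpt2 stands in for the GLM actor), not my exact training script: the actor is wrapped in a ZeRO-3 DeepSpeed engine and experiences are generated through it.

import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder ZeRO-3 config; the real run uses the actor's actual hyperparameters.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 5e-6}},
    "zero_optimization": {"stage": 3},  # the failure only appears with stage 3
    "fp16": {"enabled": True},
}

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder for the GLM actor
actor = AutoModelForCausalLM.from_pretrained("gpt2")

actor_engine, *_ = deepspeed.initialize(model=actor,
                                        model_parameters=actor.parameters(),
                                        config=ds_config)

# Experience generation: rollouts plus a forward pass through the ZeRO-3 engine.
prompts = tokenizer(["hello"], return_tensors="pt").to(actor_engine.device)
with torch.no_grad():
    sequences = actor_engine.module.generate(**prompts, max_new_tokens=16)
    logits = actor_engine(sequences).logits  # the end-of-forward hook fires here
# On DeepSpeed 0.9.3 this is where "RuntimeError: still have inflight params" is raised.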
ds_report
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.3, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
I am facing the same issue.
Can anyone help?
Facing the same issue.
I was trying RL using trl on a T5 (Seq2Seq) model with PEFT, and I am facing this issue with ZeRO stage 3. It was working fine with stage 2. Can anyone help me with this?
I have the same problem @HeyangQin
One of our recent fixes https://github.com/microsoft/DeepSpeed/pull/3819 should have fixed this issue. It is not included in the PyPI release yet, so you need to install DeepSpeed from source to apply this fix. Please let us know if you still see this inflight issue even with the fix.
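For anyone applying this before the next PyPI release: installing from source typically means cloning https://github.com/microsoft/DeepSpeed and running pip install . inside the checkout. A quick sanity check that the source build is actually the one being imported (the version string below is only an example):

import deepspeed

# A PyPI wheel reports a plain version such as "0.9.3"; a source build from master
# carries the commit hash, e.g. "0.10.0+7e8bcc07" (ds_report prints the same string
# on its "deepspeed info" line).
print(deepspeed.__version__)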
@HeyangQin Problems remain. https://github.com/microsoft/DeepSpeedExamples/issues/616
One of our recent fixes #3819 should have fixed this issue. It is not included in the pypi release yet so you need to install deepspeed from source to apply this fix. Please let us know if you still see this inflight issue even with the fix.
It looks like the problem still exists.
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.10.0+unknown, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.0
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
The branch "HeyangQin/fix_issue_3156" solved the issue: https://github.com/microsoft/DeepSpeed/issues/3156.
@Fhujinwu - can you test with the latest master branch and confirm the above PR fixes this issue?
After updating to the latest version, I found that the previous issue has been fixed. However, I still encounter the 'inflight' error at different locations later in the code.
File "/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 303, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param
AssertionError: {'id': 419, 'status': 'INFLIGHT', 'numel': 16777216, 'ds_numel': 16777216, 'shape': (8192, 2048), 'ds_shape': (8192, 2048), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {425}, 'ds_tensor.shape': torch.Size([2097152])}
ds_report
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.10.0+7e8bcc07, 7e8bcc07, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.0
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
I updated to the latest version, but when I use ZeRO stage 3, the run hangs after the log line "[2023-07-11 16:01:28,870] [INFO] [partition_parameters.py:326:exit] finished initializing model with 6.74B parameters".
Here is my command:
deepspeed --master_port=12356 /data/bill.bi/alpaca-rlhf/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py \
  --data_path /data/bill.bi/ShopeeRLHFDataset \
  --data_output_path /data/bill.bi/step3_left_padding/ \
  --actor_model_name_or_path /data/bill.bi/checkpoint-10700 \
  --tokenizer_name_or_path /data/bill.bi/step2/rlhf/critic/checkpoint_epoch_2 \
  --critic_model_name_or_path /data/bill.bi/step2/rlhf/critic/checkpoint_epoch_2 \
  --num_padding_at_beginning 0 \
  --per_device_train_batch_size 4 \
  --per_device_mini_train_batch_size 4 \
  --actor_learning_rate 5e-6 \
  --critic_learning_rate 5e-6 \
  --ppo_epochs 2 \
  --gradient_accumulation_steps 1 \
  --disable_actor_dropout \
  --actor_zero_stage 3 \
  --critic_zero_stage 3 \
  --deepspeed \
  --seed 1234 \
  --critic_gradient_checkpointing \
  --actor_gradient_checkpointing \
  --output_dir /data/bill.bi/step3_left_padding/rlhf \
  --max_prompt_seq_len 2048 \
  --actor_weight_decay 0.1 \
  --critic_weight_decay 0.1
Hello @Bill-Orz. We have fixed the hanging issue in https://github.com/microsoft/DeepSpeedExamples/pull/636. Please update to the latest DeepSpeedExamples.