
[BUG] RuntimeError: inflight params error when using DeepSpeed for Reinforcement Learning

Open shyustc opened this issue 1 year ago • 3 comments

Description

I am encountering a persistent error when using DeepSpeed for reinforcement learning, specifically with the GLM model as both the actor and the critic. The error persists even after upgrading to the latest versions of DeepSpeed and PyTorch Lightning. It occurs when I enable ZeRO stage 3 in DeepSpeed and attempt to generate experiences with the actor model. The exact error message is: "RuntimeError: still have inflight params". Here is the relevant portion of the stack trace:

Log output

   File "/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1736, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1151, in _call_impl
    hook_result = hook(self, input, result)
  File "/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 329, in _end_of_forward_hook
    self.get_param_coordinator(training=False).reset_step()
  File "/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 185, in reset_step
    raise RuntimeError(f"still have inflight params "
RuntimeError: still have inflight params [<bound method Init._convert_to_deepspeed_param.<locals>.ds_summary of Parameter containing:
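
For context, the check that raises here lives in DeepSpeed's partitioned parameter coordinator: reset_step() is invoked by the end-of-forward hook and refuses to proceed while any parameter all-gather it launched is still pending. A simplified paraphrase of the check (the real implementation is in deepspeed/runtime/zero/partitioned_param_coordinator.py):

    # Simplified paraphrase, not verbatim DeepSpeed source: reset_step()
    # verifies that no asynchronous all-gather is still in flight when the
    # forward pass ends.
    def reset_step(self) -> None:
        if len(self.__inflight_param_registry) > 0:
            raise RuntimeError(f"still have inflight params "
                               f"{[p.ds_summary() for p in self.__inflight_param_registry]}")

So the error means the coordinator launched a fetch for some partitioned parameter during generation, but that fetch was never retired before the hook fired.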

ds_report

DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.3, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
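
For completeness, here is a minimal sketch of the kind of setup that triggers this (a stand-in model name and an abbreviated config, not the exact training script):

    # Hypothetical minimal reproduction sketch: a ZeRO-3 engine used for
    # generation, the pattern under which the "inflight params" error fires.
    import deepspeed
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    ds_config = {
        "train_batch_size": 8,
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 3},  # parameters partitioned across ranks
    }

    model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the GLM actor
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

    engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer,
                                           config=ds_config)

    # Experience generation: every forward under ZeRO-3 must all-gather the
    # partitioned weights; the error above is raised if one of those gathers
    # is still pending when the end-of-forward hook resets the coordinator.
    prompt = tokenizer("Hello", return_tensors="pt").input_ids.to(engine.device)
    with torch.no_grad():
        output = engine.module.generate(prompt, max_new_tokens=16)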

shyustc avatar Jun 12 '23 02:06 shyustc

I am facing the same issue.

Can anyone help?

vivek-media avatar Jun 12 '23 08:06 vivek-media

Facing the same issue.

Bill-Orz avatar Jun 13 '23 10:06 Bill-Orz

I was trying RL using trl on a T5 (Seq2Seq) model with PEFT, and I am hitting this issue with ZeRO stage 3. It was working fine with stage 2. Can anyone help me with this? A sketch of the config difference follows.
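
The only knob that changes between the working and failing runs is the ZeRO stage in the DeepSpeed config (other fields omitted):

    # Sketch of the relevant config fragment; stage 3 additionally partitions
    # the parameters themselves, which is where the inflight tracking lives.
    ds_config = {
        "zero_optimization": {
            "stage": 3,  # fails with the inflight error during generation
            # "stage": 2,  # works: params stay whole, only grads/optimizer sharded
        },
    }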

vivek-media avatar Jun 13 '23 12:06 vivek-media

I have the same problem @HeyangQin

ZJXNEFU avatar Jun 19 '23 08:06 ZJXNEFU

One of our recent fixes https://github.com/microsoft/DeepSpeed/pull/3819 should have fixed this issue. It is not included in the pypi release yet, so you need to install DeepSpeed from source to apply the fix. Please let us know if you still see the inflight issue even with the fix.
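
(Until the fix lands on PyPI, installing straight from the GitHub repository, for example with pip install git+https://github.com/microsoft/DeepSpeed.git, is one way to pick it up.)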

HeyangQin avatar Jun 29 '23 18:06 HeyangQin

@HeyangQin The problem remains. https://github.com/microsoft/DeepSpeedExamples/issues/616

Fhujinwu avatar Jun 30 '23 03:06 Fhujinwu

One of our recent fixes #3819 should have fixed this issue. It is not included in the pypi release yet so you need to install deepspeed from source to apply this fix. Please let us know if you still see this inflight issue even with the fix.

It looks like the problem still exists.

JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.10.0+unknown, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.0
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

shyustc avatar Jun 30 '23 12:06 shyustc

One of our recent fixes #3819 should have fixed this issue. It is not included in the pypi release yet, so you need to install DeepSpeed from source to apply the fix. Please let us know if you still see the inflight issue even with the fix.

It looks like the problem still exists.

The branch "HeyangQin/fix_issue_3156" solved the issue: https://github.com/microsoft/DeepSpeed/issues/3156

Fhujinwu avatar Jul 01 '23 01:07 Fhujinwu

@Fhujinwu - can you test with the latest master branch and confirm the above PR fixes this issue?

loadams avatar Jul 06 '23 19:07 loadams

After updating to the latest version, I found that the previous issue has been fixed. However, I still encounter the 'inflight' error later in the run, at different locations.

│   12 │                                                                       │
│   13 │   def wrapped_fn(*args, **kwargs):                                    │
│   14 │   │   get_accelerator().range_push(func.__qualname__)                 │
│ ❱ 15 │   │   ret_val = func(*args, **kwargs)                                 │
│   16 │   │   get_accelerator().range_pop()                                   │
│   17 │   │   return ret_val                                                  │
│   18                                                                         │
│                                                                              │
│ /home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/tor │
│ ch/autograd/grad_mode.py:27 in decorate_context                              │
│                                                                              │
│    24 │   │   @functools.wraps(func)                                         │
│    25 │   │   def decorate_context(*args, **kwargs):                         │
│    26 │   │   │   with self.clone():                                         │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                           │
│    28 │   │   return cast(F, decorate_context)                               │
│    29 │                                                                      │
│    30 │   def _wrap_generator(self, func):                                   │
│                                                                              │
│ /home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/dee │
│ pspeed/runtime/zero/partitioned_param_coordinator.py:303 in fetch_sub_module │
│                                                                              │
│   300 │   │   │   │   │   event.record()                                     │
│   301 │   │   │   │   │   self.__ongoing_fetch_events.append(event)          │
│   302 │   │   │                                                              │
│ ❱ 303 │   │   │   assert param.ds_status == ZeroParamStatus.AVAILABLE, param │
│   304 │   │   get_accelerator().current_stream().wait_stream(self.__allgathe │
│   305 │   │   self.__profiler.stop_event(wait_event_name, wait_numel)        │
│   306                                                                        │
╰──────────────────────────────────────────────────────────────────────────────╯
AssertionError: {'id': 419, 'status': 'INFLIGHT', 'numel': 16777216, 'ds_numel':
16777216, 'shape': (8192, 2048), 'ds_shape': (8192, 2048), 'requires_grad':
True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {425},
'ds_tensor.shape': torch.Size([2097152])}
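
This assertion is a different symptom of the same bookkeeping problem. Under ZeRO-3 each parameter moves through a small status machine, and fetch_sub_module expects the fetch to have completed by the time the submodule runs; here the parameter is still marked INFLIGHT. A sketch of the lifecycle (ZeroParamStatus is the real DeepSpeed enum; the transitions below are a simplified description, not verbatim source):

    # Simplified description of the ZeRO-3 parameter status lifecycle checked
    # by the failing assertion.
    from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus

    # NOT_AVAILABLE -> INFLIGHT      : an asynchronous all-gather was launched
    # INFLIGHT      -> AVAILABLE     : the gather finished and was waited on
    # AVAILABLE     -> NOT_AVAILABLE : the gathered copy was released again
    #
    # The AssertionError above means a submodule started its forward while one
    # of its parameters never made the INFLIGHT -> AVAILABLE transition.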

ds_report

op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/home/jovyan/rlhf0/miniconda3/envs/py38cu113/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.10.0+7e8bcc07, 7e8bcc07, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.0
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

shyustc avatar Jul 08 '23 06:07 shyustc

I updated to the latest version, but when I use ZeRO stage 3, the run hangs after the log line "[2023-07-11 16:01:28,870] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 6.74B parameters".

Here is my command:

    deepspeed --master_port=12356 \
        /data/bill.bi/alpaca-rlhf/alpaca_rlhf/deepspeed_chat/training/step3_rlhf_finetuning/main.py \
        --data_path /data/bill.bi/ShopeeRLHFDataset \
        --data_output_path /data/bill.bi/step3_left_padding/ \
        --actor_model_name_or_path /data/bill.bi/checkpoint-10700 \
        --tokenizer_name_or_path /data/bill.bi/step2/rlhf/critic/checkpoint_epoch_2 \
        --critic_model_name_or_path /data/bill.bi/step2/rlhf/critic/checkpoint_epoch_2 \
        --num_padding_at_beginning 0 \
        --per_device_train_batch_size 4 \
        --per_device_mini_train_batch_size 4 \
        --actor_learning_rate 5e-6 \
        --critic_learning_rate 5e-6 \
        --ppo_epochs 2 \
        --gradient_accumulation_steps 1 \
        --disable_actor_dropout \
        --actor_zero_stage 3 \
        --critic_zero_stage 3 \
        --deepspeed \
        --seed 1234 \
        --critic_gradient_checkpointing \
        --actor_gradient_checkpointing \
        --output_dir /data/bill.bi/step3_left_padding/rlhf \
        --max_prompt_seq_len 2048 \
        --actor_weight_decay 0.1 \
        --critic_weight_decay 0.1
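
(The flags relevant to this thread are --actor_zero_stage 3 and --critic_zero_stage 3, which put both the actor and the critic under ZeRO-3.)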

Bill-Orz avatar Jul 11 '23 08:07 Bill-Orz

Hello @Bill-Orz. We have fixed the hanging issue in https://github.com/microsoft/DeepSpeedExamples/pull/636. Please update to the latest DeepSpeedExamples.

HeyangQin avatar Jul 21 '23 18:07 HeyangQin