DeepSpeed
[BUG] Running step 3 with bloomz + LoRA + ZeRO-3 raises RuntimeError(f"{param.ds_summary()} already in registry")
Describe the bug
Running step 3 with ZeRO stage 3 and LoRA enabled for both the actor and critic models raises the error below. It seems to indicate that bloomz does not support ZeRO-3 + LoRA.
Log output
Traceback (most recent call last):
File "DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 630, in <module>
main()
File "DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 477, in main
out = trainer.generate_experience(batch_prompt['prompt'],
File "DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 108, in generate_experience
output = self.actor_model(seq, attention_mask=attention_mask)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "DeepSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "DeepSpeed/deepspeed/runtime/engine.py", line 1695, in forward
loss = self.module(*inputs, **kwargs)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
result = forward_call(*input, **kwargs)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 913, in forward
transformer_outputs = self.transformer(
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
result = forward_call(*input, **kwargs)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/transformers/models/bloom/modeling_bloom.py", line 730, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1197, in _call_impl
result = hook(self, input)
File "DeepSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "DeepSpeed/deepspeed/runtime/zero/parameter_offload.py", line 366, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "DeepSpeed/deepspeed/runtime/zero/parameter_offload.py", line 478, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module)
File "DeepSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "DeepSpeedExamples/dcv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 249, in fetch_sub_module
self.__all_gather_params(params_to_fetch)
File "DeepSpeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 383, in __all_gather_params
self.__inflight_param_registry[param] = handle
File "DeepSpeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 51, in __setitem__
raise RuntimeError(f"{param.ds_summary()} already in registry")
RuntimeError: {'id': 0, 'status': 'INFLIGHT', 'numel': 1027604480, 'ds_numel': 1027604480, 'shape': (250880, 4096), 'ds_shape': (250880, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set()} already in registry
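The last two frames show the failure mode: the coordinator keeps a registry of parameters whose all-gather is still in flight, and the registry's `__setitem__` refuses a second entry for the same parameter. A minimal sketch of that guard, assuming a plain dict subclass (the class name `InflightRegistry` and the string keys and handles are illustrative, not DeepSpeed's actual implementation):

```python
class InflightRegistry(dict):
    """Toy stand-in for a registry of in-flight parameter fetches."""

    def __setitem__(self, param, handle):
        # Registering the same parameter twice means a second fetch was
        # issued before the first one was waited on and released.
        if param in self:
            raise RuntimeError(f"{param} already in registry")
        super().__setitem__(param, handle)


registry = InflightRegistry()
registry["word_embeddings.weight"] = "handle-1"  # first fetch: accepted

try:
    registry["word_embeddings.weight"] = "handle-2"  # second fetch: rejected
except RuntimeError as err:
    message = str(err)
```

In the log above the rejected parameter has shape (250880, 4096) and is fetched from the pre-forward hook of `self.word_embeddings`, so the duplicate fetch appears to involve the embedding weight.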
To Reproduce
The run.sh is:
sh training_scripts/single_node/run_bloom_1b7.sh \
bigscience/bloomz-1b7 \
bigscience/bloomz-1b7 \
3 \
3 \
output_single_node_bloomz1b7
The run_bloom_1b7.sh is:
#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0
# DeepSpeed Team
ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
ACTOR_ZERO_STAGE=${3:-2}
CRITIC_ZERO_STAGE=${4:-2}
OUTPUT=${5:-'./output'}
NUM_GPUS=${6:-8}
NUM_NODES=${7:-1}
mkdir -p $OUTPUT
Num_Padding_at_Beginning=0 # this is model related
Actor_Lr=9.65e-6
Critic_Lr=5e-6
hostname='localhost'
export NCCL_SOCKET_IFNAME=eth
export NCCL_DEBUG=INFO
export TOKENIZERS_PARALLELISM=false
deepspeed --master_port 25303 --master_addr ${hostname} --num_gpus ${NUM_GPUS} --num_nodes ${NUM_NODES} --hostfile 'deepspeed_hostfile' main.py \
--data_path Dahoas/rm-static \
--data_split 2,4,4 \
--actor_model_name_or_path $ACTOR_MODEL_PATH \
--critic_model_name_or_path $CRITIC_MODEL_PATH \
--num_padding_at_beginning $Num_Padding_at_Beginning \
--per_device_train_batch_size 1 \
--per_device_mini_train_batch_size 1 \
--generation_batch_numbers 1 \
--ppo_epochs 1 \
--max_answer_seq_len 256 \
--max_prompt_seq_len 256 \
--actor_learning_rate ${Actor_Lr} \
--critic_learning_rate ${Critic_Lr} \
--disable_actor_dropout \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--gradient_accumulation_steps 1 \
--num_warmup_steps 100 \
--deepspeed --seed 1234 \
--inference_tp_size 1 \
--tp_gather_partition_size ${NUM_GPUS} \
--actor_zero_stage $ACTOR_ZERO_STAGE \
--critic_zero_stage $CRITIC_ZERO_STAGE \
--actor_lora_dim 128 \
--actor_lora_module_name query_key_value \
--critic_lora_dim 128 \
--critic_lora_module_name query_key_value \
--only_optimize_lora \
--output_dir $OUTPUT |& tee $OUTPUT/training.log
Expected behavior
Step 3 training should run with ZeRO-3 + LoRA.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/venv/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/usr/local/venv/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.3+194053b, 194053b, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
Screenshots
None. The error is shown in the log output above.
System info (please complete the following information):
- OS: Linux version 4.18.0-240.el8.x86_64. CentOS Linux 7 (Core).
- GPU count and types: one machine with x8 A100s each
- Python version: 3.9.13
Docker context no
Additional context no
@cmikeh2 @jeffra @lekurile @awan-10
Hello @liuaiting. Thank you for reporting this issue to us. One of our recent fixes https://github.com/microsoft/DeepSpeed/pull/3462 may have already fixed this error. Could you update your deepspeed and give it another try?
After updating DeepSpeed, it runs successfully. Thank you very much for your reply.
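When checking whether an update actually took effect, it can help to compare version strings programmatically. The sketch below is only an illustration: the helper names `release_tuple` and `installed_deepspeed` are hypothetical, and the thread does not say which release first contains #3462, so no threshold version is assumed here.

```python
import re
from importlib import metadata


def release_tuple(version):
    """Extract the numeric release part, e.g. '0.9.3+194053b' -> (0, 9, 3)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def installed_deepspeed():
    """Return the installed DeepSpeed version string, or None if absent."""
    try:
        return metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        return None


reported = release_tuple("0.9.3+194053b")  # build shown in the ds_report above
later = release_tuple("0.10.3")            # version mentioned later in this thread
is_older = reported < later                # tuples compare element by element
```

Note that a source build such as `0.9.3+194053b` carries a commit suffix, so two installs with the same release number may still differ in which fixes they include.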
@liuaiting Glad to hear the error is fixed. Closing the issue
@HeyangQin I still encounter this with DeepSpeed 0.10.3 when running step 3 with llama2 + LoRA + ZeRO-3 on V100 (32 GB) GPUs.
anaconda3.9/envs/dschat/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__
raise RuntimeError(f"{param.ds_summary()} already in registry")
RuntimeError: {'id': 0, 'status': 'INFLIGHT', 'numel': 262144000, 'ds_numel': 262144000, 'shape': (64000, 4096), 'ds_shape': (64000, 4096), 'requires_grad': False, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([65536000])} already in registry
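The summary dict in the error message can be sanity-checked mechanically. The sketch below parses it, confirms that `numel` equals the product of `shape`, and divides `numel` by the local shard length in `ds_tensor.shape`. The function name `partition_info` is hypothetical, and reading the quotient as the number of partitions assumes equal, unpadded shards.

```python
import ast
import math
import re


def partition_info(summary_text):
    """Return (numel, shape_matches_numel, inferred_partitions)."""
    # torch.Size([...]) is not a Python literal; rewrite it as a plain list.
    literal = re.sub(r"torch\.Size\((\[[^\]]*\])\)", r"\1", summary_text)
    # Replace the empty-set repr so parsing works on older interpreters too.
    literal = literal.replace("set()", "None")
    summary = ast.literal_eval(literal)

    numel = summary["numel"]
    shape_matches = math.prod(summary["shape"]) == numel
    shard = summary.get("ds_tensor.shape")
    partitions = numel // shard[0] if shard else None
    return numel, shape_matches, partitions


llama_summary = (
    "{'id': 0, 'status': 'INFLIGHT', 'numel': 262144000, 'ds_numel': 262144000, "
    "'shape': (64000, 4096), 'ds_shape': (64000, 4096), 'requires_grad': False, "
    "'grad_shape': None, 'persist': False, 'active_sub_modules': set(), "
    "'ds_tensor.shape': torch.Size([65536000])}"
)
info = partition_info(llama_summary)
```

For the message above, 64000 × 4096 = 262144000 matches `numel`, and 262144000 / 65536000 = 4, which under the stated assumption suggests the parameter is split four ways.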
Even though my local copy of the repository is up to date, I am still encountering this error. The log is below; its last line shows the command I ran with all the options.
Epoch: 0 | Step: 75 | PPO Epoch: 1 | Actor Loss: 0.05474853515625 | Critic Loss: 0.0821533203125 | Unsupervised Loss: 0.0
End-to-End => Latency: 76.57s, TFLOPs: 0.72, Samples/sec: 0.10, Time/seq 9.57s, Batch Size: 8, Total Seq. Length: 512
Generation => Latency: 73.24s, Per-token Latency 286.11 ms, TFLOPs: 0.18, BW: 93.15 GB/sec, Answer Seq. Length: 256
Training => Latency: 3.33s, TFLOPs: 12.65
Actor Model Parameters => 13.325 B, Critic Model Parameters => 0.331 B
Average reward score: -1.51953125
Invalidate trace cache @ step 55440: expected module 0, but got module 13
Traceback (most recent call last):
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 632, in <module>
main()
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 492, in main
out = trainer.generate_experience(batch_prompt['prompt'],
File "/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 129, in generate_experience
output = self.actor_model(seq, attention_mask=attention_mask)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1801, in forward
loss = self.module(*inputs, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
outputs = self.model.decoder(
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 653, in forward
pos_embeds = self.embed_positions(attention_mask, past_key_values_length)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
result = hook(self, args)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
self.__all_gather_params(params_to_fetch, forward)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 453, in __all_gather_params_
self.__inflight_param_registry[param] = handle
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__
raise RuntimeError(f"{param.ds_summary()} already in registry")
RuntimeError: {'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in setitem
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
raise RuntimeError(f"{param.ds_summary()} already in registry")RuntimeError
: raise RuntimeError(f"{param.ds_summary()} already in registry")self.__inflight_param_registry[param] = handle{'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry
RuntimeError
: RuntimeError File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__
self.pre_sub_module_forward_function(module){'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry:
{'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
raise RuntimeError(f"{param.ds_summary()} already in registry")
param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
self.__all_gather_params(params_to_fetch, forward)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
self._all_gather_params(nonquantized_params, forward, quantize=self.zero_quantized_weights)
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 453, in _all_gather_params
self.__inflight_param_registry[param] = handle
File "/home/user1/venv/ds/lib/python3.10/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 52, in __setitem__
raise RuntimeError(f"{param.ds_summary()} already in registry")
RuntimeError: {'id': 1, 'status': 'INFLIGHT', 'numel': 10496000, 'ds_numel': 10496000, 'shape': (2050, 5120), 'ds_shape': (2050, 5120), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([1312000])} already in registry
[2023-09-15 10:36:50,504] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907797
[2023-09-15 10:36:50,546] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907798
[2023-09-15 10:36:50,547] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907799
[2023-09-15 10:36:51,115] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907800
[2023-09-15 10:36:51,443] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907801
[2023-09-15 10:36:52,095] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907802
[2023-09-15 10:36:52,138] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907803
[2023-09-15 10:36:52,178] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2907804
[2023-09-15 10:36:52,218] [ERROR] [launch.py:321:sigkill_handler] ['/home/user1/venv/ds/bin/python3', '-u', 'main.py', '--local_rank=7', '--data_path', 'Dahoas/rm-static', '--data_split', '2,4,4', '--actor_model_name_or_path', '/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b', '--critic_model_name_or_path', '/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m', '--num_padding_at_beginning', '1', '--per_device_generation_batch_size', '1', '--per_device_training_batch_size', '1', '--generation_batches', '1', '--ppo_epochs', '1', '--max_answer_seq_len', '256', '--max_prompt_seq_len', '256', '--actor_learning_rate', '5e-4', '--critic_learning_rate', '5e-6', '--num_train_epochs', '1', '--lr_scheduler_type', 'cosine', '--offload_reference_model', '--gradient_accumulation_steps', '1', '--actor_gradient_checkpointing', '--critic_gradient_checkpointing', '--num_warmup_steps', '100', '--deepspeed', '--seed', '1234', '--inference_tp_size', '2', '--actor_zero_stage', '3', '--critic_zero_stage', '3', '--disable_actor_dropout', '--actor_lora_dim', '128', '--actor_lora_module_name', 'decoder.layers.', '--output_dir', '/home/user1/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/13b'] exits with return code = 1