DeepSpeed
self.client_module.attn.q_proj.weight.shape[1] returns IndexError: tuple index out of range
Describe the bug

I am getting the following error while attempting to run DeepSpeed-Chat step 3 with the actor model CarperAI/openai_summarize_tldr_sft (GPT-J 6B), the critic model CarperAI/openai_summarize_tldr_rm_checkpoint (GPT-J 6B), and ZeRO stage 3.
```
Traceback (most recent call last):
  File "main.py", line 523, in <module>
    main()
  File "main.py", line 394, in main
    rlhf_engine = DeepSpeedRLHFEngine(
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step3_rlhf_finetuning/rlhf_engine.py", line 49, in __init__
    self.actor = self._init_actor(actor_model_name_or_path=actor_model_name_or_path)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step3_rlhf_finetuning/rlhf_engine.py", line 115, in _init_actor
    actor_engine, *_ = deepspeed.initialize(model=actor_model,
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/__init__.py", line 144, in initialize
    engine = DeepSpeedHybridEngine(args=args,
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 52, in __init__
    self.create_inference_module()
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 326, in create_inference_module
    self.create_inference_containers(self.module)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 296, in create_inference_containers
    self.create_inference_containers(child, layer_id=layer_id)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 296, in create_inference_containers
    self.create_inference_containers(child, layer_id=layer_id)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 276, in create_inference_containers
    self._inference_containers.append(self.inference_policies[child.__class__][0](
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 99, in new_inference_container
    _container.create_ds_model_config()
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/module_inject/containers/base.py", line 79, in create_ds_model_config
    self.set_hidden_heads(*self.policy.get_hidden_heads())
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/module_inject/containers/gptj.py", line 73, in get_hidden_heads
    return self.client_module.attn.q_proj.weight.shape[1], \
IndexError: tuple index out of range
```
Adding `print(self.client_module.attn.q_proj.weight)` and `print(self.client_module.attn.q_proj.weight.shape)` right above the `return self.client_module.attn.q_proj.weight.shape[1]` line prints:

```
Parameter containing: tensor([], device='cuda:0', dtype=torch.float16, requires_grad=True)
torch.Size([0])
```

It seems that the model's parameters are missing during the initialization of the DeepSpeed engine.
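For context, here is a minimal sketch (my own illustration, not code from DeepSpeed-Chat; `model` and the module path are placeholders mirroring the traceback) of how ZeRO stage 3 empties parameters and how their real shapes can still be read:

```python
import deepspeed

# Under ZeRO stage 3, each parameter is partitioned across ranks and the
# local tensor is replaced with an empty placeholder, which matches the
# tensor([]) / torch.Size([0]) output above.
param = model.attn.q_proj.weight  # placeholder path, as in the traceback

print(param.shape)     # torch.Size([0]) while the parameter is partitioned
print(param.ds_shape)  # ZeRO-3 records the original, unpartitioned shape here

# Temporarily gathering the parameter restores the full tensor on each rank:
with deepspeed.zero.GatheredParameters(param):
    print(param.shape)  # full shape again inside the context
```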
ds_report output
```
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/torch']
torch version .................... 1.10.0+cu113
deepspeed install path ........... ['/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed']
deepspeed info ................... 0.9.1+cc67f22f, cc67f22f, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.3
```
System info (please complete the following information):
- OS: Ubuntu 18.04
- GPU count and types: single node 8*A100
- Deepspeed version: 0.9.1+cc67f22f
- Python version: 3.8
- The installation of DeepSpeed was completed by running:

```
TORCH_CUDA_ARCH_LIST="8.0" DS_BUILD_OPS=1 pip install .
TORCH_CUDA_ARCH_LIST="8.0" DS_BUILD_OPS=1 pip install -e .
```
Additional context
I checked the DeepSpeed source code and found two `free_param(param)` calls in `deepspeed/runtime/zero/partition_parameters.py` (lines 1115 and 1186) where the parameter's data is replaced by `torch.empty(0)`. It seems that the parameters are not restored after this operation and remain empty until the error above occurs.
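Paraphrasing the relevant code (a rough sketch, not the exact source), `free_param` boils down to:

```python
import torch

def free_param(param):
    # After partitioning, the full tensor is released; only the local shard
    # (stored separately as param.ds_tensor) is kept, and param.data becomes
    # an empty placeholder -- the tensor([]) seen in the debug output above.
    param.data = torch.empty(0, dtype=param.dtype, device=param.device)
```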
@cmikeh2 this is another error with `zero_stage=3`.
@publicstaticvo Hello, I also have this error. May I ask if you have solved it?
I deleted the `free_param(param)` call at line 1115 of `deepspeed/runtime/zero/partition_parameters.py` and it seems to work, but I don't know whether it is the right fix. For example, I later ran into a CUDA out-of-memory error, and I don't know whether this change caused it.
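A less invasive idea I have only sketched, not verified: since ZeRO-3 keeps the original shape as metadata on the partitioned parameter (`ds_shape`), the shape lookup in `deepspeed/module_inject/containers/gptj.py` could fall back to it instead of reading the empty placeholder (the helper name here is mine):

```python
def zero3_safe_shape(param):
    # ZeRO-3 replaces param.data with an empty placeholder but records the
    # original shape as param.ds_shape; fall back to .shape for parameters
    # that were never partitioned.
    return getattr(param, 'ds_shape', param.shape)

# e.g. in get_hidden_heads:
# hidden_size = zero3_safe_shape(self.client_module.attn.q_proj.weight)[1]
```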
@SAXSUN which base model are you using? cmikeh2 said in another issue (#3284) that the project currently only supports Meta OPT models.
actor_model: dolly7b, critic_model: opt350m
When will this bug be fixed?