
self.client_module.attn.q_proj.weight.shape[1] returns IndexError: tuple index out of range

publicstaticvo opened this issue 1 year ago • 4 comments

Describe the bug
I am getting the following error while attempting to run DeepSpeed-Chat step 3 with the actor model CarperAI/openai_summarize_tldr_sft (GPT-J 6B), the critic model CarperAI/openai_summarize_tldr_rm_checkpoint (GPT-J 6B), and ZeRO stage 3.

Traceback (most recent call last):
  File "main.py", line 523, in <module>
    main()
  File "main.py", line 394, in main
    rlhf_engine = DeepSpeedRLHFEngine(
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step3_rlhf_finetuning/rlhf_engine.py", line 49, in __init__
    self.actor = self._init_actor(actor_model_name_or_path=actor_model_name_or_path)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step3_rlhf_finetuning/rlhf_engine.py", line 115, in _init_actor
    actor_engine, *_ = deepspeed.initialize(model=actor_model,
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/__init__.py", line 144, in initialize
    engine = DeepSpeedHybridEngine(args=args,
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 52, in __init__
    self.create_inference_module()
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 326, in create_inference_module
    self.create_inference_containers(self.module)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 296, in create_inference_containers
    self.create_inference_containers(child, layer_id=layer_id)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 296, in create_inference_containers
    self.create_inference_containers(child, layer_id=layer_id)
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 276, in create_inference_containers
    self._inference_containers.append(self.inference_policies[child.__class__][0](
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/runtime/hybrid_engine.py", line 99, in new_inference_container
    _container.create_ds_model_config()
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/module_inject/containers/base.py", line 79, in create_ds_model_config
    self.set_hidden_heads(*self.policy.get_hidden_heads())
  File "/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed/module_inject/containers/gptj.py", line 73, in get_hidden_heads
    return self.client_module.attn.q_proj.weight.shape[1], \
IndexError: tuple index out of range

Adding print(self.client_module.attn.q_proj.weight) and print(self.client_module.attn.q_proj.weight.shape) right above return self.client_module.attn.q_proj.weight.shape[1] prints Parameter containing: tensor([], device='cuda:0', dtype=torch.float16, requires_grad=True) and torch.Size([0]). It seems the model's parameters are missing while the DeepSpeed engine is being initialized.
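For anyone hitting the same trace: the crash itself is just tuple indexing on an emptied parameter. Below is a plain-Python stand-in (no torch or DeepSpeed required; the 4096 hidden size is an illustrative assumption, and hidden_size is a hypothetical helper mirroring only the failing line) showing why shape[1] raises once ZeRO-3 has replaced the local weight with a zero-element tensor:

```python
# torch.Size is a tuple subclass, so a partitioned (emptied) parameter
# with shape torch.Size([0]) behaves like the one-element tuple (0,).

def hidden_size(shape):
    # Mirrors `return self.client_module.attn.q_proj.weight.shape[1]`
    # from deepspeed/module_inject/containers/gptj.py.
    return shape[1]

full_shape = (4096, 4096)   # hypothetical gathered q_proj weight shape
partitioned_shape = (0,)    # what ZeRO-3 leaves behind locally

print(hidden_size(full_shape))      # 4096

try:
    hidden_size(partitioned_shape)
except IndexError as exc:
    print(exc)                      # tuple index out of range
```

So the bug is not in the indexing itself; the question is why the hybrid engine reads the shape before the parameter has been gathered.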

ds_report output


--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
cpu_adam ............... [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
random_ltd ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
spatial_inference ...... [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/data/nt12_ssd_gluster/myself/miniconda3/lib/python3.8/site-packages/torch']
torch version .................... 1.10.0+cu113
deepspeed install path ........... ['/data/nt12_ssd_gluster/myself/yts/dc/training/step1_supervised_finetuning/DeepSpeed/deepspeed']
deepspeed info ................... 0.9.1+cc67f22f, cc67f22f, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.10, cuda 11.3

System info (please complete the following information):

  • OS: Ubuntu 18.04
  • GPU count and types: single node 8*A100
  • Deepspeed version: 0.9.1+cc67f22f
  • Python version: 3.8
  • DeepSpeed was installed by running:
    TORCH_CUDA_ARCH_LIST="8.0" DS_BUILD_OPS=1 pip install .
    TORCH_CUDA_ARCH_LIST="8.0" DS_BUILD_OPS=1 pip install -e .

Additional context
I checked the DeepSpeed source and found two free_param(param) calls in deepspeed/runtime/zero/partition_parameters.py, at lines 1115 and 1186, where the parameters are turned into torch.empty(0). It seems the parameters are not restored after this operation and remain empty until the error above occurs.
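To illustrate the analysis above: the sketch below is a plain-Python mock of the pattern, not DeepSpeed code. The names free_param, MockParam, and gathered are stand-ins (gathered is a rough analogue of DeepSpeed's GatheredParameters context manager). The point is that under ZeRO-3 the full shape is only readable while the parameter is gathered, which is also why simply deleting free_param can appear to fix the crash while leaving the full weights resident in memory:

```python
from contextlib import contextmanager

class MockParam:
    """Minimal stand-in for a torch Parameter under ZeRO stage 3."""
    def __init__(self, shape):
        self.shape = shape

def free_param(param):
    # Mimics the behaviour described above: after partitioning, the local
    # storage is released, so the shape collapses to torch.Size([0]).
    param.shape = (0,)

@contextmanager
def gathered(param, full_shape):
    # Rough analogue of deepspeed.zero.GatheredParameters: materialize the
    # full parameter inside the block, re-partition it on exit.
    param.shape = full_shape
    try:
        yield param
    finally:
        free_param(param)

p = MockParam((4096, 4096))
free_param(p)                  # what ZeRO-3 does during init
with gathered(p, (4096, 4096)):
    print(p.shape[1])          # 4096 -- safe to read inside the gather
print(p.shape)                 # (0,) -- partitioned again outside
```

Under this reading, the hybrid engine's get_hidden_heads() reads the shape outside any gather, so the real fix would be to gather the parameter around the shape access rather than to stop freeing it.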

publicstaticvo avatar Apr 18 '23 08:04 publicstaticvo

@cmikeh2 this is another error with zero_stage=3

publicstaticvo avatar Apr 20 '23 17:04 publicstaticvo

@publicstaticvo Hello, I also have this error. May I ask if you have solved it?

SAXSUN avatar Apr 21 '23 02:04 SAXSUN

I deleted free_param(param) at line 1115 of deepspeed/runtime/zero/partition_parameters.py and it seems to work, but I don't know whether that is the right fix. For example, I later hit a CUDA out-of-memory error, and I don't know whether the deletion caused it.

@SAXSUN which base model are you using? cmikeh2 said in another issue (#3284) that the project currently only supports Meta OPT models.

publicstaticvo avatar Apr 21 '23 10:04 publicstaticvo

Actor model: dolly7b; critic model: opt350m.

SAXSUN avatar Apr 21 '23 11:04 SAXSUN

When will this bug be fixed?

ciayomin avatar Jun 30 '23 07:06 ciayomin