DeepSpeedExamples
AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group' in step 3.
File "/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 296, in create_inference_containers
    self.create_inference_containers(child, layer_id=layer_id)
File "/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 276, in create_inference_containers
    self._inference_containers.append(self.inference_policies[child.__class__][0](
        child, self.inference_policies[child.__class__][-1], layer_id))
File "/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 97, in new_inference_container
    _container.set_tensor_parallel_config(self._config.hybrid_engine.inference_tp_size, self.mp_group)
File "/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 454, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group'
I'll add more details on my experiment: when I'm training the 13b model, it exits with the exception above. Command line (I've already finished step 1 and step 2):

$ python3 train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node --step 3

I can skip this error by removing `--enable_hybrid_engine`, but it then exits with a GPU memory OOM (I'm using an A100 80G; according to the documentation, it should work).

Related issue: #375
same problem.
Hi @wooparadog and @Arain-sh,
Can you please give me the config that you are running this with so that I can try to reproduce it on my end? Thanks, Reza
I tried the same thing using this script and could not reproduce the issue on my end!
It seems the actor and critic models are not specified in that script. I used `--actor-model facebook/opt-13b --reward-model facebook/opt-350m` as @Arain-sh mentioned and was able to reproduce the issue.
I think the issue lies here, with two possible conditions (see the sketch below):

- `inference_tp_size` > 1 and `inference_tp_size` > `world_size`: `num_mp_groups` becomes 0, so the `self.mp_group` assignment is skipped.
- `inference_tp_size` > 1 and `world_size` is not divisible by `inference_tp_size`: workers with `global_rank` >= `num_mp_groups * inference_tp_size` are never placed in any group, so their `self.mp_group` assignment is skipped too.
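For reference, here is a minimal sketch of the group-creation logic as I understand it (the function and variable names are my own approximation, not the exact DeepSpeed source); it shows how both conditions above leave the tensor-parallel group unassigned:

```python
import torch.distributed as dist

def build_mp_group(world_size: int, global_rank: int, inference_tp_size: int):
    """Return this rank's tensor-parallel group, or None if it never gets one."""
    mp_group = None
    num_mp_groups = world_size // inference_tp_size
    for group_id in range(num_mp_groups):
        ranks = list(range(group_id * inference_tp_size,
                           (group_id + 1) * inference_tp_size))
        group = dist.new_group(ranks)  # collective: every rank must call this
        if global_rank in ranks:
            mp_group = group
    # Condition 1: inference_tp_size > world_size -> num_mp_groups == 0,
    #   the loop body never runs and mp_group stays None.
    # Condition 2: world_size % inference_tp_size != 0 -> ranks >=
    #   num_mp_groups * inference_tp_size appear in no `ranks` list,
    #   so their mp_group also stays None.
    return mp_group
```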
@zw0610 Setting `inference_tp_size=1` solved the `mp_group` AttributeError, but another error occurs in the hybrid engine:
File "/home/formath/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
    seq = self.actor_model.module.generate(prompts,
File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 250, in generate
    with GatheredParameters(non_active_layers):
File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1667, in __enter__
    self.params[0].all_gather(param_list=self.params)
File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 890, in all_gather
    return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1092, in _all_gather
    ret_value = self._allgather_params_coalesced(all_gather_list, hierarchy)
File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1289, in _allgather_params_coalesced
    handle = _no_gather_coalesced(param_list)
File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 605, in _no_gather_coalesced
    raise RuntimeError(param.ds_summary())
RuntimeError: {'id': 653, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 655360, 'shape': (0,), 'ds_shape': (5120, 128), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([655360])}
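For anyone who wants to try the `inference_tp_size=1` workaround mentioned above: the value lives in the `hybrid_engine` section of the DeepSpeed config (the first traceback accesses `self._config.hybrid_engine.inference_tp_size`). Here is a rough sketch of the relevant fragment; all other keys and values are illustrative placeholders, not a complete working config:

```python
# Sketch of the relevant DeepSpeed config fragment. Only the
# "hybrid_engine" / "inference_tp_size" keys are taken from the traceback's
# attribute path; the rest are illustrative placeholders.
ds_config = {
    "train_batch_size": 32,       # placeholder
    "fp16": {"enabled": True},
    "hybrid_engine": {
        "enabled": True,          # keep the hybrid engine on
        "inference_tp_size": 1,   # workaround: no tensor parallelism during generation
    },
}
```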
I ran deepspeed-chat with actor-13b and reward-350m successfully by doing the following:

- Installing deepspeed from source using the master branch, to fix https://github.com/microsoft/DeepSpeed/issues/3156
- Removing `--enable_hybrid_engine`
- Adding `--offload_reference_model`
Try adjusting `--inference_tp_size` to a lower number; it may be that you don't have enough GPUs across your nodes.
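In other words, a quick local check like the one below (my own helper, not part of DeepSpeed) tells you whether a given `--inference_tp_size` would leave some ranks without an `mp_group`, based on the two failure conditions discussed above:

```python
def tp_size_covers_all_ranks(world_size: int, inference_tp_size: int) -> bool:
    """Every rank gets a tensor-parallel group only when inference_tp_size is
    at least 1, no larger than world_size, and divides world_size evenly."""
    return 1 <= inference_tp_size <= world_size and world_size % inference_tp_size == 0

# Example with 8 GPUs on a single node:
assert tp_size_covers_all_ranks(8, 4)        # two TP groups of 4 ranks
assert not tp_size_covers_all_ranks(8, 16)   # tp size exceeds world size (condition 1)
assert not tp_size_covers_all_ranks(8, 3)    # 8 is not divisible by 3 (condition 2)
```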