
AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group' in step 3.

Arain-sh opened this issue · 9 comments

  File "/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 296, in create_inference_containers
    self.create_inference_containers(child, layer_id=layer_id)
  File "/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 276, in create_inference_containers
    self._inference_containers.append(self.inference_policies[child.__class__][0](
        child, self.inference_policies[child.__class__][-1], layer_id))
  File "/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 97, in new_inference_container
    _container.set_tensor_parallel_config(self._config.hybrid_engine.inference_tp_size, self.mp_group)
  File "/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 454, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group'

Arain-sh · Apr 21 '23

More details on my experiment: when training the 13b model, step 3 exits with the exception above. Command line (steps 1 and 2 already completed):

$ python3 train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node --step 3

I can avoid this error by removing --enable-hybrid-engine, but then it exits with a GPU out-of-memory error (I'm using an A100 80G, which according to the documentation should be enough).
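(For reference, --enable-hybrid-engine turns on the hybrid_engine section of the DeepSpeed config. The key names below follow DeepSpeed's hybrid-engine config; the values are only illustrative:)

    # Illustrative hybrid_engine block of a DeepSpeed config;
    # values here are examples, not the thread's actual settings.
    ds_config = {
        "hybrid_engine": {
            "enabled": True,                  # what --enable-hybrid-engine toggles
            "max_out_tokens": 512,
            "inference_tp_size": 1,           # tensor-parallel size during generation
            "release_inference_cache": False,
            "pin_parameters": True,
            "tp_gather_partition_size": 8,
        },
    }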

related issue: #375

wooparadog · Apr 21 '23

> More details on my experiment: when training the 13b model, step 3 exits with the exception above. Command line (steps 1 and 2 already completed):
>
> $ python3 train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node --step 3
>
> I can avoid this error by removing --enable-hybrid-engine, but then it exits with a GPU out-of-memory error (I'm using an A100 80G, which according to the documentation should be enough).

Same problem here.

Arain-sh · Apr 21 '23

Hi @wooparadog and @Arain-sh,

Could you please share the config you are running so that I can try to reproduce this on my end? Thanks, Reza

RezaYazdaniAminabadi · Apr 24 '23

I tried the same on my side using this script and could not reproduce the issue!

RezaYazdaniAminabadi · Apr 24 '23

> I tried the same on my side using this script and could not reproduce the issue!

It seems the actor and critic models are not specified in that script.

I used --actor-model facebook/opt-13b --reward-model facebook/opt-350m as @Arain-sh mentioned and was able to reproduce the issue.

zw0610 · Apr 25 '23

I think the issue lies in how the hybrid engine assigns self.mp_group; there are two conditions under which the assignment is skipped (see the sketch below):

  1. inference_tp_size > 1 and inference_tp_size > world_size: num_mp_groups (world_size // inference_tp_size) becomes 0, so the loop that assigns self.mp_group never runs.

  2. inference_tp_size > 1 and world_size not divisible by inference_tp_size: workers with global_rank >= num_mp_groups * inference_tp_size fall outside every group, so the self.mp_group assignment is skipped for them too.
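To make the failure mode concrete, here is a runnable paraphrase of that assignment pattern (approximate, not the verbatim DeepSpeed source; build_mp_group and its return values are illustrative stand-ins for the real group construction):

    def build_mp_group(world_size, global_rank, tp_size):
        """Paraphrase of the hybrid engine's mp_group setup (illustrative).
        Returns the model-parallel group id this rank joins, or None when
        the assignment is skipped, mirroring the unset-attribute failure."""
        if tp_size <= 1:
            return "default"                   # tp disabled: a default group is used
        num_mp_groups = world_size // tp_size  # condition 1: 0 when tp_size > world_size
        for group_id in range(num_mp_groups):
            ranks = range(group_id * tp_size, (group_id + 1) * tp_size)
            if global_rank in ranks:
                return group_id                # self.mp_group gets assigned here
        return None                            # condition 2: ranks >= num_mp_groups * tp_size
                                               # fall through without any assignment

    # Both of the conditions above leave mp_group unset:
    assert build_mp_group(world_size=4, global_rank=0, tp_size=8) is None  # tp_size > world_size
    assert build_mp_group(world_size=8, global_rank=7, tp_size=3) is None  # 8 % 3 != 0; ranks 6-7 uncovered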

zw0610 · Apr 25 '23

@zw0610 Setting inference_tp_size=1 solved the "no attribute mp_group" problem, but another error occurs in the hybrid engine:

  File "/home/formath/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
    seq = self.actor_model.module.generate(prompts,
  File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 250, in generate
    with GatheredParameters(non_active_layers):
  File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1667, in __enter__
    self.params[0].all_gather(param_list=self.params)
  File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 890, in all_gather
    return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
  File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1092, in _all_gather
    ret_value = self._allgather_params_coalesced(all_gather_list, hierarchy)
  File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1289, in _allgather_params_coalesced
    handle = _no_gather_coalesced(param_list)
  File "/conda/envs/py39/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 605, in _no_gather_coalesced
    raise RuntimeError(param.ds_summary())
RuntimeError: {'id': 653, 'status': 'INFLIGHT', 'numel': 0, 'ds_numel': 655360, 'shape': (0,), 'ds_shape': (5120, 128), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([655360])}
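(For context: GatheredParameters is DeepSpeed's ZeRO-3 context manager that temporarily reassembles partitioned weights, and the 'INFLIGHT' status in the ds_summary above means a previous all-gather of that parameter never completed before a new one was issued. A minimal sketch of its normal use, with engine and inputs as placeholders for a deepspeed-initialized model and a batch:)

    import deepspeed

    # Minimal sketch: materialize ZeRO-3 partitioned weights inside the block.
    # 'engine' and 'inputs' are placeholders, not from the thread above.
    with deepspeed.zero.GatheredParameters(list(engine.module.parameters())):
        out = engine.module(inputs)  # full (ds_shape) weights are visible here
    # on exit the weights are re-partitioned (local shape goes back to (0,))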

formath · Jul 12 '23

> More details on my experiment: when training the 13b model, step 3 exits with the exception above. Command line (steps 1 and 2 already completed):
>
> $ python3 train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node --step 3
>
> I can avoid this error by removing --enable-hybrid-engine, but then it exits with a GPU out-of-memory error (I'm using an A100 80G, which according to the documentation should be enough).
>
> related issue: #375

I ran DeepSpeed-Chat with the 13b actor and 350m reward model successfully with these changes (the resulting invocation is sketched below):

  • Install DeepSpeed from source using the master branch, which fixes https://github.com/microsoft/DeepSpeed/issues/3156
  • Remove --enable-hybrid-engine
  • Add --offload_reference_model
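Roughly, the working setup then looks like this (a sketch based on the bullets above; the hybrid-engine flag lives in the step-3 training script that train.py invokes, and exact flag spellings may differ by version):

$ pip install git+https://github.com/microsoft/DeepSpeed.git    # build from master for the #3156 fix
$ python3 train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node --step 3
  (with --enable-hybrid-engine removed and --offload_reference_model added in the step-3 training script)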

formath · Jul 12 '23

Try adjusting --inference_tp_size to a lower number; you may not have enough GPUs across your nodes for the value you set. A quick check is sketched below.
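Following zw0610's analysis above, a valid value must divide the total GPU count and not exceed it (pure arithmetic):

    # valid inference_tp_size values for a given GPU count
    world_size = 8                # e.g. one node with 8 GPUs
    valid = [tp for tp in range(1, world_size + 1) if world_size % tp == 0]
    print(valid)                  # -> [1, 2, 4, 8]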

kkk935208447 · Mar 18 '24