
[BUG]: Llama3.1 HybridParallelPlugin train failed when pp_size>1

Open · cingtiye opened this issue 1 year ago · 17 comments

Is there an existing issue for this bug?

  • [X] I have searched the existing issues

🐛 Describe the bug

pp=2 tp=2 sp=1 zero_stage=0

[rank6]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/modeling/llama.py", line 93, in llama_model_forward
[rank6]:     input_shape = hidden_states.shape[:-1]
[rank6]: AttributeError: 'NoneType' object has no attribute 'shape'
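
For reference, a minimal sketch of a plugin setup along these lines (argument names follow the public HybridParallelPlugin API, but the exact values and any extra arguments used in the actual script are assumptions):

    from colossalai.booster import Booster
    from colossalai.booster.plugin import HybridParallelPlugin

    # Parallel layout from the report above: pp=2, tp=2, sp=1, zero_stage=0.
    plugin = HybridParallelPlugin(
        tp_size=2,
        pp_size=2,
        zero_stage=0,
        precision="bf16",
        microbatch_size=1,  # pipeline schedules need a microbatch setting
    )
    booster = Booster(plugin=plugin)
    # The model, optimizer, dataloader, etc. are then wrapped via booster.boost(...).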

Environment

transformers 4.39.3
torch 2.4.0a0+3bcc3cddb5.nv24.7
colossalai 0.4.5

cingtiye avatar Nov 02 '24 08:11 cingtiye

Hi, could you share more details about your code with us?

Did you use shardformer itself, or one of our examples?

TongLi3701 avatar Nov 02 '24 09:11 TongLi3701

AutoModelForSequenceClassification


cingtiye avatar Nov 02 '24 10:11 cingtiye


My code is ColossalAI/applications/examples/training_scripts/train_rm.py, but I substitute LlamaForSequenceClassification for RewardModel.

cingtiye avatar Nov 04 '24 01:11 cingtiye


My code is ColossalAI/applications/examples/training_scripts/train_rm.py, but I substitute LlamaForSequenceClassification for RewardModel.

@TongLi3701 Could you please reply to me?

cingtiye avatar Nov 05 '24 08:11 cingtiye

Hi, we are trying to figure it out.

We will run a test on this. Based on my initial guess, it might be because of the following part:

https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/modeling/llama.py#L90C13-L95

We will need to add hidden_states = inputs_embeds to the else branch.

TongLi3701 avatar Nov 06 '24 14:11 TongLi3701

Hi, we are trying to figure it out.

We will run a test on this. Based on my initial guess, it might be because of the following part:

https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/modeling/llama.py#L90C13-L95

We will need to add hidden_states = inputs_embeds to the else branch.

Are you sure? Line 91 is already hidden_states = inputs_embeds:

  if inputs_embeds is None:
      inputs_embeds = self.embed_tokens(input_ids)
  hidden_states = inputs_embeds
  device = hidden_states.device
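
To make the failure mode concrete, here is a hypothetical sketch of the pipeline-stage branching pattern that produces this kind of error. It is not the real llama.py code, just the general shape of the problem:

    # Hypothetical illustration only, not the actual ColossalAI source: on
    # non-first pipeline stages hidden_states must arrive from the previous
    # stage, and .shape[:-1] raises the reported AttributeError when it is None.
    def forward_on_stage(is_first_stage, input_ids=None, inputs_embeds=None,
                         hidden_states=None, embed_tokens=None):
        if is_first_stage:
            if inputs_embeds is None:
                inputs_embeds = embed_tokens(input_ids)  # first stage builds embeddings
            hidden_states = inputs_embeds
        # A later stage that received nothing from the previous stage reaches
        # this line with hidden_states=None and fails exactly as reported.
        input_shape = hidden_states.shape[:-1]
        return hidden_states, input_shape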

cingtiye avatar Nov 08 '24 05:11 cingtiye

Firstly, it seems that ColossalAI/applications/examples/training_scripts/train_rm.py is not found in the main branch. Your error is due to PP stage 2 not receiving input from stage 1. Your case (pp = 2, tp = 2, dp = 2) is indeed covered in unit tests, so you will need to share how your code differs.
To debug, you can use torch.distributed.breakpoint(rank=6) in the PP schedule to check in which case self.recv_forward returns None for input_obj. This will make it easier for us to help you.
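
For example, a small helper along these lines could wrap the receive in the schedule (the recv_forward call is the one mentioned above; the surrounding names are assumptions, not the real ColossalAI internals):

    # Hypothetical debugging helper: drop only the failing rank into pdb when
    # the object received from the previous pipeline stage is None.
    import torch.distributed as dist

    def check_recv(input_obj, failing_rank=6):
        if input_obj is None and dist.is_initialized():
            dist.breakpoint(rank=failing_rank)
        return input_obj

    # Inside the PP schedule this would wrap the receive, e.g.
    #   input_obj = check_recv(self.recv_forward(prev_rank))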

Edenzzzz avatar Nov 10 '24 21:11 Edenzzzz

Firstly, it seems that ColossalAI/applications/examples/training_scripts/train_rm.py is not found in the main branch. Your error is due to PP stage 2 not receiving input from stage 1. Your case (pp = 2, tp = 2, dp = 2) is indeed covered in unit tests, so you will need to share how your code differs. To debug, you can use torch.distributed.breakpoint(rank=6) in the PP schedule to check in which case self.recv_forward returns None for input_obj. This will make it easier for us to help you.

https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_rm.py

cingtiye avatar Nov 11 '24 02:11 cingtiye

Hi, we are trying to figure it out.

We will run a test on this. Based on my initial guess, it might be because of the following part:

https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/modeling/llama.py#L90C13-L95

We will need to add hidden_states = inputs_embeds to the else branch.

[rank1]: Traceback (most recent call last):
[rank1]:   File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/examples/training_scripts/train_rm.py", line 392, in <module>
[rank1]:     train(args)
[rank1]:   File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/examples/training_scripts/train_rm.py", line 320, in train
[rank1]:     trainer.fit(
[rank1]:   File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/coati/trainer/base.py", line 67, in fit
[rank1]:     self._train(epoch)
[rank1]:   File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/coati/trainer/rm.py", line 133, in _train
[rank1]:     reward = self.model(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 220, in forward
[rank1]:     return super().forward(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/interface/model.py", line 25, in forward
[rank1]:     return self.module(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1388, in forward
[rank1]:     transformer_outputs = self.model(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/modeling/llama.py", line 99, in llama_model_forward
[rank1]:     inputs_embeds = self.embed_tokens(input_ids)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]:     return forward_call(*args, **kwargs)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 163, in forward
[rank1]:     return F.embedding(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2267, in embedding
[rank1]:     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank1]: TypeError: embedding(): argument 'weight' (position 1) must be Tensor, not NoneType
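
For what it's worth, this exact TypeError is what you get when an nn.Embedding whose weight parameter has been released (set to None) is called, which is roughly what a pipeline stage that does not own the embedding holds. A hypothetical standalone illustration (not the ColossalAI code):

    # Hypothetical illustration: calling an embedding whose weight is None
    # reproduces the error message from the traceback above.
    import torch
    import torch.nn as nn

    embed_tokens = nn.Embedding(128, 16)
    embed_tokens.weight = None  # simulate a stage that does not hold the weight
    try:
        embed_tokens(torch.tensor([[1, 2, 3]]))
    except TypeError as err:
        print(err)  # embedding(): argument 'weight' (position 1) must be Tensor, not NoneType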

cingtiye avatar Nov 11 '24 07:11 cingtiye

Thank you, we'll fix it soon.

flybird11111 avatar Nov 11 '24 07:11 flybird11111

Thank you, we'll fix it soon.

What’s the progress like?

cingtiye avatar Nov 12 '24 10:11 cingtiye

Thank you, we'll fix it soon.

Could any ColossalAI developers help me? Thanks a lot.

cingtiye avatar Nov 14 '24 02:11 cingtiye

Thank you, we'll fix it soon.

What’s the progress like?

cingtiye avatar Nov 18 '24 06:11 cingtiye

LlamaForSequenceClassification

Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?

flybird11111 avatar Nov 18 '24 07:11 flybird11111

LlamaForSequenceClassification

Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?

I substituted LlamaForSequenceClassification for RewardModel. In fact, it does not run correctly even if I do not substitute RewardModel.

https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_rm.py

    with init_ctx:
        if args.use_flash_attn:
            model = RewardModel(
                args.pretrain,
                torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
                use_flash_attention_2=True,
            )
            coordinator.print_on_master(msg="Flash-attention enabled successfully")
        else:
            model = RewardModel(
                args.pretrain,
            )
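
For reference, the substitution described above might look something like this (num_labels=1 as the scalar reward head and the attention setting are assumptions; the exact arguments used in the modified script are unknown):

    import torch
    from transformers import LlamaForSequenceClassification

    pretrain = "/path/to/llama-3.1"  # placeholder for args.pretrain
    model = LlamaForSequenceClassification.from_pretrained(
        pretrain,
        num_labels=1,  # one scalar score per sequence, used as the reward
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",  # or omit if flash-attn is unavailable
    )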

cingtiye avatar Nov 18 '24 07:11 cingtiye

LlamaForSequenceClassification

Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?

You can reproduce it by running https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_rm.py with pp > 1 on a single node.

cingtiye avatar Nov 18 '24 07:11 cingtiye

LlamaForSequenceClassification

Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?

May I ask if you have run train_rm.py? Have you encountered the same issue as me when pp > 1?

cingtiye avatar Nov 19 '24 07:11 cingtiye