[BUG]: Llama3.1 HybridParallelPlugin train failed when pp_size>1
Is there an existing issue for this bug?
- [X] I have searched the existing issues
🐛 Describe the bug
pp=2 tp=2 sp=1 zero_stage=0
[rank6]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/modeling/llama.py", line 93, in llama_model_forward
[rank6]:     input_shape = hidden_states.shape[:-1]
[rank6]: AttributeError: 'NoneType' object has no attribute 'shape'
Environment
transformers 4.39.3
torch 2.4.0a0+3bcc3cddb5.nv24.7
colossalai 0.4.5
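For context, a minimal sketch of the plugin setup implied by the settings above (argument names follow the HybridParallelPlugin API; the sp_size and precision arguments are assumptions, adjust to the actual training script):

from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin

# Hedged sketch of the reported parallel configuration, not the actual script.
plugin = HybridParallelPlugin(
    tp_size=2,         # tensor parallel
    pp_size=2,         # pipeline parallel -- the failing case
    sp_size=1,         # sequence parallel disabled (assumption)
    zero_stage=0,
    precision="bf16",  # assumption
)
booster = Booster(plugin=plugin)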
Hi, could you share more details about your code with us?
Did you use shardformer directly, or one of our examples?
AutoModelForSequenceClassification
My code is ColossalAI/applications/examples/training_scripts/train_rm.py, but I use LlamaForSequenceClassification as a substitute for RewardModel.
@TongLi3701 Could you please reply to me?
Hi, we are trying to figure it out.
We will run a test on this. Based on my initial guess, it might be because of the following part:
https://github.com/hpcaitech/ColossalAI/blob/main/colossalai/shardformer/modeling/llama.py#L90C13-L95
We will need to add hidden_states = inputs_embeds into the else part.
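Roughly, the suggestion reads like the following (a hedged paraphrase of the branch around L90-L95, not the verbatim ColossalAI source; the stage_manager check reflects my reading of the pipeline-stage split):

# Hedged paraphrase of the suggestion, not the actual source.
if stage_manager.is_first_stage():
    if inputs_embeds is None:
        inputs_embeds = self.embed_tokens(input_ids)
    hidden_states = inputs_embeds
else:
    hidden_states = inputs_embeds  # the line proposed to be added to the else part
input_shape = hidden_states.shape[:-1]  # line 93 in the reported trace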
Are you sure?
Line 91 already has hidden_states = inputs_embeds:
if inputs_embeds is None:
    inputs_embeds = self.embed_tokens(input_ids)
hidden_states = inputs_embeds
device = hidden_states.device
Firstly, it seems that ColossalAI/applications/examples/training_scripts/train_rm.py cannot be found in the main branch. Your error is due to PP stage 2 not receiving input from stage 1. Your case (pp = 2, tp = 2, dp = 2) is indeed covered in unit tests, so you will need to share how your code differs.
To debug, you can use torch.distributed.breakpoint(rank=6) in the PP schedule to check in which case self.recv_forward returns None for input_obj. That will make it easier for us to help you.
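For example, a hedged sketch of how such a breakpoint could be wrapped (the helper name and call site are hypothetical; torch.distributed.breakpoint requires a recent PyTorch, an initialized process group, and must be reached by every participating rank since it synchronizes internally):

import torch.distributed as dist

def inspect_stage_input(input_obj, suspect_rank: int = 6):
    # Hypothetical helper: call it on the value returned by self.recv_forward
    # inside the PP schedule. Only `suspect_rank` is dropped into pdb; the
    # other calling ranks wait at the internal barrier until it continues.
    if input_obj is None:
        print(f"[rank {dist.get_rank()}] recv_forward returned None for input_obj")
    dist.breakpoint(rank=suspect_rank)
    return input_obj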
https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_rm.py
[rank1]: Traceback (most recent call last):
[rank1]: File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/examples/training_scripts/train_rm.py", line 392, in <module>
[rank1]: train(args)
[rank1]: File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/examples/training_scripts/train_rm.py", line 320, in train
[rank1]: trainer.fit(
[rank1]: File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/coati/trainer/base.py", line 67, in fit
[rank1]: self._train(epoch)
[rank1]: File "/data1/Projects/mcts-llm/ColossalAI/applications/ColossalChat/coati/trainer/rm.py", line 133, in _train
[rank1]: reward = self.model(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 220, in forward
[rank1]: return super().forward(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/colossalai/interface/model.py", line 25, in forward
[rank1]: return self.module(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 1388, in forward
[rank1]: transformer_outputs = self.model(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/modeling/llama.py", line 99, in llama_model_forward
[rank1]: inputs_embeds = self.embed_tokens(input_ids)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1552, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1561, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 163, in forward
[rank1]: return F.embedding(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2267, in embedding
[rank1]: return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank1]: TypeError: embedding(): argument 'weight' (position 1) must be Tensor, not NoneType
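This second trace suggests that a non-first pipeline stage reached the self.embed_tokens(input_ids) call even though that stage no longer holds the embedding weight. A hedged check one could run before the forward pass to confirm which ranks actually own the embedding (the helper is hypothetical, and the attribute path assumes the plain LlamaForSequenceClassification layout; adjust if the booster wraps the module):

import torch.distributed as dist

def log_embedding_ownership(model):
    # Hypothetical debugging helper: report whether this rank still holds
    # the embedding weight after pipeline sharding.
    embed = model.model.embed_tokens
    has_weight = getattr(embed, "weight", None) is not None
    print(f"[rank {dist.get_rank()}] embed_tokens weight present: {has_weight}")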
Thank you, we'll fix it soon.
What’s the progress like?
Could any ColossalAI developers help me? Thanks a lot.
LlamaForSequenceClassification
Did you use the weights from LlamaForSequenceClassification, or did you modify the reward model code?
I substituted RewardModel with LlamaForSequenceClassification. In fact, it can't run correctly even if I don't substitute RewardModel.
https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_rm.py
with init_ctx:
    if args.use_flash_attn:
        model = RewardModel(
            args.pretrain,
            torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
            use_flash_attention_2=True,
        )
        coordinator.print_on_master(msg="Flash-attention enabled successfully")
    else:
        model = RewardModel(
            args.pretrain,
        )
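For reference, a hedged sketch of what the substitution might look like (num_labels=1 for a scalar reward head is an assumption; init_ctx and args come from the snippet above, so this is not self-contained):

import torch
from transformers import LlamaForSequenceClassification

# Hedged sketch of swapping RewardModel for LlamaForSequenceClassification.
with init_ctx:
    model = LlamaForSequenceClassification.from_pretrained(
        args.pretrain,
        num_labels=1,  # assumption: a single scalar reward/score head
        torch_dtype=torch.bfloat16 if args.mixed_precision == "bf16" else torch.float16,
    )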
You can try to run https://github.com/hpcaitech/ColossalAI/blob/main/applications/ColossalChat/examples/training_scripts/train_rm.py with pp > 1 on one node.
May I ask if you have run train_rm.py? Have you encountered the same issue as me when pp > 1?