ColossalAI
[BUG]: Pytest with a specific config failed after PR #5868
Is there an existing issue for this bug?
- [X] I have searched the existing issues
🐛 Describe the bug
Main repo `test_shard_llama` fails for these configs:

```python
{'tp_size': 2,
 'pp_size': 2,
 'sp_size': 2,
 'num_microbatches': 2,
 'enable_sequence_parallelism': True,
 'sequence_parallelism_mode': 'ring',
 'enable_flash_attention': True,
 'zero_stage': 1,
 'precision': 'fp16',
 'initial_scale': 1}

{'tp_size': 2,
 'sp_size': 2,
 'pp_size': 2,
 'num_microbatches': 2,
 'enable_sequence_parallelism': True,
 'sequence_parallelism_mode': 'split_gather',
 'enable_flash_attention': False,
 'precision': 'fp16',
 'initial_scale': 1}
```
The failure message is:

```
E   File "/home/nvme-share/home/zhangguangyao/ColossalAI/colossalai/shardformer/modeling/llama.py", line 530, in forward
E     query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
E   File "/home/nvme-share/home/zhangguangyao/hf_transformers/src/transformers/models/llama/modeling_llama.py", line 206, in apply_rotary_pos_emb
E     q_embed = (q * cos) + (rotate_half(q) * sin)
E   RuntimeError: The size of tensor a (16) must match the size of tensor b (8) at non-singleton dimension 2
```
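For context on what this error means: `q * cos` follows NumPy/PyTorch broadcasting rules, which align shapes from the trailing dimension and require each pair of sizes to be equal or 1. The sketch below reproduces the failure with a pure-Python model of that rule. The concrete shapes are assumptions chosen only to match the message (16 vs 8 at dimension 2); a plausible but unconfirmed cause is that `cos`/`sin` were built for a sequence-parallel shard (16 / sp_size=2 = 8) while `q` still carries the full local sequence length.

```python
def broadcast_shapes(a, b):
    """Return the broadcast shape of a and b, or raise a PyTorch-style error."""
    out = []
    # Align trailing dimensions, padding the shorter shape with 1s.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            dim = max(len(a), len(b)) - i
            raise RuntimeError(
                f"The size of tensor a ({da}) must match the size of "
                f"tensor b ({db}) at non-singleton dimension {dim}"
            )
        out.append(max(da, db))
    return tuple(reversed(out))

# Hypothetical shapes (batch, heads, seq_len, head_dim) for illustration:
q_shape = (1, 16, 16, 128)   # q keeps the full local sequence length (16)
cos_shape = (1, 1, 8, 128)   # rotary cache built for a seq-parallel shard (8)

try:
    broadcast_shapes(q_shape, cos_shape)
except RuntimeError as e:
    print(e)  # sizes 16 vs 8 clash at dimension 2, as in the traceback
```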
I have found that this failure was introduced after PR #5868 was merged. Please take a look.
Environment
No response