ColossalAI
[BUG]: Pipeline Parallelism fails when input shape varies
Is there an existing issue for this bug?
- [X] I have searched the existing issues
🐛 Describe the bug
Pipeline parallelism fails when the input shape varies between iterations, e.g.:

```python
for batch in iter:
    # batch 1: bs*seq = 1*128
    # batch 2: bs*seq = 1*129
    outputs = booster.execute_pipeline(batch, model)
```
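For context, a fuller reproduction sketch along the lines of the snippet above. The tiny LLaMA config, plugin arguments, optimizer, and launch call are my own assumptions rather than taken from my actual script; only the `execute_pipeline` call and the 1*128 / 1*129 shapes come from it. Launched with torchrun across 2 GPUs:

```python
import colossalai
import torch
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin
from transformers import LlamaConfig, LlamaForCausalLM

# Distributed init (older ColossalAI versions also require a config dict here).
colossalai.launch_from_torch()

# Tiny LLaMA just to exercise the pipeline schedule; sizes are illustrative,
# hidden_size=768 matches the shapes in the traceback below.
config = LlamaConfig(hidden_size=768, intermediate_size=1536,
                     num_hidden_layers=4, num_attention_heads=12)
model = LlamaForCausalLM(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def criterion(outputs, inputs):
    # LlamaForCausalLM computes the loss itself when labels are provided.
    return outputs.loss

# 2 pipeline stages, no tensor parallelism, one micro-batch per step (illustrative).
plugin = HybridParallelPlugin(tp_size=1, pp_size=2, microbatch_size=1)
booster = Booster(plugin=plugin)
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion=criterion)

# Two steps whose sequence lengths differ: 1*128, then 1*129.
for seq_len in (128, 129):
    input_ids = torch.randint(0, 1000, (1, seq_len)).cuda()
    batch = iter([{"input_ids": input_ids, "labels": input_ids.clone()}])
    # The second step fails in backward_by_grad with the shape mismatch below.
    outputs = booster.execute_pipeline(batch, model, criterion, optimizer, return_loss=True)
    optimizer.step()
    optimizer.zero_grad()
```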
Error message:
File "/home/zhangguangyao/colossal_llama_sp/ColossalAI/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 809, in backward_by_grad
super().backward_by_grad(tensor, grad)
File "/home/zhangguangyao/colossal_llama_sp/ColossalAI/colossalai/zero/low_level/low_level_optim.py", line 436, in backward_by_grad
torch.autograd.backward(tensor, grad)
File "/home/zhangguangyao/miniconda3/envs/llama_sp/lib/python3.10/site-packages/torch/autograd/__init__.py", line 244, in backward
grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
File "/home/zhangguangyao/miniconda3/envs/llama_sp/lib/python3.10/site-packages/torch/autograd/__init__.py", line 88, in _make_grads
raise RuntimeError(
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([1, 128, 768]) and output[0] has a shape of torch.Size([1, 129, 768]).
Environment
ColossalAI master branch
Have you tried setting `enable_metadata_cache` to `False`?
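In case it helps, a sketch of where that flag would go, assuming it is passed to HybridParallelPlugin at construction time (the other arguments are placeholders, not a recommended configuration):

```python
from colossalai.booster.plugin import HybridParallelPlugin

plugin = HybridParallelPlugin(
    tp_size=1,                    # placeholder values
    pp_size=2,
    microbatch_size=1,
    enable_metadata_cache=False,  # assumption: stops reusing cached p2p shape metadata,
                                  # so varying input shapes are re-communicated each step
)
```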