
[BUG]: ColossalAI cannot split tensor evenly when using Sequence Parallelism in HybridParallelPlugin

Open Hugo-cell111 opened this issue 4 months ago • 2 comments

Is there an existing issue for this bug?

  • [x] I have searched the existing issues

The bug has not been fixed in the latest main branch

  • [x] I have checked the latest main branch

Do you feel comfortable sharing a concise (minimal) script that reproduces the error? :)

No, I prefer not to share.

🐛 Describe the bug

I want to finetune a Llama-8B model on 4 H800 GPUs with SP = 4 in HybridParallelPlugin, and I hit the following error:

```
[rank3]: Traceback (most recent call last):
[rank3]:   File "/data/nobody/project/project2/colossal_finetune/Finetune_colossal.py", line 878, in <module>
[rank3]:     main()
[rank3]:   File "/data/nobody/project/project2/colossal_finetune/Finetune_colossal.py", line 874, in main
[rank3]:     train_model(model, train_loader, optimizer, scheduler, booster, tokenizer, args)
[rank3]:   File "/data/nobody/project/project2/colossal_finetune/Finetune_colossal.py", line 763, in train_model
[rank3]:     outputs = model(**batch)
[rank3]:   File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/booster/plugin/hybrid_parallel_plugin.py", line 222, in forward
[rank3]:     return super().forward(*args, **kwargs)
[rank3]:   File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/interface/model.py", line 127, in forward
[rank3]:     return self.module(*args, **kwargs)
[rank3]:   File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank3]:     return self._call_impl(*args, **kwargs)
[rank3]:   File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank3]:     return forward_call(*args, **kwargs)
[rank3]:   File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 329, in llama_for_causal_lm_forward
[rank3]:     outputs = LlamaPipelineForwards.llama_model_forward(
[rank3]:   File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/modeling/llama.py", line 161, in llama_model_forward
[rank3]:     hidden_states = split_forward_gather_backward(
[rank3]:   File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/layer/_operation.py", line 1363, in split_forward_gather_backward
[rank3]:     return SplitForwardGatherBackward.apply(input, dim, process_group, grad_scale, fp8_communication)
[rank3]:   File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank3]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank3]:   File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/layer/_operation.py", line 999, in forward
[rank3]:     return split(input, dim, process_group)
[rank3]:   File "/home/nobody/miniconda3/envs/framework/lib/python3.10/site-packages/colossalai/shardformer/layer/_operation.py", line 1199, in _split
[rank3]:     assert dim_size % world_size == 0, (
[rank3]: AssertionError: The dimension to split (1775) is not a multiple of world size (2), cannot split tensor evenly
```

I have padded the sequences to a length of 4096, but it seems my sequences are being split at different lengths anyway. How can I fix this problem?
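For reference, the padding step looks roughly like this (a minimal sketch assuming a Hugging Face tokenizer; the model id, `texts`, and the exact call are illustrative, not taken from my actual script):

```python
from transformers import AutoTokenizer

# Illustrative sketch of the padding step, not the real training code.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # hypothetical model id
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token

texts = ["example document one", "example document two"]  # placeholder data

batch = tokenizer(
    texts,
    padding="max_length",   # pad every sample to exactly max_length ...
    max_length=4096,        # ... so the sequence dimension should always be 4096
    truncation=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # expected: (batch_size, 4096)
```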

Environment

No response

Hugo-cell111 avatar Aug 26 '25 04:08 Hugo-cell111

Hi, just wondering which mode you are using for SP?

We have implemented several SP modes, such as all-to-all and ring attention, and different modes may require different configs. In this case, it seems that you have also somehow triggered PP as well.
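Roughly, the mode is selected via `sequence_parallelism_mode` when constructing `HybridParallelPlugin`. Two sketch configurations follow (illustrative sizes; argument names as in recent releases, please double-check against the docs for your version):

```python
from colossalai.booster.plugin import HybridParallelPlugin

# Pure sequence parallelism: "all_to_all" (and "ring_attn") use their own sp_size.
plugin_pure_sp = HybridParallelPlugin(
    tp_size=1,
    pp_size=1,
    sp_size=4,
    enable_sequence_parallelism=True,
    sequence_parallelism_mode="all_to_all",    # or "ring_attn"
)

# "split_gather" / "ring" reuse the tensor-parallel group, so they need tp_size > 1.
plugin_tp_sp = HybridParallelPlugin(
    tp_size=4,
    pp_size=1,
    enable_sequence_parallelism=True,
    sequence_parallelism_mode="split_gather",  # or "ring"
)
```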

TongLi3701 avatar Aug 26 '25 05:08 TongLi3701

@TongLi3701 Hi! Thanks for your comment. Today I tested all four SP modes: ring, split_gather, all_to_all, and ring_attn. For all_to_all and ring_attn I tried pure SP and hit the same bug. Since ring and split_gather must be applied together with tensor parallelism, I also tried them with TP enabled, but unfortunately hit the same bug. A quick check of the batch lengths is sketched below.
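This is roughly how I would verify what length actually reaches the forward pass (a sketch only, assuming the batch is a dict with `input_ids` as in the traceback above, and `sp_size = 4` as in my setup):

```python
# Illustrative sanity check, not from the actual script: confirm the sequence length
# entering the model is really the padded 4096 and divisible by the SP size.
sp_size = 4
for step, batch in enumerate(train_loader):
    seq_len = batch["input_ids"].shape[1]
    if step == 0:
        print(f"first batch seq_len = {seq_len}")
    if seq_len % sp_size != 0:
        print(f"step {step}: seq_len={seq_len} is not divisible by sp_size={sp_size}")
```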

Hugo-cell111 avatar Aug 27 '25 06:08 Hugo-cell111