
[BUG] Incompatibility Between DeepSpeed AutoTP and BLOOM in Training of Hugging Face models

Open tarnimat-hatem opened this issue 8 months ago • 3 comments

I’ve encountered an issue while attempting to train BLOOM-1.7B using DeepSpeed’s AutoTP training functionality for Hugging Face models. The same setup works for LLaMA 2 but fails for BLOOM with a shape mismatch in the ALiBi tensor.

To Reproduce

Environment:

  • Model: bigscience/bloom-1b7 (same issue with bloom-560m)
  • Transformers: 4.51.2 (as recommended in the AutoTP blog)
  • DeepSpeed: latest (as of April 2025)
  • PyTorch: 2.2.0
  • CUDA: 12.1
  • GPUs: 2× V100

Tensor Parallel Config:

"tensor_parallel": {
  "autotp_size": 2,
  "replace_with_kernel_inject": true
}

Run command (via SLURM):

deepspeed --num_gpus=2 train.py
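
For context, here is a minimal sketch of what train.py does in this setup (assumed, not the exact reproduction script; the dummy dataset, output_dir, and step count are placeholders). It loads BLOOM through the Hugging Face Trainer and passes DeepSpeed the config above via the deepspeed argument of TrainingArguments:

# Minimal sketch of train.py (assumed; the real script may differ).
# ds_config.json contains the tensor_parallel block shown above.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "bigscience/bloom-1b7"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny dummy dataset just to drive a few training steps.
texts = ["DeepSpeed AutoTP reproduction sentence."] * 16
dataset = Dataset.from_dict(
    dict(tokenizer(texts, truncation=True, padding="max_length", max_length=32))
)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    max_steps=10,
    deepspeed="ds_config.json",  # AutoTP config from above
)
Trainer(model=model, args=args, train_dataset=dataset,
        data_collator=collator).train()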

Got this error:

RuntimeError: The expanded size of the tensor (8) must match the existing size (16) at non-singleton dimension 0.  Target sizes: [8, 512, 512].  Tensor sizes: [16, 1, 512]

It points to:

attention_scores = alibi.baddbmm(...)
  • The ALiBi shape mismatch appears to come from head duplication when pretraining_tp=2 and AutoTP are combined, but BLOOM isn’t automatically patched the way LLaMA is (see the sketch after this list).
  • Downgrading transformers to 4.43.4 does fix inference, but not training with AutoTP, which requires transformers >= 4.50.1 per your blog.
  • The DeepSpeed documentation lists BLOOM as supported, but BLOOM support was explicitly dropped after 4.43.4 due to the ALiBi incompatibility:

Transformers version 4.51.2 exceeds version 4.43.4! After transformers version 4.43.4, BLOOM inference with DeepSpeed is no longer supported.
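
To make the failure mode concrete, here is a standalone PyTorch sketch of the shape clash (an illustration with the numbers from the error above, not BLOOM's actual attention code): with 2-way TP each rank holds 8 of BLOOM's 16 heads, but an ALiBi bias built for all 16 heads cannot broadcast against the per-rank baddbmm result.

import torch

# Shapes mirror the reported error: 2-way TP leaves 8 of BLOOM's 16 heads
# on each rank, but the ALiBi bias is still built for all 16 heads.
batch, seq, head_dim = 1, 512, 128
heads_total, tp_size = 16, 2
heads_per_rank = heads_total // tp_size  # 8

query = torch.randn(batch * heads_per_rank, seq, head_dim)  # [8, 512, 128]
key = torch.randn(batch * heads_per_rank, head_dim, seq)    # [8, 128, 512]
alibi = torch.randn(batch * heads_total, 1, seq)            # [16, 1, 512], not sliced per rank

# Raises: RuntimeError: The expanded size of the tensor (8) must match the
# existing size (16) at non-singleton dimension 0.  Target sizes:
# [8, 512, 512].  Tensor sizes: [16, 1, 512]
attention_scores = alibi.baddbmm(query, key)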

What works

  • BLOOM inference with DeepSpeed AutoTP + Transformers <= 4.43.4
  • AutoTP training with LLaMA 2 + Transformers >= 4.50.1

What doesn't work

  • BLOOM training with AutoTP + Transformers >= 4.50.1

Expected behavior

If BLOOM is listed as AutoTP-compatible, it should:

  • Either patch the ALiBi tensor sizing dynamically during training, or reject the configuration with a clear error (a sketch of a possible patch follows this list)
  • Or state a clear transformers version constraint for training in the docs
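
For illustration only, here is a sketch of what "patch dynamically" could mean here (hypothetical; slice_alibi_for_rank is not a DeepSpeed or Transformers API): slice the full ALiBi bias down to the heads owned by the current tensor-parallel rank before the baddbmm.

import torch

def slice_alibi_for_rank(alibi, batch, heads_total, tp_rank, tp_size):
    # Hypothetical helper: reduce a bias built for all heads,
    # [batch * heads_total, 1, seq], to this rank's slice,
    # [batch * heads_per_rank, 1, seq].
    heads_per_rank = heads_total // tp_size
    seq = alibi.shape[-1]
    alibi = alibi.view(batch, heads_total, 1, seq)
    start = tp_rank * heads_per_rank
    return alibi[:, start:start + heads_per_rank].reshape(
        batch * heads_per_rank, 1, seq)

batch, seq, head_dim = 1, 512, 128
heads_total, tp_size = 16, 2
heads_per_rank = heads_total // tp_size

query = torch.randn(batch * heads_per_rank, seq, head_dim)
key = torch.randn(batch * heads_per_rank, head_dim, seq)
alibi = torch.randn(batch * heads_total, 1, seq)

sliced = slice_alibi_for_rank(alibi, batch, heads_total, tp_rank=0, tp_size=tp_size)
attention_scores = sliced.baddbmm(query, key)  # broadcasts cleanly: [8, 512, 512]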

System info (please complete the following information):

  • OS: Rocky Linux 9.4
  • GPU count and types: one machine with 2× V100s

Launcher context: DeepSpeed launcher via SLURM.

tarnimat-hatem avatar Apr 23 '25 11:04 tarnimat-hatem

Could you try with replace_with_kernel_inject=False?

inkcherry avatar Apr 25 '25 05:04 inkcherry

> Could you try with replace_with_kernel_inject=False?

@inkcherry I tried that and still got the same error:

[rank0]: RuntimeError: The expanded size of the tensor (8) must match the existing size (16) at non-singleton dimension 0.  Target sizes: [8, 512, 512].  Tensor sizes: [16, 1, 512]

tarnimat-hatem avatar Apr 29 '25 12:04 tarnimat-hatem