
[BUG] AutoTP: incorrect total train batch size when using the huggingface trainer API

Open cynricfu opened this issue 7 months ago • 5 comments

Describe the bug
When I try to use AutoTP with the huggingface trainer API, the trainer reports a seemingly incorrect total train batch size value.

My setup:

  • 2x 8-GPU nodes
  • ZeRO stage 2
  • autotp_size = 2
  • per_device_train_batch_size = 4
  • gradient_accumulation_steps = 1

For the setup above, based on my understanding, the total train batch size should be 2 (nodes) * 8 (gpus_per_node) / 2 (autotp_size) * 4 (per_device_train_batch_size) = 32. But instead, the trainer reports 64, which is quite weird to me.

Even if the total train batch size really is 64, it still does not seem to match the reported total optimization steps (shown below).
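
To make the mismatch concrete, here is the arithmetic as a small Python sketch (the 2,442 examples and 231 steps come from the log further below; the per-epoch step count assumes simple ceil-style rounding, which may not exactly match the trainer's dataloader internals):

import math

# Setup from above
nodes = 2
gpus_per_node = 8
autotp_size = 2
per_device_train_batch_size = 4
gradient_accumulation_steps = 1

# With tensor parallelism, ranks in the same TP group see the same samples,
# so only the data-parallel groups contribute to the global batch.
world_size = nodes * gpus_per_node          # 16 ranks
dp_world_size = world_size // autotp_size   # 8 data-parallel groups

expected_total_batch = dp_world_size * per_device_train_batch_size * gradient_accumulation_steps
print(expected_total_batch)                 # 32, not the reported 64

# Cross-check against the reported "Total optimization steps = 231"
num_examples, num_epochs = 2442, 3
print(math.ceil(num_examples / 32) * num_epochs)  # 77 * 3 = 231 -> matches the log
print(math.ceil(num_examples / 64) * num_epochs)  # 39 * 3 = 117 -> does not match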

Although training launches and finishes successfully, I am still quite worried about this mismatched batch size. Can I safely ignore it? If so, what is the actual total train batch size? If not, could someone investigate why it is reported this way and perhaps provide a fix or suggestion?

To Reproduce
Run the following command with the huggingface example CLM training script run_clm.py:

deepspeed --hostfile=myhostfile run_clm.py \
    --deepspeed ds_config_autotp2.json \
    --model_name_or_path Qwen/Qwen3-0.6B \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --do_train \
    --do_eval \
    --output_dir test-autotp

My ds_config_autotp2.json:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "gather_16bit_weights_on_model_save": true
    },
    "tensor_parallel": {
        "autotp_size": 2
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100
}

Expected behavior
The training script should launch and finish successfully, but the reported total train batch size is wrong based on my understanding.

Below is the total train batch size report from my run:

node028: [INFO|trainer.py:2414] 2025-05-21 01:43:49,230 >> ***** Running training *****
node028: [INFO|trainer.py:2415] 2025-05-21 01:43:49,230 >>   Num examples = 2,442
node028: [INFO|trainer.py:2416] 2025-05-21 01:43:49,231 >>   Num Epochs = 3
node028: [INFO|trainer.py:2417] 2025-05-21 01:43:49,231 >>   Instantaneous batch size per device = 4
node028: [INFO|trainer.py:2420] 2025-05-21 01:43:49,231 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
node028: [INFO|trainer.py:2421] 2025-05-21 01:43:49,231 >>   Gradient Accumulation steps = 1
node028: [INFO|trainer.py:2422] 2025-05-21 01:43:49,231 >>   Total optimization steps = 231
node028: [INFO|trainer.py:2423] 2025-05-21 01:43:49,232 >>   Number of trainable parameters = 375,848,960
node032: [INFO|trainer.py:2414] 2025-05-21 01:43:53,413 >> ***** Running training *****
node032: [INFO|trainer.py:2415] 2025-05-21 01:43:53,413 >>   Num examples = 2,442
node032: [INFO|trainer.py:2416] 2025-05-21 01:43:53,413 >>   Num Epochs = 3
node032: [INFO|trainer.py:2417] 2025-05-21 01:43:53,413 >>   Instantaneous batch size per device = 4
node032: [INFO|trainer.py:2420] 2025-05-21 01:43:53,413 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
node032: [INFO|trainer.py:2421] 2025-05-21 01:43:53,413 >>   Gradient Accumulation steps = 1
node032: [INFO|trainer.py:2422] 2025-05-21 01:43:53,413 >>   Total optimization steps = 231
node032: [INFO|trainer.py:2423] 2025-05-21 01:43:53,413 >>   Number of trainable parameters = 375,848,960

ds_report output

[2025-05-21 01:57:42,404] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
dc ..................... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
 [WARNING]  FP Quantizer is using an untested triton version (3.2.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
 [WARNING]  gds requires the dev libaio .so object and headers but these were not found.
 [WARNING]  gds: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.6
 [WARNING]  using untested triton version (3.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/scratch/dyvm6xra/dyvm6xrauser04/miniforge3/envs/test_1/lib/python3.12/site-packages/torch']
torch version .................... 2.6.0+cu124
deepspeed install path ........... ['/scratch/dyvm6xra/dyvm6xrauser04/miniforge3/envs/test_1/lib/python3.12/site-packages/deepspeed']
deepspeed info ................... 0.16.7, unknown, unknown
torch cuda version ............... 12.4
torch hip version ................ None
nvcc version ..................... 12.4
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 1007.78 GB

System info:

  • OS: Ubuntu 22.04.4 LTS
  • 2 machines with 8x H800 GPUs each
  • Python 3.12
  • transformers 4.51.3
  • accelerate 1.6.0
  • deepspeed 0.16.7
  • torch 2.6.0+cu124

cynricfu avatar May 20 '25 18:05 cynricfu

@inkcherry Do you know what might cause this inconsistency?

delock avatar May 22 '25 03:05 delock

Hi @cynricfu, thanks for the report. This is likely due to the Transformers display logic using total_batch_size without accounting for dp_world_size != world_size. You can safely ignore it for now; it only affects the displayed value. I just sent a fix to transformers.
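
Roughly speaking, the display path multiplies the per-device batch size by the full world size, but with AutoTP only the data-parallel groups see distinct samples. A minimal sketch of the idea (illustrative only, not the actual transformers code; the function name and the fixed flag are made up for this example):

# Illustrative sketch only; not the actual transformers implementation.
def reported_total_batch_size(per_device_bs, grad_accum, world_size, tp_size=1, fixed=False):
    # Before the fix: the full world size is used, so TP-replicated ranks are
    # counted as if they processed distinct data (16 * 4 * 1 = 64 in the report above).
    # After the fix: divide out the TP degree so only the DP world size counts (8 * 4 * 1 = 32).
    dp_world_size = world_size // tp_size if fixed else world_size
    return per_device_bs * grad_accum * dp_world_size

print(reported_total_batch_size(4, 1, 16, tp_size=2, fixed=False))  # 64 (what was displayed)
print(reported_total_batch_size(4, 1, 16, tp_size=2, fixed=True))   # 32 (effective global batch)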

Before the fix (the correct total bs is 64):

***** Running training *****
  Num examples = 52,002
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 4
  Total optimization steps = 2,436
  Number of trainable parameters = 3,500,047,360

Loss at step 20:
{'loss': 0.9875, 'grad_norm': 1.8784213066101074, 'learning_rate': 1.3920478471778e-05, 'epoch': 0.02}

After the fix:

  Num examples = 52,002
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 4
  Total optimization steps = 2,439
  Number of trainable parameters = 3,500,047,360

Loss at step 20:
{'loss': 0.9875, 'grad_norm': 1.8784213066101074, 'learning_rate': 1.3920478471778e-05, 'epoch': 0.02}

inkcherry avatar May 22 '25 08:05 inkcherry

Many thanks for your clarification and quick fix!

I have some other questions (not directly related to this issue though) regarding the tensor parallelism support in transformers and deepspeed:

Could you advise on the main differences between deepspeed's AutoTP and huggingface transformers' tp_plan (e.g., compatibility, efficiency, use cases)? For example, can I combine deepspeed's ZeRO stages with huggingface tp_plan? If yes, should I use [deepspeed ZeRO 1/2 + AutoTP] or [deepspeed ZeRO 1/2 + huggingface tp_plan]?

There seem to be multiple implementations of the same parallelism strategy, and I always get confused about which one to use and how they are compatible with each other.

cynricfu avatar May 22 '25 10:05 cynricfu

If I remember correctly, tp_plan is for vLLM integration, where it is used for inference-time sharding.

AutoTP was originally built for deepspeed's inference engine, much like vLLM. Recently, @inkcherry and the deepspeed team helped extend it to training.

skyshine102 avatar May 22 '25 11:05 skyshine102

@skyshine102 The Transformers team seems to have recently extended tp_plan to support training (even with 3D parallelism: DP + TP + CP): https://github.com/huggingface/transformers/pull/37877. But I am not sure if it is mature enough, and their example does not involve the trainer API or deepspeed integration.

cynricfu avatar May 22 '25 12:05 cynricfu

Hi @cynricfu, can we mark this issue as completed?

delock avatar Jun 10 '25 03:06 delock