
[BUG] Automatic batch size detection causes distributed deadlock in model parallel mode

Open huaanrui opened this issue 4 months ago • 0 comments

Describe the bug

When model parallelism is enabled and batch_size is not explicitly set in the config, the evaluation process hangs indefinitely. The log stops after the message "Detecting largest batch size...", indicating a distributed deadlock among the processes.

To Reproduce

  1. Configure an evaluation for a large, sharded model with model_parallel=True on a multi-GPU machine.
  2. Crucially, do not specify a batch_size in the TransformersModelConfig, to allow the automatic detection feature to trigger.
  3. Launch the script with `accelerate launch`, passing `--num_processes` greater than 1.
  4. Observe that the script and all worker processes hang permanently with no further output after the batch size detection log message appears.
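A minimal config sketch for the steps above (the model name is a placeholder, and the exact import path and field names may differ between lighteval versions):

```python
# Hypothetical repro sketch -- import path and field names are assumptions
# and may vary across lighteval versions.
from lighteval.models.transformers.transformers_model import TransformersModelConfig

config = TransformersModelConfig(
    model_name="my-org/my-large-sharded-model",  # placeholder model
    model_parallel=True,
    # batch_size intentionally omitted so automatic detection triggers
)
```

Launched under `accelerate launch` with more than one process, the run then hangs with no output after the "Detecting largest batch size..." log line.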

Expected behavior

The automatic batch size detection should either be compatible with model parallelism or be automatically skipped to prevent deadlocks. The evaluation should proceed with a safe default batch size (e.g., 1) or raise an error prompting the user to set a manual batch size when in model parallel mode.
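A minimal sketch of the proposed guard, using hypothetical function names (not the actual lighteval API): skip automatic detection whenever model parallelism is active, since each rank can take a different number of OOM-probe iterations and then desynchronize on the next collective op.

```python
def resolve_batch_size(configured_batch_size, model_parallel):
    """Return a usable batch size, skipping automatic detection when it
    could deadlock under model parallelism. Hypothetical helper, not
    lighteval's actual implementation."""
    if configured_batch_size is not None:
        # An explicit batch size always wins.
        return configured_batch_size
    if model_parallel:
        # Automatic detection is unsafe here: ranks may diverge during
        # the probing loop and block forever on a later collective.
        # Fall back to a safe default instead (or raise an error asking
        # the user to set batch_size manually).
        return 1
    return detect_largest_batch_size()


def detect_largest_batch_size():
    # Stand-in for the real OOM-probing search.
    return 8
```

With this guard, `resolve_batch_size(None, model_parallel=True)` returns the safe default of 1 instead of entering the detection loop, while single-process runs keep the existing auto-detection behavior.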

Version info

Built from source.

huaanrui · Aug 05 '25 01:08