[BUG] Automatic batch size detection causes distributed deadlock in model parallel mode
Describe the bug
When model parallelism is enabled and batch_size is not explicitly set in the config, the evaluation hangs indefinitely: log output stops after the message "Detecting largest batch size...", which points to a distributed deadlock among the worker processes.
To Reproduce
- Configure an evaluation for a large, sharded model with model_parallel=True on a multi-GPU machine.
- Crucially, do not specify a batch_size in the TransformersModelConfig, so that the automatic detection feature triggers (see the sketch below).
- Launch the script with accelerate launch --num_processes > 1.
- Observe that the script and all worker processes hang permanently, with no further output after the batch size detection log message appears.
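A minimal reproduction sketch follows; the import path and field names are assumptions based on recent lighteval versions and may differ in yours, and run_eval.py is a hypothetical wrapper script:

```python
# Reproduction sketch -- names below are assumptions, adjust to your version.
from lighteval.models.transformers.transformers_model import TransformersModelConfig

config = TransformersModelConfig(
    model_name="meta-llama/Llama-2-70b-hf",  # any model large enough to need sharding
    model_parallel=True,                     # shard the model across the visible GPUs
    # batch_size is deliberately left unset so automatic detection triggers;
    # the run then hangs right after "Detecting largest batch size..."
)

# Launched with more than one process, e.g.:
#   accelerate launch --num_processes=2 run_eval.py
```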
Expected behavior
The automatic batch size detection should either be compatible with model parallelism or be automatically skipped to prevent deadlocks. The evaluation should proceed with a safe default batch size (e.g., 1) or raise an error prompting the user to set a manual batch size when in model parallel mode.
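A minimal sketch of what such a guard could look like (Cfg and detect_largest_batch_size are illustrative stand-ins, not lighteval's actual internals, and the deadlock mechanism noted in the comment is a presumption):

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass
class Cfg:
    """Stand-in for the relevant TransformersModelConfig fields."""
    model_parallel: bool = False
    batch_size: int | None = None  # None -> auto-detect


def detect_largest_batch_size(config: Cfg) -> int:
    """Placeholder for the existing auto-detection path."""
    raise NotImplementedError


def resolve_batch_size(config: Cfg) -> int:
    if config.batch_size is not None and config.batch_size > 0:
        return config.batch_size  # an explicit user setting always wins
    if config.model_parallel:
        # Presumed cause of the hang: each rank probes for its own largest
        # batch, ranks can disagree, and the next collective op never lines
        # up. Skipping the probe avoids the deadlock entirely.
        logger.warning(
            "Automatic batch size detection is not supported with "
            "model_parallel=True; falling back to batch_size=1. "
            "Set batch_size explicitly to override."
        )
        return 1
    return detect_largest_batch_size(config)


print(resolve_batch_size(Cfg(model_parallel=True)))  # -> 1, after a warning
```

Raising a ValueError in that branch instead of defaulting to 1 would equally satisfy the expected behavior described above.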
Version info
Built from source.