
[BUG] Model Parallelism is incorrectly disabled on single multi-GPU machines

Open huaanrui opened this issue 4 months ago • 0 comments

Describe the bug

When running an evaluation with accelerate launch on a single machine with multiple GPUs, setting model_parallel=True in the TransformersModelConfig has no effect. The library logs that it is "not in a distributed setting" and forcibly overrides model_parallel to False.
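For context, the observed behavior is consistent with a guard of roughly this shape. This is a hypothetical illustration only, not lighteval's actual source; the function and variable names are made up:

```python
# Hypothetical illustration only -- NOT lighteval's actual source. A guard of
# this shape would produce the observed log: on a single node, a "number of
# machines" heuristic concludes we are not distributed, even with many GPUs.
import os

def resolve_model_parallel(requested: bool) -> bool:
    local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", "1"))  # set by accelerate launch
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    num_machines = world_size // max(local_world_size, 1)
    if num_machines <= 1:
        # A single node is treated as "not distributed", so the request is dropped.
        print("We are not in a distributed setting. Setting model_parallel to False.")
        return False
    return requested
```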

To Reproduce

1. Set up an evaluation script on a single machine with 2 or more GPUs.
2. In the script, create a TransformersModelConfig for a large model (e.g., 7B+) and explicitly set model_parallel=True (see the sketch after this list).
3. Launch the script with accelerate launch --num_processes=2 your_script.py.
4. Observe the log: We are not in a distributed setting. Setting model_parallel to False.
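A minimal reproduction sketch. The import path and field names follow recent lighteval versions and may differ in older releases; the model name is only an example:

```python
# your_script.py -- reproduction sketch
# Hedged: import path and field names may vary across lighteval versions.
from lighteval.models.transformers.transformers_model import TransformersModelConfig

config = TransformersModelConfig(
    model_name="meta-llama/Llama-2-7b-hf",  # example; any 7B+ model reproduces it
    model_parallel=True,                    # explicitly requested, but overridden
)

# ... build and run the evaluation pipeline with this config as usual ...
# launch with: accelerate launch --num_processes=2 your_script.py
```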

Expected behavior

lighteval should respect the model_parallel=True configuration in a single-node, multi-GPU environment. It should proceed to shard the model across the available GPUs using device_map="auto" instead of disabling the feature.
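For reference, the expected sharding corresponds to what plain transformers already does with device_map="auto". A sketch, with an example model name:

```python
# Sketch of the sharding lighteval is expected to apply when model_parallel=True:
# let accelerate place the model's layers across all visible GPUs.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # example model; any 7B+ checkpoint
    device_map="auto",            # shard across available GPUs
    torch_dtype=torch.float16,
)
# hf_device_map shows where each module landed, e.g. layers split
# between cuda:0 and cuda:1 on a 2-GPU machine.
print(model.hf_device_map)
```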

Version info

Built from source.

huaanrui, Aug 05 '25 01:08