device_map='auto' coupled with tp_plan='auto'
Feature request
Hi from pytorch distributed! Thanks for showcasing pytorch APIs
device_map='auto' and tp_plan='auto' are somehow coupled right now: https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L4280-L4283
If people load a model with device_map='auto', they end up with a TP-enabled model whose model.parameters() are DTensors. Once they train it, DDP complains about DTensor. If they set device_map='cuda' instead, the model is loaded as a plain module without TP and works with DDP.
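For reference, a minimal sketch of the behavior described above. This assumes a multi-GPU job launched with torchrun and a recent transformers/torch; the checkpoint name and dtype are placeholders, not from the original report:

```python
# Minimal sketch, assuming a multi-GPU run launched with torchrun.
import torch
from torch.distributed.tensor import DTensor  # lives in torch.distributed._tensor on older torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",   # example checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",           # under torchrun this currently also enables tp_plan="auto"
)

# Parameters come back sharded as DTensor, which plain DDP later rejects:
print(any(isinstance(p, DTensor) for p in model.parameters()))  # True in this setup

# Loading with device_map="cuda" instead keeps plain torch.Tensor parameters
# and works with DDP.
```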
Curious if we want to decouple device_map='auto' from tp_plan='auto'? I found this from pytorch users: https://github.com/pytorch/pytorch/issues/155463
cc @ArthurZucker @SunMarc if you have strong opinions on keeping device_map='auto' and tp_plan='auto' coupled
cc @sunmarc @Cyrilvallez
Hmm, no strong opinion on my side, but since it was added as a convenience when TP support was first introduced, it is now part of several examples and codebases... However, it's true that we did not think of this possibility - let's wait for @ArthurZucker to see if we want to start decoupling them
Should be "fixed" by this PR on the accelerate side. The error will still be there, it's just gonna be explanatory enough.
Just providing some additional info: if this issue is caused by PyTorch DDP not knowing about DTensor (i.e. the loaded model has TP + DP), there is currently one private API call that makes PyTorch DDP aware of DTensor: https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/parallel/ddp.py#L70 Feel free to try this if you want PyTorch DDP + TP to work instead of raising errors
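For anyone who wants to try it, a hedged sketch of how that private helper is typically used. The function at the linked line appears to be _pre_dp_module_transform; it is a private, unsupported API that may change, so treat the names below as assumptions:

```python
# Hedged sketch: wrap a TP-enabled module (DTensor parameters) with DDP using the
# private helper from torch/distributed/tensor/parallel/ddp.py. Private API, may change.
from torch.distributed.tensor.parallel.ddp import _pre_dp_module_transform
from torch.nn.parallel import DistributedDataParallel as DDP

# `model` is assumed to be a TP-enabled module whose parameters are DTensors.
_pre_dp_module_transform(model)   # swaps DTensor params for local tensors and installs hooks
ddp_model = DDP(model)            # DDP now only sees plain torch.Tensor parameters
```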
I think we should still rather raise an error. I think it breaks with our dataloader, plus I'd assume most users don't know that device_map="auto" applies TP.
It's probably better to decouple those! It can be very confusing to users that device_map works differently depending on whether they are running in distributed mode or not. Also, it will be better for users to know that they are actually using TP by specifying tp_plan or tp_size.
If people load a model with device_map='auto', they end up with a TP-enabled model whose model.parameters() are DTensors.
device_map = "auto" only affects the script if you do run torchdistributed, but yes it does make sense to decouple!
Can you take care of it @SunMarc ?
I am in favor of decoupling them as well. It's just that it's hard to associate device placement with tensor parallelism.
It would be appreciated if we can make the change. In the pytorch GitHub issue, people got confused by the DTensor error in DDP training; many people hit the same issue.
device_map = "auto" only affects the script if you do run torchdistributed, but yes it does make sense to decouple!
That might actually happen for training, since users will run the script with accelerate launch for DDP, for example.
Will open a PR!
My question is: why not add something so that DTensor and DDP can work together?
It's more about user expectation: when setting device_map='auto', are users actually asking to enable TP? I feel TP and device placement are two separate things.
I'm still getting an error related to this. Please see https://github.com/pytorch/pytorch/issues/155463#issuecomment-3060504993.
I still got this error when I use tp_plan='auto' and do not provide device_map:
AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!
The cause of this error is exactly what @twoflypig mentioned above. If I add device_map (set to any value), then I get this error instead:
ValueError: tp_plan and device_map are mutually exclusive. Choose either one for parallelization.
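For context, the two configurations from this comment as a hedged repro sketch (the model id is only an example, and each call is expected to fail with the corresponding error above):

```python
# Repro sketch for the two errors above; assumes a multi-GPU run under torchrun.
from transformers import AutoModelForCausalLM

# 1) tp_plan="auto" without device_map: loads, but DDP later fails with
#    "AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!"
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", tp_plan="auto")

# 2) adding device_map (any value) on top of tp_plan: fails at load time with
#    "ValueError: tp_plan and device_map are mutually exclusive."
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", tp_plan="auto", device_map="auto"
)
```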