
device_map='auto' coupled with tp_plan='auto'

Open weifengpy opened this issue 6 months ago • 4 comments

Feature request

Hi from pytorch distributed! Thanks for showcasing pytorch APIs

device_map='auto' and tp_plan='auto' are somehow coupled right now: https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L4280-L4283

If people load a model with device_map='auto', they end up with a TP-enabled model whose model.parameters() are DTensors. Once they train it, DDP complains about DTensor. If they set device_map='cuda' instead, the model is loaded as a plain module without TP and works with DDP.
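
For concreteness, a minimal sketch of the behavior described above, assuming a torchrun launch; the checkpoint name and the process-group/DDP setup are illustrative, not taken from this issue.

```python
# Sketch only: checkpoint name and DDP setup are illustrative assumptions.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Under a torch distributed launch, device_map="auto" currently also enables
# tp_plan="auto", so the loaded parameters come back as DTensors.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",  # placeholder checkpoint
    device_map="auto",
)

# Wrapping that model in DDP then complains about the DTensor parameters,
# whereas loading with device_map="cuda" gives a plain module DDP accepts.
ddp_model = DDP(model)
```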

Curious whether we want to decouple device_map='auto' from tp_plan='auto'? I found this report from PyTorch users: https://github.com/pytorch/pytorch/issues/155463

Motivation

d

Your contribution

d

weifengpy avatar Jun 11 '25 22:06 weifengpy

cc @ArthurZucker @SunMarc in case you have strong opinions on coupling device_map='auto' and tp_plan='auto'

weifengpy avatar Jun 11 '25 22:06 weifengpy

cc @sunmarc @Cyrilvallez

Rocketknight1 avatar Jun 12 '25 13:06 Rocketknight1

Hmm, no strong opinion on my side, but since it was added as a convenience when TP support first landed, it is now part of several examples and codebases... However, it's true that we did not think of this possibility - let's wait for @ArthurZucker to see if we want to start decoupling them

Cyrilvallez avatar Jun 12 '25 15:06 Cyrilvallez

Should be "fixed" by this PR on the accelerate side. The error will still be there, it's just gonna be explanatory enough.

S1ro1 avatar Jun 15 '25 03:06 S1ro1

Just providing some additional info: if this issue is caused by PyTorch DDP not knowing about DTensor (i.e. the loaded model has TP + DP), there is currently a private API call that makes PyTorch DDP aware of DTensor: https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/parallel/ddp.py#L70. Feel free to try it if you want PyTorch DDP + TP to work instead of raising errors.
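
A hedged sketch of how that private hook could be called before wrapping in DDP; the function name is read from the linked file (torch/distributed/tensor/parallel/ddp.py), it is private API and may change between PyTorch versions.

```python
# Sketch only: _pre_dp_module_transform is the private hook from the linked
# file; it converts DTensor parameters so that DDP can wrap a TP model.
import torch.distributed as dist
from torch.distributed.tensor.parallel.ddp import _pre_dp_module_transform
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",  # placeholder checkpoint
    device_map="auto",          # currently also applies tp_plan="auto" under torch distributed
)

_pre_dp_module_transform(model)  # make DDP aware of the DTensor parameters
ddp_model = DDP(model)
```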

wanchaol avatar Jun 16 '25 20:06 wanchaol

I think we should still rather raise an error. It breaks with our dataloader, and I'd assume most users don't know that device_map="auto" applies TP.

S1ro1 avatar Jun 16 '25 22:06 S1ro1

It's probably better to decouple those! It can be very confusing for users that device_map works differently depending on whether they are running in distributed mode or not. It will also be better for users to know that they are actually using TP by specifying tp_plan or tp_size.
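
A rough sketch of what the decoupled usage would look like, using the kwarg names mentioned in this thread; this illustrates the proposed behavior, not the current release.

```python
# Sketch of the proposed, decoupled behavior (placeholder checkpoint).
from transformers import AutoModelForCausalLM

# Device placement only: a plain module that works with DDP.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", device_map="auto")

# Tensor parallelism is opted into explicitly, so users know they get DTensor params.
tp_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", tp_plan="auto")
```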

SunMarc avatar Jun 17 '25 09:06 SunMarc

> If people load a model with device_map='auto', they end up with a TP-enabled model whose model.parameters() are DTensors.

device_map = "auto" only affects the script if you do run torchdistributed, but yes it does make sense to decouple! Can you take care of it @SunMarc ?

ArthurZucker avatar Jun 20 '25 00:06 ArthurZucker

I am in favor of decoupling them as well. It's just hard to associate device placement with tensor parallelism.

It would be appreciated if we can make the change. In the PyTorch GitHub issue, people got confused by the DTensor error in DDP training; many people hit the same issue.

weifengpy avatar Jun 20 '25 10:06 weifengpy

device_map = "auto" only affects the script if you do run torchdistributed, but yes it does make sense to decouple!

That might actually happen for training, since users will run the script with accelerate launch for DDP, for example.

Will open a PR!

SunMarc avatar Jun 20 '25 13:06 SunMarc

> My question is: why not add something so that DTensor and DDP can work together?

It's more about user expectation: when setting device_map='auto', are users asking to enable TP? I feel TP and device placement are two separate things.

weifengpy avatar Jun 23 '25 05:06 weifengpy

I'm still getting an error related to this. Please see https://github.com/pytorch/pytorch/issues/155463#issuecomment-3060504993.

I still get this error, `AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!`, if I use tp_plan='auto' and do not provide device_map. The cause of this error is exactly what @twoflypig mentioned above.

If I add device_map (set to any value), then I get this error: `ValueError: tp_plan and device_map are mutually exclusive. Choose either one for parallelization.`
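
For context, a minimal sketch of the second failure mode (placeholder checkpoint; the exact message depends on the installed transformers version):

```python
# Sketch only: passing both kwargs trips the mutual-exclusion check.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    tp_plan="auto",
    device_map="cuda",
)
# -> ValueError: `tp_plan` and `device_map` are mutually exclusive.
#    Choose either one for parallelization.
```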

mistysesame avatar Jul 11 '25 04:07 mistysesame