device_map='auto' coupled with tp_plan='auto'
Feature request
Hi from pytorch distributed! Thanks for showcasing pytorch APIs
device_map='auto' and tp_plan='auto' are somehow coupled right now: https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_utils.py#L4280-L4283
If people load a model with device_map='auto', they end up with a TP-enabled model whose model.parameters() are DTensors. Once they train it, DDP complains about DTensor. If they set device_map='cuda' instead, the model is loaded as a plain module without TP and works with DDP.
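For reference, a minimal sketch of the behavior described above. This assumes a multi-GPU job launched with torchrun and a recent transformers/torch; the checkpoint name and dtype are placeholders, not from the original report:

```python
# Minimal sketch, assuming a multi-GPU run launched with torchrun.
import torch
from torch.distributed.tensor import DTensor  # lives in torch.distributed._tensor on older torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",   # example checkpoint
    torch_dtype=torch.bfloat16,
    device_map="auto",           # under torchrun this currently also enables tp_plan="auto"
)

# Parameters come back sharded as DTensor, which plain DDP later rejects:
print(any(isinstance(p, DTensor) for p in model.parameters()))  # True in this setup

# Loading with device_map="cuda" instead keeps plain torch.Tensor parameters
# and works with DDP.
```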
Curious if we want to decouple device_map='auto' from tp_plan='auto'? I found this from pytorch users: https://github.com/pytorch/pytorch/issues/155463
cc @ArthurZucker @SunMarc if you have strong opinions on keeping device_map='auto' and tp_plan='auto' coupled
cc @sunmarc @Cyrilvallez
Hmm, no strong opinion on my side, but since it was added as a convenience when TP support was first introduced, it is now part of several examples and codebases... However, it's true that we did not think of this possibility - let's wait for @ArthurZucker to see if we want to start decoupling them
Should be "fixed" by this PR on the accelerate side. The error will still be there, it's just gonna be explanatory enough.
Just providing some additional info: if this issue is caused by PyTorch DDP not knowing about DTensor (i.e. the loaded model has TP + DP), there is currently one private API call that makes PyTorch DDP aware of DTensor: https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/parallel/ddp.py#L70 Feel free to try this if you want PyTorch DDP + TP to work instead of raising errors
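For anyone who wants to try it, a hedged sketch of how that private helper is typically used. The function at the linked line appears to be _pre_dp_module_transform; it is a private, unsupported API that may change, so treat the names below as assumptions:

```python
# Hedged sketch: wrap a TP-enabled module (DTensor parameters) with DDP using the
# private helper from torch/distributed/tensor/parallel/ddp.py. Private API, may change.
from torch.distributed.tensor.parallel.ddp import _pre_dp_module_transform
from torch.nn.parallel import DistributedDataParallel as DDP

# `model` is assumed to be a TP-enabled module whose parameters are DTensors.
_pre_dp_module_transform(model)   # swaps DTensor params for local tensors and installs hooks
ddp_model = DDP(model)            # DDP now only sees plain torch.Tensor parameters
```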
I think we should still rather raise an error. I think it breaks with our dataloader, plus I'd assume most users don't know that device_map="auto" applies TP.
It's probably better to decouple those! It can be very confusing to users that device_map works differently depending on whether they are running in distributed mode or not. Also, it will be better for users to know that they are actually using TP by specifying tp_plan or tp_size.
If people load a model with device_map='auto', they end up with a TP-enabled model whose model.parameters() are DTensors.
device_map = "auto" only affects the script if you do run torchdistributed, but yes it does make sense to decouple!
Can you take care of it @SunMarc ?
I am in favor of decoupling them as well. It's just that it's hard to associate device placement with tensor parallelism.
It would be appreciated if we can make the change. In the pytorch GitHub issue, people got confused by the DTensor error in DDP training; many people hit the same issue.
device_map = "auto" only affects the script if you do run torchdistributed, but yes it does make sense to decouple!
That might actually happen for training, since users will run the script with accelerate launch for DDP, for example.
Will open a PR!
My question is: why not add something so that DTensor and DDP can work together?
It's more about user expectation: when setting device_map='auto', are users actually asking to enable TP? I feel TP and device placement are two separate things.
I'm still getting an error related to this. Please see https://github.com/pytorch/pytorch/issues/155463#issuecomment-3060504993.
I still got this error when I use tp_plan='auto' and do not provide device_map:
AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!
The cause of this error is exactly what @twoflypig mentioned above. If I add device_map (set to any value), then I get this error instead:
ValueError: tp_plan and device_map are mutually exclusive. Choose either one for parallelization.
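For context, the two configurations from this comment as a hedged repro sketch (the model id is only an example, and each call is expected to fail with the corresponding error above):

```python
# Repro sketch for the two errors above; assumes a multi-GPU run under torchrun.
from transformers import AutoModelForCausalLM

# 1) tp_plan="auto" without device_map: loads, but DDP later fails with
#    "AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!"
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", tp_plan="auto")

# 2) adding device_map (any value) on top of tp_plan: fails at load time with
#    "ValueError: tp_plan and device_map are mutually exclusive."
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", tp_plan="auto", device_map="auto"
)
```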