
[Question] Why does torchrec explicitly wrap the DP lookup in DistributedDataParallel instead of letting DistributedModelParallel handle it?

Open shijieliu opened this issue 3 months ago • 0 comments

Hi team,

In ShardedEmbeddingBagCollection, I found that torchrec explicitly wraps the data-parallel (DP) lookup in DistributedDataParallel (code here). I also know that DistributedModelParallel has a ddp wrapper that wraps the non-sharded parts of the model, such as the MLP, and that wrapper is also built on DistributedDataParallel.
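For reference, here is a minimal sketch of the setup I mean. It should run standalone as a single-process gloo group; the table name `t1`, feature name `f1`, and all sizes are made up:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torchrec
from torchrec.distributed import DistributedModelParallel

# Single-process gloo group so the sketch runs standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)


class Model(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        # Sharded by DMP; if the planner picks ShardingType.DATA_PARALLEL
        # for this table, the lookup inside the resulting
        # ShardedEmbeddingBagCollection gets its own DDP wrapper.
        self.ebc = torchrec.EmbeddingBagCollection(
            tables=[
                torchrec.EmbeddingBagConfig(
                    name="t1",
                    embedding_dim=16,
                    num_embeddings=1000,
                    feature_names=["f1"],
                )
            ],
            device=torch.device("meta"),
        )
        # Non-sharded dense part; wrapped in DDP by DMP's ddp wrapper.
        self.mlp = nn.Linear(16, 1)


model = DistributedModelParallel(Model(), device=torch.device("cpu"))
```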

So I am wondering why we choose to explicitly wrap the DP lookup instead of letting the ddp wrapper in DistributedModelParallel process the DP lookup and the MLP together. Is there any hidden restriction?
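To make the question concrete, here is a plain-PyTorch sketch of the two wiring alternatives. `Net`, `lookup`, and `mlp` are made-up stand-ins, not torchrec modules:

```python
import os

import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo group so the sketch runs standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)


class Net(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.lookup = nn.EmbeddingBag(1000, 16)  # stand-in for the DP lookup
        self.mlp = nn.Linear(16, 1)              # stand-in for the dense part


# What torchrec does today, as I understand it: two separate DDP
# instances, one created inside ShardedEmbeddingBagCollection for the
# DP lookup and one created by DMP's ddp wrapper for the dense modules.
net = Net()
ddp_lookup = DDP(net.lookup)
ddp_mlp = DDP(net.mlp)

# What this question proposes: a single DDP wrapper covering both.
ddp_all = DDP(Net())
```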

Since DistributedDataParallel relies on .named_parameters() (code here), I am not sure whether overriding .named_parameters() on ShardedEmbeddingBagCollection could enable the ddp wrapper in DistributedModelParallel to process the DP lookup as well.
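To illustrate what I mean, here is a hypothetical module whose named_parameters() hides one submodule from a would-be DDP wrapper; `FilteredModule` and its members are made up, not torchrec code:

```python
from typing import Iterator, Tuple

import torch.nn as nn


class FilteredModule(nn.Module):
    """Hypothetical module that exposes only part of itself to DDP."""

    def __init__(self) -> None:
        super().__init__()
        self.dense = nn.Linear(4, 4)         # should be synced by DDP
        self.lookup = nn.EmbeddingBag(8, 4)  # stand-in for a sharded lookup

    def named_parameters(
        self, prefix: str = "", recurse: bool = True, remove_duplicate: bool = True
    ) -> Iterator[Tuple[str, nn.Parameter]]:
        # Expose only the dense part. DDP builds its reducer from the
        # module's named_parameters(), so a DDP wrapper around this
        # module would skip gradient sync for self.lookup.
        p = prefix + ("." if prefix else "") + "dense"
        yield from self.dense.named_parameters(prefix=p, recurse=recurse)


m = FilteredModule()
print([name for name, _ in m.named_parameters()])
# ['dense.weight', 'dense.bias'] -- lookup.weight is hidden
```

Relatedly, as far as I can tell DDP also has a params-to-ignore mechanism (the static method DistributedDataParallel._set_params_and_buffers_to_ignore_for_model), which looks like another way to keep specific tensors out of an outer DDP wrapper without overriding named_parameters().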

shijieliu · Mar 27 '24 05:03