Henry Tsang
Linking related issue: https://github.com/pytorch/torchrec/issues/1221
@kiukchung Thanks for the quick reply!

cat hello_world.py

```
$ cat hello_world.py
print("Hello, TorchX!")
```

cat test2.py

```
$ cat test2.py
import torch
print(torch.__version__)
```

running test2.py

```
$ python...
```
fyi https://github.com/pytorch/pytorch/issues/116423
@tiankongdeguiji I suspect this isn't really a problem. I tested it with the NCCL `model_parallel` `test_sharding_dp` test by printing the state_dict out, and found them to be the same. I suspect...
@tiankongdeguiji can you try to inspect the state dict of the local model right after `local_model = DistributedModelParallel(`, i.e. before the `copy_state_dict`? When I ran it, it showed that those parameters are...
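For reference, something along these lines is what I mean (a minimal sketch; `model`, `env`, `sharders`, and `device` come from the surrounding test setup and are just placeholders here, not the exact names in `test_sharding_dp`):

```python
import torch.distributed as dist
from torchrec.distributed.model_parallel import DistributedModelParallel

# `model`, `env`, `sharders`, `device` are assumed to come from the test harness.
local_model = DistributedModelParallel(model, env=env, sharders=sharders, device=device)

# Print the state dict on each rank *before* copy_state_dict is applied,
# to see whether the freshly initialized DP table weights already differ across ranks.
for name, tensor in local_model.state_dict().items():
    print(dist.get_rank(), name, tensor)
```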
@tiankongdeguiji Okay, I think you are 100% right. Sorry, I didn't understand your point about the torch seed part earlier. I looked into it. A few points: 1. The problem isn't...
@tiankongdeguiji fyi I've already raised the issue with the team. It will probably take a bit. In the meantime, any suggestions on how to fix this in a nice way?...
@tiankongdeguiji fyi landed the fix: https://github.com/pytorch/torchrec/commit/cc482f8a5f80fd8975de82ad22b65cda3348d872 Basically, every time we call `reset_parameters`, we also broadcast the re-initialized DP tables from rank 0 to all other ranks. Though be sure...
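The broadcast itself is conceptually just this (a minimal sketch; the helper name and the init range are illustrative, not the actual code in the commit):

```python
import torch
import torch.distributed as dist
from typing import List

def reset_and_broadcast_dp_tables(dp_table_weights: List[torch.Tensor]) -> None:
    # Hypothetical helper: re-initialize each data-parallel table locally, then
    # overwrite every rank's copy with rank 0's values so all replicas match.
    for weight in dp_table_weights:
        torch.nn.init.uniform_(weight, -0.01, 0.01)  # illustrative init, not TorchRec's actual one
        dist.broadcast(weight, src=0)
```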
for context, I think DLRM initializes it with `1 / num_embeddings`: https://github.com/facebookresearch/dlrm/blob/main/dlrm_s_pytorch.py#L281
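Roughly speaking (illustrative sketch only; check the linked line for the exact expression DLRM uses):

```python
import torch

def dlrm_style_init(num_embeddings: int, embedding_dim: int) -> torch.Tensor:
    # Uniform init whose scale shrinks with the number of embeddings,
    # i.e. bounded by 1 / num_embeddings as mentioned above.
    bound = 1.0 / num_embeddings
    return torch.empty(num_embeddings, embedding_dim).uniform_(-bound, bound)
```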
@di-wnd I think @PaulZhang12 brought it up, and it seems the result of that discussion is that it's better to change the user input than to change the default