Henry Tsang
Linking related issue: https://github.com/pytorch/torchrec/issues/1221
@kiukchung Thanks for the quick reply!

cat hello_world.py

```
$ cat hello_world.py
print("Hello, TorchX!")
```

cat test2.py

```
$ cat test2.py
import torch
print(torch.__version__)
```

running test2.py

```
$ python...
```
fyi https://github.com/pytorch/pytorch/issues/116423
@tiankongdeguiji I suspect this isn't really a problem. I tested it with the NCCL `model_parallel` `test_sharding_dp` test by printing the state_dict out, and found them to be the same. I suspect...
@tiankongdeguiji can you try to inspect the state dict of the local model right after `local_model = DistributedModelParallel(`, i.e. before the `copy_state_dict`? When I ran it, it showed that those parameters are...
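For reference, something along these lines is what I mean (a minimal sketch; `model`, `env`, `sharders`, and `device` come from the surrounding test setup and are just placeholders here, not the exact names in `test_sharding_dp`):

```python
import torch.distributed as dist
from torchrec.distributed.model_parallel import DistributedModelParallel

# `model`, `env`, `sharders`, `device` are assumed to come from the test harness.
local_model = DistributedModelParallel(model, env=env, sharders=sharders, device=device)

# Print the state dict on each rank *before* copy_state_dict is applied,
# to see whether the freshly initialized DP table weights already differ across ranks.
for name, tensor in local_model.state_dict().items():
    print(dist.get_rank(), name, tensor)
```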
@tiankongdeguiji Okay, I think you are 100% right. Sorry, I didn't understand your point about the torch seed part earlier. I looked into it. A few points: 1. The problem isn't...
@tiankongdeguiji fyi I've already raised the issue with the team. It will probably take a bit. In the meantime, any suggestions on how to fix this in a nice way?...
@tiankongdeguiji fyi landed the fix: https://github.com/pytorch/torchrec/commit/cc482f8a5f80fd8975de82ad22b65cda3348d872 Basically, every time we call `reset_parameters`, we also broadcast the re-initialized DP tables from rank 0 to all other ranks. Though be sure...
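The broadcast itself is conceptually just this (a minimal sketch; the helper name and the init range are illustrative, not the actual code in the commit):

```python
import torch
import torch.distributed as dist
from typing import List

def reset_and_broadcast_dp_tables(dp_table_weights: List[torch.Tensor]) -> None:
    # Hypothetical helper: re-initialize each data-parallel table locally, then
    # overwrite every rank's copy with rank 0's values so all replicas match.
    for weight in dp_table_weights:
        torch.nn.init.uniform_(weight, -0.01, 0.01)  # illustrative init, not TorchRec's actual one
        dist.broadcast(weight, src=0)
```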
for context, I think DLRM initializes it with `1 / num_embeddings`: https://github.com/facebookresearch/dlrm/blob/main/dlrm_s_pytorch.py#L281
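Roughly speaking (illustrative sketch only; check the linked line for the exact expression DLRM uses):

```python
import torch

def dlrm_style_init(num_embeddings: int, embedding_dim: int) -> torch.Tensor:
    # Uniform init whose scale shrinks with the number of embeddings,
    # i.e. bounded by 1 / num_embeddings as mentioned above.
    bound = 1.0 / num_embeddings
    return torch.empty(num_embeddings, embedding_dim).uniform_(-bound, bound)
```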
@di-wnd I think @PaulZhang12 brought it up, and it seems the result of that discussion is that it's better to change the user input than to change the default