Junjie Wang

24 comments by Junjie Wang

Looks like TorchVision's CUDA version does not match that of PyTorch?

Pausing this for now; we need to verify with TorchRec how it is used.

Hey @HamidShojanazeri, looking at the error log, you are using TP size = 1. Can you try a TP size greater than 1, such as 2? Because if TP size =...

@Vatshank Can you try to use ColwiseParallel and RowwiseParallel instead? You need to specify the path of the model though. Let me know if that helps.

OK, upon further investigation, below are my findings for the last two test errors: 1. It is Python 3.11 specific. 2. It is related to this IntFlag: https://github.com/pytorch/pytorch/blob/master/torch/distributed/elastic/multiprocessing/api.py#L100-L104 So...
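For readers following along, here is a minimal sketch of why an `IntFlag` can behave differently across Python versions. The `Std` class below is a hypothetical stand-in written for illustration; it is not copied from the linked file, and the thread preview does not say which behaviour caused the test errors. One documented change in Python 3.11 is that composite and zero-valued flag members are treated as aliases and are skipped when iterating the class, so code that relies on `list(SomeFlag)` can see different members on 3.10 and 3.11. Selecting members explicitly avoids depending on that.

```python
import sys
from enum import IntFlag


class Std(IntFlag):
    """Hypothetical stand-in: an IntFlag with a zero member and a composite member."""
    NONE = 0
    OUT = 1
    ERR = 2
    ALL = OUT | ERR  # composite value (3), not a single bit


def single_bit_members(flag_cls):
    """Return members whose value is exactly one bit, on any Python version."""
    return [
        m for m in flag_cls.__members__.values()
        if m.value and m.value & (m.value - 1) == 0
    ]


if __name__ == "__main__":
    print(sys.version_info[:2])
    # Class iteration: the result depends on the Python version in use.
    print(list(Std))
    # Explicit selection: the same result regardless of version.
    print(single_bit_members(Std))
```

Running this under both 3.10 and 3.11 is a quick way to check whether a piece of code depends on the iteration behaviour.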

> @fduwjj I think you've nearly tracked it down there: as far as I can tell, this is _not_ a bug in Python 3.11, but a subtle behaviour relating to...

@jayaddison Thanks for providing such a detailed explanation. This is indeed something I wasn't aware of. Thanks!!

Two changes have landed, so I am closing this for now. Feel free to reopen it.