
bps.push_pull gives wrong results for PyTorch

Open jasperzhong opened this issue 4 years ago • 1 comments

When I train MNIST with PyTorch, the reported accuracy and loss look weird. I then printed the values before push_pull.

[Screenshot: Screen Shot 2020-07-18 at 8 24 08 PM]

So I guess this is because push_pull gives wrong results for CPU tensors. https://github.com/bytedance/byteps/blob/cf020c97fc718ca209cbadbfac4cffa5e49d7d21/example/pytorch/train_mnist_byteps.py#L133

This is actually a known bug that was also found in MXNet, which I mentioned before in #247. The current workaround is to move the tensor to CUDA:

tensor = torch.tensor(val).cuda()
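
For context, here is a minimal sketch of how that workaround could sit inside a metric-averaging helper like the one in the linked example script. It assumes `push_pull` is `byteps.torch.push_pull` (which averages across workers by default); `to_gpu_tensor` is a hypothetical helper introduced only so that the CPU-to-GPU move is explicit, and the real script may be structured differently.

```python
def metric_average(val, name, push_pull, to_gpu_tensor):
    """Average a scalar metric across workers using a GPU tensor.

    push_pull     -- assumed to be byteps.torch.push_pull
    to_gpu_tensor -- hypothetical helper, e.g. lambda v: torch.tensor(v).cuda()
    """
    # Workaround: build the tensor on the GPU so push_pull never sees a CPU tensor.
    tensor = to_gpu_tensor(val)
    avg_tensor = push_pull(tensor, name=name)
    # Convert the reduced tensor back to a plain Python number.
    return avg_tensor.item()


# Intended wiring in a BytePS training script (shown as comments, not executed here):
#   import torch
#   import byteps.torch as bps
#   bps.init()
#   test_loss = metric_average(test_loss, "avg_loss", bps.push_pull,
#                              lambda v: torch.tensor(v).cuda())
```

Passing the two callables in keeps the sketch independent of a GPU or a BytePS install, so the same function can be exercised with stand-ins when checking the surrounding logic.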

jasperzhong, Jul 18 '20 12:07

We have also found that push_pull on CPU tensors may lead to instability. We will check and fix it.

bobzhuyb, Jul 21 '20 07:07