
PyTorch version may have an effect on training reproducibility

Shuai-Xie opened this issue 3 years ago · 4 comments

I am trying to figure out why Bare Metal (BM) and PyTorchJob (PJ) give different training results, as reported in https://github.com/kubeflow/pytorch-operator/issues/354#issue-999999536.

I have now found that PyTorch 1.8.0 and 1.9.0 produce different training results on both BM and PJ.

Experiment settings

  • Two V100 GPU machines (48/49), each with 4 cards, giving 8 GPUs in total.
  • DDP training of ResNet-18 on the MNIST dataset with batch_size=256 and epochs=1.
  • Random seed fixed to 1 (a minimal setup sketch follows this list).
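
For context, the seeding and process-group setup roughly looks like the sketch below. It is a minimal sketch, not the exact script from my repo; set_seed and init_ddp are illustrative names.

import os
import random

import numpy as np
import torch
import torch.distributed as dist

def set_seed(seed: int = 1) -> None:
    # Seed every RNG that can influence training so runs are comparable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic cuDNN kernels; even so, results can still differ
    # across PyTorch/cuDNN releases because kernel selection changes.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

def init_ddp() -> int:
    # The launcher (torch.distributed.launch or the PyTorchJob operator)
    # provides MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE via environment
    # variables, so the default env:// init method works here.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    return local_rank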

BM

# torch             1.8.0+cu111
# torchvision       0.9.0+cu111
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.2320
Train Epoch: 0 [20/30]  loss=0.8108
Test Epoch: 0 [0/40]    acc=33.5938
Test Epoch: 0 [10/40]   acc=35.5469
Test Epoch: 0 [20/40]   acc=34.7098
Test Epoch: 0 [30/40]   acc=35.0302
Test Epoch: 0, acc=35.7200
test acc: 35.72, best acc: 35.72
training seconds: 19.506625175476074
best_acc: 35.72

# torch             1.9.0+cu111
# torchvision       0.10.0+cu111
Train Epoch: 0 [0/30]   loss=2.5137
Train Epoch: 0 [10/30]  loss=2.4295
Train Epoch: 0 [20/30]  loss=0.9048
Test Epoch: 0 [0/40]    acc=63.2812
Test Epoch: 0 [10/40]   acc=64.9858
Test Epoch: 0 [20/40]   acc=63.8021
Test Epoch: 0 [30/40]   acc=63.9365
Test Epoch: 0, acc=64.1200
test acc: 64.12, best acc: 64.12
training seconds: 18.64181399345398
best_acc: 64.12

PJ

I built Docker images from different versions of the official PyTorch base images.
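
Roughly, each image is built like the sketch below (a minimal Dockerfile sketch; mnist_ddp.py is a placeholder name, the actual entrypoint in my repo may differ):

# Minimal sketch: swap the base-image tag to compare PyTorch versions.
FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime
ADD mnist_ddp.py /opt/mnist_ddp.py
ENTRYPOINT ["python", "/opt/mnist_ddp.py"]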

# FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime
Train Epoch: 0 [0/30]   loss=2.5691
Train Epoch: 0 [10/30]  loss=2.5132
Train Epoch: 0 [20/30]  loss=0.7198
Test Epoch: 0 [0/40]    acc=38.2812
Test Epoch: 0 [10/40]   acc=40.9091
Test Epoch: 0 [20/40]   acc=39.8996
Test Epoch: 0 [30/40]   acc=40.4738
Test Epoch: 0, acc=40.9600
test acc: 40.96, best acc: 40.96
training seconds: 20.630347967147827
best_acc: 40.96

# FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime
Train Epoch: 0 [0/30]   loss=2.5137
Train Epoch: 0 [10/30]  loss=2.3939
Train Epoch: 0 [20/30]  loss=0.6989
Test Epoch: 0 [0/40]    acc=67.5781
Test Epoch: 0 [10/40]   acc=69.2827
Test Epoch: 0 [20/40]   acc=68.4152
Test Epoch: 0 [30/40]   acc=67.8805
Test Epoch: 0, acc=67.9700
test acc: 67.97, best acc: 67.97
training seconds: 26.458710193634033
best_acc: 67.97

Please let me know if I have written the code incorrectly. I've posted it here: https://github.com/Shuai-Xie/mnist-pytorchjob-example.

Shuai-Xie avatar Sep 21 '21 03:09 Shuai-Xie

PyTorch 1.9.0 introduces elastic distributed training, but it is not yet stable. Maybe you can wait until 1.9.1 is released and try again.

gaocegege avatar Sep 21 '21 07:09 gaocegege

The version in the title means the version of PyTorch, not PyTorchJob. Let's pin it to 1.8.0 and see where the difference comes from.

zw0610 avatar Sep 21 '21 08:09 zw0610

The version in the title means the version of PyTorch, not PyTorchJob. Let's pin it to 1.8.0 and see where the difference comes from.

Oh yes, I'm sorry for this mistake. I'll change it right now.

Shuai-Xie avatar Sep 21 '21 12:09 Shuai-Xie

Thanks for your kind reply @zw0610 @gaocegege.

I'll pin the PyTorch version to 1.8.0 in the following experiments, and I look forward to figuring out this problem soon with your help.
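
On the BM side, pinning is just a matter of installing the matching cu111 wheels, for example (a sketch, assuming the standard wheel index):

pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 \
    -f https://download.pytorch.org/whl/torch_stable.html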

By the way, the example https://github.com/kubeflow/pytorch-operator/blob/master/examples/mnist/Dockerfile uses pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime.

Many Thanks.

Shuai-Xie avatar Sep 21 '21 12:09 Shuai-Xie