
Slower than TF?

Open chao1224 opened this issue 7 years ago • 3 comments

I'm testing the speed-up of ResNet on TF and PyTorch.

In TF it typically converges within 80k steps, i.e. 80k batches; with batch-size=128 that works out to roughly 205 epochs in PyTorch.
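The step-to-epoch conversion can be checked with quick arithmetic (CIFAR-10 has 50,000 training images; the batch size is the one quoted above):

```python
# CIFAR-10 training-set size and the batch size from the issue.
train_size = 50_000
batch_size = 128
total_steps = 80_000

# One epoch = one pass over the training set.
steps_per_epoch = train_size / batch_size   # 390.625 batches per epoch
epochs = total_steps / steps_per_epoch      # 80k steps -> ~204.8 epochs

print(round(epochs))  # ~205
```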

One interesting thing is that in TF I can finish 80k steps in about 6 hours, but in PyTorch, running 200 epochs took me around 13 hours. And that would grow to around 20 hours if I wanted to test 300 epochs.

I thought PyTorch should be much faster than TF. Does anyone know a solution to this? BTW, I'm using an EC2 g2.2xlarge.

Here is the ResNet20 TF implementation

chao1224 avatar Nov 23 '17 06:11 chao1224

This is interesting. Do you have per-epoch timings? That would be easier to compare I think.
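Per-epoch wall-clock times can be collected with a small stdlib-only wrapper. This is just a sketch; `train_epoch` and the per-batch `model_step` callable are hypothetical stand-ins for whatever training loop is being measured:

```python
import time

def train_epoch(model_step, num_batches):
    """Placeholder for one training epoch: run model_step once per batch."""
    for _ in range(num_batches):
        model_step()

def timed_epochs(model_step, num_batches, num_epochs):
    """Return a list of per-epoch wall-clock durations in seconds."""
    timings = []
    for _ in range(num_epochs):
        start = time.perf_counter()
        train_epoch(model_step, num_batches)
        timings.append(time.perf_counter() - start)
    return timings

# 391 batches/epoch matches CIFAR-10 at batch size 128 (ceil(50000/128)).
times = timed_epochs(lambda: None, num_batches=391, num_epochs=3)
print([f"{t:.4f}s" for t in times])
```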

TheShadow29 avatar Jun 19 '18 09:06 TheShadow29

@TheShadow29 Thanks for replying.

I noticed the author has been updating this repo. One thing I hadn't tried before is setting `cudnn.benchmark = True`, which helps a lot.

Besides, this may somehow relate to the machine. I have two GPU cards in my local machine, and with the latest version it took ~20 s/epoch on a 1080 and ~90 s/epoch on a K40. I guess the run I tested before was also affected by this. (EC2 instances can sometimes be a little slow.)
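For reference, a minimal sketch of enabling the cuDNN autotuner in PyTorch (assumes a CUDA-capable machine; the flag is a no-op on CPU):

```python
import torch
import torch.backends.cudnn as cudnn

# Let cuDNN benchmark the available convolution algorithms for each
# layer shape and cache the fastest one. This pays off when input sizes
# are fixed across iterations, as with CIFAR-10's 32x32 images.
if torch.cuda.is_available():
    cudnn.benchmark = True
```

Note that the autotuning happens on the first few batches, so the first iterations may be slower before the speed-up kicks in.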

chao1224 avatar Jun 20 '18 03:06 chao1224

> One thing I didn't try before is using `cudnn.benchmark = True`, which helps a lot.

Woah, I didn't know that. I'm slightly unfamiliar with EC2 instances. Does g2.2xlarge have one K40? Also, which ResNet are you using in PyTorch (ResNet18/34)? I am currently trying a MobileNet implementation (https://github.com/TheShadow29/pyt-mobilenet). I will try to play with the ResNet models as well and update the results.

TheShadow29 avatar Jun 20 '18 05:06 TheShadow29