
Slower than TF?

Open chao1224 opened this issue 7 years ago • 3 comments

I'm testing the speed-up of ResNet on TF and PyTorch.

In TF it typically converges within 80k steps, i.e. 80k batches; with batch-size=128 that works out to roughly 205 epochs in PyTorch.
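The step-to-epoch conversion can be checked with quick arithmetic (CIFAR-10 has 50,000 training images; the batch size is the one quoted above):

```python
# CIFAR-10 training-set size and the batch size from the issue.
train_size = 50_000
batch_size = 128
total_steps = 80_000

# One epoch = one pass over the training set.
steps_per_epoch = train_size / batch_size   # 390.625 batches per epoch
epochs = total_steps / steps_per_epoch      # 80k steps -> ~204.8 epochs

print(round(epochs))  # ~205
```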

One interesting thing is that in TF I can finish 80k steps in about 6 hours, but in PyTorch, running 200 epochs took me around 13 hours. And that would grow to around 20 hours if I wanted to test 300 epochs.

I thought PyTorch should be much faster than TF. Does anyone know a solution to this? BTW, I'm using an EC2 g2.2xlarge.

Here is the ResNet20 TF implementation

chao1224 avatar Nov 23 '17 06:11 chao1224

This is interesting. Do you have per-epoch timings? That would be easier to compare I think.
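Per-epoch wall-clock times can be collected with a small stdlib-only wrapper. This is just a sketch; `train_epoch` and the per-batch `model_step` callable are hypothetical stand-ins for whatever training loop is being measured:

```python
import time

def train_epoch(model_step, num_batches):
    """Placeholder for one training epoch: run model_step once per batch."""
    for _ in range(num_batches):
        model_step()

def timed_epochs(model_step, num_batches, num_epochs):
    """Return a list of per-epoch wall-clock durations in seconds."""
    timings = []
    for _ in range(num_epochs):
        start = time.perf_counter()
        train_epoch(model_step, num_batches)
        timings.append(time.perf_counter() - start)
    return timings

# 391 batches/epoch matches CIFAR-10 at batch size 128 (ceil(50000/128)).
times = timed_epochs(lambda: None, num_batches=391, num_epochs=3)
print([f"{t:.4f}s" for t in times])
```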

TheShadow29 avatar Jun 19 '18 09:06 TheShadow29

@TheShadow29 Thanks for replying.

I noticed the author has been updating this repo. One thing I hadn't tried before is setting `cudnn.benchmark = True`, which helps a lot.

Besides, this may somehow relate to the machine. I have two GPU cards in my local machine, and with the latest version it took ~20 s/epoch on a 1080 and ~90 s/epoch on a K40. I guess the run I tested before was also affected by this. (EC2 instances can sometimes be a little slow.)
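For reference, a minimal sketch of enabling the cuDNN autotuner in PyTorch (assumes a CUDA-capable machine; the flag is a no-op on CPU):

```python
import torch
import torch.backends.cudnn as cudnn

# Let cuDNN benchmark the available convolution algorithms for each
# layer shape and cache the fastest one. This pays off when input sizes
# are fixed across iterations, as with CIFAR-10's 32x32 images.
if torch.cuda.is_available():
    cudnn.benchmark = True
```

Note that the autotuning happens on the first few batches, so the first iterations may be slower before the speed-up kicks in.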

chao1224 avatar Jun 20 '18 03:06 chao1224

> One thing I didn't try before is using `cudnn.benchmark = True`, which helps a lot.

Woah, I didn't know that. I'm slightly unfamiliar with EC2 instances. Does g2.2xlarge have one K40? Also, which ResNet are you using in PyTorch (ResNet18/34)? I am currently trying a MobileNet implementation (https://github.com/TheShadow29/pyt-mobilenet). I will try to play with the ResNet models as well and update the results.

TheShadow29 avatar Jun 20 '18 05:06 TheShadow29