                        Slower than TF?
I'm comparing the training speed of ResNet on TF and PyTorch.
In TF it typically converges within 80k steps, i.e. 80k batches; with batch-size=128 that works out to roughly 205 epochs in PyTorch (arithmetic below).
One interesting thing: in TF I can finish 80k steps in about 6 hours, but in PyTorch running 200 epochs took me around 13 hours, and that grows to around 20 hours if I want to test 300 epochs.
I thought PyTorch was supposed to be much faster than TF. Does anyone know a solution to this? BTW, I'm using ec2 g2.2xlarge.
Here is the ResNet20 TF implementation
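For reference, the step-to-epoch arithmetic (assuming CIFAR-10's 50,000 training images):

```python
# 80k TF steps at batch size 128, measured against CIFAR-10's 50k training images.
steps = 80_000
batch_size = 128
train_set_size = 50_000
epochs = steps * batch_size / train_set_size
print(epochs)  # 204.8 -> roughly 205 epochs
```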
This is interesting. Do you have per-epoch timings? Those would make the comparison easier, I think.
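Something like the following would give comparable numbers (a minimal sketch; `model`, `train_loader`, `optimizer`, and `criterion` stand in for whatever your training setup already uses):

```python
import time
import torch

def timed_epochs(model, train_loader, optimizer, criterion, num_epochs, device='cuda'):
    # Plain training loop that prints wall-clock time per epoch.
    model.train()
    for epoch in range(num_epochs):
        start = time.time()
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
        torch.cuda.synchronize()  # flush queued GPU work before reading the clock
        print(f'epoch {epoch}: {time.time() - start:.1f} s')
```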
@TheShadow29 Thanks for replying.
I noticed the author has been updating this repo. One thing I didn't try before is setting cudnn.benchmark = True, which helps a lot (sketch below).
Besides that, it may relate to the machine. I have two GPU cards in my local machine, and with the latest version it took ~20 s/epoch on a 1080 and ~90 s/epoch on a K40. I guess the run I tested before was also affected by this (ec2 instances can sometimes be a little slow).
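For anyone landing here, the flag is a one-liner; cuDNN benchmarks candidate convolution algorithms and caches the fastest, which pays off when input shapes are fixed, as they are for CIFAR:

```python
import torch.backends.cudnn as cudnn

# Auto-tune convolution algorithms; a safe speed-up when input sizes don't vary.
cudnn.benchmark = True
```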
> One thing I didn't try before is setting cudnn.benchmark = True, which helps a lot.
Woah, I didn't know that. I'm slightly unfamiliar with ec2 instances. Does g2.2xlarge have one K40? Also, which ResNet are you using in PyTorch (ResNet18/34)? I am currently trying a MobileNet implementation (https://github.com/TheShadow29/pyt-mobilenet). I will try to play with the ResNet models as well and update the results.