PyTorch on CPU is slower than TF

Open ryanjulian opened this issue 4 years ago • 4 comments

See https://github.com/pytorch/pytorch/issues/975 for more info

PyTorch TRPO appears 50% slower than TF. Not sure about PPO, but I expect the wall-clock time gap will be the same.

To fix this issue, make PyTorch perform at least as well as TF, or confirm that we've done the best we can on CPU with PyTorch.

ryanjulian avatar Nov 15 '19 00:11 ryanjulian

The issue you linked from pytorch was fixed quite a while ago. I think if garage's pytorch is slower than TF then most likely it has something to do with our implementation.

@lywong92 Since you added pytorch DDPG and PPO, have you observed any performance difference against TF? That would tell us whether this has something to do with pytorch in general or just TRPO itself.

naeioi avatar Nov 18 '19 07:11 naeioi

I didn't pay much attention to the actual time it took to run DDPG in torch vs tf. Are we comparing the total time it takes to run the algorithms with the same parameters here?

lywong92 avatar Nov 20 '19 22:11 lywong92
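
For anyone trying to reproduce the gap, a minimal sketch of a wall-clock comparison might look like the following. It is not garage's benchmark code: `time_wall_clock` and `slowdown` are hypothetical helper names, and the callables passed in are assumed to be whatever launches one training run (for example TRPO in torch vs. tf) with identical hyperparameters and seeds.

```python
import statistics
import time


def time_wall_clock(fn, repeats=5, warmup=1):
    """Return the median wall-clock seconds of calling fn() `repeats` times.

    `fn` is assumed to be a zero-argument callable that runs one
    complete experiment (hypothetical; not part of garage).
    """
    for _ in range(warmup):
        fn()  # warm-up runs are not timed (caches, lazy initialisation)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def slowdown(candidate_seconds, baseline_seconds):
    """Relative slowdown of candidate vs. baseline (0.5 means 50% more time)."""
    return candidate_seconds / baseline_seconds - 1.0
```

Note that "50% slower" is ambiguous (50% more wall-clock time vs. half the throughput), so it helps to report the raw seconds for both frameworks alongside the ratio.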

This might be unrelated, but we found big performance differences between llvm-openmp and intel-openmp. Weirdly enough, we see this even when we use the GPU for both the forward and backward passes.

Some strange dependency issue in conda is causing this (e.g. the newest version of libgcc runtime depends on package _openmp_mutex which brings along the llvm openmp runtime instead of the Intel one). Worth checking which OpenMP implementation you're using.

alex-petrenko avatar Feb 25 '20 07:02 alex-petrenko
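
One rough way to check which OpenMP runtime ended up in a process is to look at the shared libraries it has loaded. The sketch below is Linux-only (it reads `/proc/self/maps`) and relies on the usual library file names: `libiomp5` for Intel, `libomp` for LLVM, `libgomp` for GNU. The function names are made up for illustration.

```python
# Usual file-name prefixes of the common OpenMP runtimes.
_RUNTIMES = {"libiomp": "intel", "libgomp": "gnu", "libomp": "llvm"}


def openmp_runtimes(library_paths):
    """Classify shared-library paths by the OpenMP runtime they belong to."""
    found = set()
    for path in library_paths:
        name = path.rsplit("/", 1)[-1]  # file name without directories
        for prefix, label in _RUNTIMES.items():
            if name.startswith(prefix):
                found.add(label)
    return sorted(found)


def loaded_libraries(maps_path="/proc/self/maps"):
    """Return paths of shared objects mapped into this process (Linux only)."""
    try:
        with open(maps_path) as f:
            return {line.split()[-1] for line in f if ".so" in line}
    except OSError:
        return set()  # not on Linux, or /proc is unavailable
```

To use it, import torch (or tensorflow) first so its libraries are loaded, then call `openmp_runtimes(loaded_libraries())`. Seeing more than one runtime in the result is worth investigating on its own.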

I've run into slow pytorch run times on CPU before, and they improved when I artificially limited the number of threads with a call such as torch.set_num_threads(4). I'm not sure exactly why, but it seems that pytorch sometimes picks an inappropriate number of threads.

jamesborg46 avatar Feb 08 '21 06:02 jamesborg46
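
A minimal sketch of the thread-limiting workaround described above. `limit_cpu_threads` is a hypothetical helper, not part of garage or pytorch; the value 4 is just the number from the comment and should be tuned per machine. The environment variables only take effect for libraries loaded after they are set, so this is best called at the very top of a launcher script.

```python
import os


def limit_cpu_threads(n):
    """Cap the CPU threads used by OpenMP/MKL and, if installed, pytorch.

    Returns the thread count pytorch reports afterwards, or None when
    pytorch is not importable.
    """
    # Read by the OpenMP and MKL runtimes when they are first loaded.
    os.environ["OMP_NUM_THREADS"] = str(n)
    os.environ["MKL_NUM_THREADS"] = str(n)
    try:
        import torch
    except ImportError:
        return None
    # Runtime cap on pytorch's intra-op parallelism.
    torch.set_num_threads(n)
    return torch.get_num_threads()
```

Calling `limit_cpu_threads(4)` before building the policy and sampler and then re-timing the run should show quickly whether thread oversubscription is part of the slowdown.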