Increase GPU Utilization?
This is my first time training on this GPU and using neon. I just started running some examples and I noticed that my GPU only peaks at about 31% during training and mostly stays around 22%. Can I increase the GPU usage somehow in order to train faster?
This will really depend on the model that you are training. If the model has small layers (feature map size and/or filter size), the GPU kernels will execute very quickly and the python code in neon becomes the limiting factor. More computationally intensive models should be able to load the GPU at nearly 100%. For example, you should generally see higher utilization for models training on ImageNet data than on CIFAR or MNIST data. You can also try using more feature maps in each layer to get higher utilization, although whether that is worthwhile will depend on whether it improves your model.
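As a rough sketch of that last suggestion (not from the original thread), here is what widening the feature map counts looks like with neon's layer API; the sizes, initializer, and backend settings below are placeholders, assuming the v1.x-style Conv/Affine containers:

```python
# Sketch only: widening conv layers so each kernel has more parallel work.
# Layer sizes and the initializer are illustrative placeholders.
from neon.backends import gen_backend
from neon.initializers import Gaussian
from neon.layers import Affine, Conv
from neon.transforms import Rectlin, Softmax

be = gen_backend(backend='gpu', batch_size=128)  # assumes a GPU backend is available
init = Gaussian(scale=0.01)

# Narrow model: 16 feature maps per conv layer -> small matmuls, low GPU utilization.
narrow = [Conv((5, 5, 16), init=init, activation=Rectlin()),
          Conv((5, 5, 16), init=init, activation=Rectlin()),
          Affine(nout=10, init=init, activation=Softmax())]

# Wider model: 64 feature maps per conv layer -> more work per kernel launch.
wide = [Conv((5, 5, 64), init=init, activation=Rectlin()),
        Conv((5, 5, 64), init=init, activation=Rectlin()),
        Affine(nout=10, init=init, activation=Softmax())]
```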
Well, for instance, I'm planning to do a lot of work with RNNs. The first example I ran was the Shakespeare text example. I ran python text_generation_lstm.py -b gpu. I was watching a movie at the same time, so prior to running it my GPU was at around 5-10%. While training, it stayed around 25% and took about 1 minute 5 seconds per epoch. I know I could blow through the training if I could utilize most of my card.
Right, so for that network the hidden size is 512 and your batch size is probably set to 128. The LSTM layers are basically doing a for loop over time steps and evaluating several matrix multiplies at each step. In this case we are bound both by available parallelism and by kernel launch overhead. To be compute bound in the GPU kernels and fully utilize the available FLOPs, our matrix multiply kernels use a tile (CUDA block) size of at least 128x32. So at best you are getting 16 blocks (assuming a 512x128 output from the matrix multiply), which may not fill your GPU depending on which one you have (for example, a Titan X has 24 SMs, each of which can concurrently execute 2 blocks of this kernel). The second problem is kernel launch overhead. This matrix multiply kernel will probably complete in under a millisecond because the matrices involved are so small, which may be less than the time it takes the python code to launch the next kernel, leaving your GPU under-utilized. We have some special RNN kernels available to help mitigate this launch overhead, but they currently only support specific cases of the RNN and BiRNN layers.
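To make that arithmetic concrete, here is a quick back-of-the-envelope sketch in plain Python using the numbers quoted above (the 2-blocks-per-SM figure is the one given for this particular kernel):

```python
# Back-of-the-envelope block count for the LSTM case described above:
# a 512 x 128 matmul output, 128 x 32 gemm tiles, and a Titan X with
# 24 SMs, each able to run ~2 blocks of this kernel concurrently.
out_rows, out_cols = 512, 128       # hidden size x batch size
tile_rows, tile_cols = 128, 32      # gemm tile (CUDA block) size

blocks = (out_rows // tile_rows) * (out_cols // tile_cols)  # 4 * 4 = 16
capacity = 24 * 2                                           # 48 concurrent blocks

print(blocks, capacity, blocks / capacity)  # 16 48 0.333... -> only ~1/3 of the GPU is filled
```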
Stewart, it should be pretty trivial to integrate these new gemm kernels into neon:
https://github.com/openai/openai-gemm
The 32x32 tiles really help utilization and are at least as fast as the 128x32 kernels. The python code also memoizes away most of the launch overhead. Those kernels also allow arbitrary mixed precision (though the python code doesn't expose it, to limit the number of kernel permutations for libraries that load them statically).
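Continuing the back-of-the-envelope numbers from earlier in the thread, retiling the same 512x128 output with 32x32 blocks looks like this (illustrative arithmetic only):

```python
# Same 512 x 128 matmul output as before, tiled two ways.
out_rows, out_cols = 512, 128
blocks_128x32 = (out_rows // 128) * (out_cols // 32)  # 4 * 4 = 16 blocks
blocks_32x32  = (out_rows // 32) * (out_cols // 32)   # 16 * 4 = 64 blocks

# 64 blocks is enough to cover the ~48 concurrent block slots on a Titan X,
# whereas 16 blocks leaves roughly two thirds of the SMs idle.
print(blocks_128x32, blocks_32x32)
```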
Thanks for sharing, Scott. I'll take a look at these.
I do have some changes worked out to integrate these GEMM kernels into neon, but I need to figure out how to install the code from the openai-gemm repository alongside neon. For now I'm using a setup.py in my local copy.
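For what it's worth, a stop-gap setup.py for a local clone might look roughly like the sketch below; the module name and file layout are assumptions about the repository, not something stated in the thread:

```python
# Hypothetical stop-gap setup.py dropped into a local clone of
# https://github.com/openai/openai-gemm so that `pip install .` puts the
# module on the same python path as neon. The module name assumes a
# top-level openai_gemm.py; any kernel sources the module loads at runtime
# would still need to be shipped alongside it.
from setuptools import setup

setup(
    name="openai-gemm",
    version="0.0.1",
    py_modules=["openai_gemm"],
    install_requires=["numpy", "pycuda"],
)
```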
Unfortunately, Tom, this won't really give you much improvement on that network. I did some profiling, and the launch overhead is significant, as I expected. However, most of the overhead appears to be coming from the elementwise optree processing (see image below). In the future, our graph-based framework should help with this. We have a preview of it available now if you are interested, but it doesn't support these types of recurrent networks yet: https://github.com/NervanaSystems/ngraph
