Take advantage of CUDA 10 graphs
They're designed to reduce launch overhead in exactly the scenario GeNN hits: a loop over timesteps, each of which launches a sequence of short kernels (shortKernel1, shortKernel2, …, shortKernelN).
See https://devblogs.nvidia.com/cuda-graphs/
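The basic pattern from that blog looks roughly like the sketch below — shortKernel1/shortKernel2 are hypothetical placeholders rather than real GeNN kernels: capture one timestep's launches into a graph, instantiate it once, then replay the whole thing with a single cudaGraphLaunch per timestep.

```cpp
// Minimal sketch of the stream-capture pattern from the blog post.
// shortKernel1/shortKernel2 are hypothetical stand-ins, not GeNN kernels.
#include <cuda_runtime.h>

__global__ void shortKernel1(float *data) { /* ... */ }
__global__ void shortKernel2(float *data) { /* ... */ }

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture one timestep's worth of kernel launches into a graph (CUDA 10+)
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    shortKernel1<<<4, 256, 0, stream>>>(d_data);
    shortKernel2<<<4, 256, 0, stream>>>(d_data);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once; afterwards a single graph launch per timestep
    // replaces N individual kernel launches
    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    for(int t = 0; t < 10000; t++) {
        cudaGraphLaunch(graphExec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```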
Interesting idea ... and a nice intro at that link. I assume that if the kernels are a little less trivial and need to be passed parameters whose values change occasionally, those values would not be "baked in" when the graph is created? ...
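From a quick look at the docs, the argument values do get copied in ("baked in") when the kernel node is created, but cudaGraphExecKernelNodeSetParams (added in CUDA 10.1) can patch a node's parameters in the instantiated graph without rebuilding it. A rough sketch using the explicit graph API, with a hypothetical shortKernel and gain argument standing in, not anything from GeNN:

```cpp
// Sketch of patching a baked-in kernel parameter in an instantiated graph.
// shortKernel and its gain argument are hypothetical, for illustration only.
#include <cuda_runtime.h>

__global__ void shortKernel(float *data, float gain) { /* ... */ }

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    float gain = 1.0f;
    void *args[] = { &d_data, &gain };

    cudaKernelNodeParams params = {};
    params.func = (void*)shortKernel;
    params.gridDim = dim3(4);
    params.blockDim = dim3(256);
    params.kernelParams = args;

    // Build a one-node graph with the explicit API so we keep the node
    // handle; the argument values are copied into the node at this point
    cudaGraph_t graph;
    cudaGraphNode_t node;
    cudaGraphCreate(&graph, 0);
    cudaGraphAddKernelNode(&node, graph, nullptr, 0, &params);

    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    for(int t = 0; t < 10000; t++) {
        if(t == 5000) {
            // Parameter changes occasionally: patch the instantiated graph
            // in place rather than rebuilding it (CUDA 10.1+)
            gain = 2.0f;
            cudaGraphExecKernelNodeSetParams(graphExec, node, &params);
        }
        cudaGraphLaunch(graphExec, 0);
    }
    cudaDeviceSynchronize();

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaFree(d_data);
    return 0;
}
```

Alternatively, if the occasionally-changing values live in device memory, only the pointer is baked in, so a plain cudaMemcpyAsync before launching the graph would also do the job.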
Initial tests were underwhelming (all times averaged over 4 runs on a Titan RTX with no spike recording):
| Benchmark | Standard time [s] | Graph time [s] |
|---|---|---|
| VA benchmark with 10k neurons, running for 10 s | 2.45 | 2.38 |
| Microcircuit, running for 1 s | 0.63 | 0.60 |
All in all, focussing on more optimal spike recording would seem a better use of time! Although it is pretty cool that, on this machine, you can simulate the microcircuit faster than real time if you don't record spikes.

However, the best case here is about a 7 µs improvement per timestep, which is better than the paltry 0.4 µs they manage to achieve in that blog and the 4 µs they advertise here.