Take advantage of CUDA 10 graphs
They're designed to reduce launch overhead in exactly the scenario GeNN hits: a loop over timesteps, each of which launches a sequence of short kernels (shortKernel1, shortKernel2, …, shortKernelN).
See https://devblogs.nvidia.com/cuda-graphs/
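The basic pattern from that blog looks roughly like the sketch below — shortKernel1/shortKernel2 are hypothetical placeholders rather than real GeNN kernels: capture one timestep's launches into a graph, instantiate it once, then replay the whole thing with a single cudaGraphLaunch per timestep.

```cpp
// Minimal sketch of the stream-capture pattern from the blog post.
// shortKernel1/shortKernel2 are hypothetical stand-ins, not GeNN kernels.
#include <cuda_runtime.h>

__global__ void shortKernel1(float *data) { /* ... */ }
__global__ void shortKernel2(float *data) { /* ... */ }

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture one timestep's worth of kernel launches into a graph (CUDA 10+)
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    shortKernel1<<<4, 256, 0, stream>>>(d_data);
    shortKernel2<<<4, 256, 0, stream>>>(d_data);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once; afterwards a single graph launch per timestep
    // replaces N individual kernel launches
    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    for(int t = 0; t < 10000; t++) {
        cudaGraphLaunch(graphExec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```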
Interesting idea ... and a nice intro at that link. I assume that if the kernels are a little less trivial and need to be passed parameters whose values change occasionally, those values would not be "baked in" when the graph is created? ...
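From a quick look at the docs, the argument values do get copied in ("baked in") when the kernel node is created, but cudaGraphExecKernelNodeSetParams (added in CUDA 10.1) can patch a node's parameters in the instantiated graph without rebuilding it. A rough sketch using the explicit graph API, with a hypothetical shortKernel and gain argument standing in, not anything from GeNN:

```cpp
// Sketch of patching a baked-in kernel parameter in an instantiated graph.
// shortKernel and its gain argument are hypothetical, for illustration only.
#include <cuda_runtime.h>

__global__ void shortKernel(float *data, float gain) { /* ... */ }

int main()
{
    float *d_data;
    cudaMalloc(&d_data, 1024 * sizeof(float));

    float gain = 1.0f;
    void *args[] = { &d_data, &gain };

    cudaKernelNodeParams params = {};
    params.func = (void*)shortKernel;
    params.gridDim = dim3(4);
    params.blockDim = dim3(256);
    params.kernelParams = args;

    // Build a one-node graph with the explicit API so we keep the node
    // handle; the argument values are copied into the node at this point
    cudaGraph_t graph;
    cudaGraphNode_t node;
    cudaGraphCreate(&graph, 0);
    cudaGraphAddKernelNode(&node, graph, nullptr, 0, &params);

    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

    for(int t = 0; t < 10000; t++) {
        if(t == 5000) {
            // Parameter changes occasionally: patch the instantiated graph
            // in place rather than rebuilding it (CUDA 10.1+)
            gain = 2.0f;
            cudaGraphExecKernelNodeSetParams(graphExec, node, &params);
        }
        cudaGraphLaunch(graphExec, 0);
    }
    cudaDeviceSynchronize();

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaFree(d_data);
    return 0;
}
```

Alternatively, if the occasionally-changing values live in device memory, only the pointer is baked in, so a plain cudaMemcpyAsync before launching the graph would also do the job.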
Initial tests were underwhelming (all times averaged over 4 runs on a Titan RTX with no spike recording):
| Benchmark | Standard time [s] | Graph time [s] |
|---|---|---|
| VA benchmark with 10k neurons, running for 10 s | 2.45 | 2.38 |
| Microcircuit, running for 1 s | 0.63 | 0.60 |
All in all, focussing on more optimal spike recording would seem a better use of time! Although it is pretty cool that, on this machine, you can simulate the microcircuit faster than real time if you don't record spikes.

However, the best case here is about a 7 µs improvement per timestep, which is better than the paltry 0.4 µs they manage to achieve in that blog and the 4 µs they advertise here.