Investigate whether register usage can be reduced by modifying code generation
The COBAHH example uses more than 32 registers per thread even in single precision, which reduces the theoretical occupancy of the stateupdate kernel, see #266. I'm wondering whether there is an easy way to reduce register usage by changing how the code is generated. Currently, many intermediate variables are produced (the `lio` variables in the generated code). I guess this is optimized for C++ performance, that is, to generate code with as few operations as possible. Is there a way to instead minimize the number of intermediate results that have to remain in registers? On the GPU, reaching 100% theoretical occupancy would be much more important than reducing the number of arithmetic operations.
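To make the 32-register threshold concrete, here is a rough sketch of how register usage caps theoretical occupancy. The hardware numbers are assumptions, not values taken from this issue: 65536 registers and 2048 resident threads per SM, a warp size of 32, and registers allocated per warp in units of 256. These match many NVIDIA GPUs but not all of them. The function name is made up for illustration, and the sketch ignores other limits such as shared memory, block size, and blocks per SM.

```python
import math


def register_limited_occupancy(regs_per_thread,
                               regs_per_sm=65536,        # assumed register file size per SM
                               max_threads_per_sm=2048,  # assumed resident-thread limit per SM
                               warp_size=32,
                               alloc_unit=256):          # assumed per-warp allocation granularity
    """Upper bound on theoretical occupancy imposed by register usage alone."""
    # Registers are handed out per warp, rounded up to the allocation unit.
    regs_per_warp = math.ceil(regs_per_thread * warp_size / alloc_unit) * alloc_unit
    # How many warps fit into the register file.
    warps_by_registers = regs_per_sm // regs_per_warp
    # How many warps the SM can hold at most.
    max_warps = max_threads_per_sm // warp_size
    return min(warps_by_registers, max_warps) / max_warps


if __name__ == "__main__":
    for regs in (32, 40, 64):
        print(regs, register_limited_occupancy(regs))
```

Under these assumptions, 32 registers per thread still allows full occupancy, while 64 registers per thread halves it. The exact limits for a given GPU should be checked against its compute capability.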
Try disabling loop-invariant optimizations. They make sense for C++, where constants shared by all indices of a loop are precomputed once to reduce computation time inside the loop. They make no sense on the GPU, where each thread computes those constants itself, and they likely increase register usage.
See https://github.com/denisalevi/brian2cuda-paper/issues/21