Deciding optimal block size
The CUDA occupancy API (e.g. `cudaOccupancyMaxPotentialBlockSize`) provides the largest block size that achieves the best occupancy.
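As a minimal sketch of how that query looks (the kernel here is a hypothetical stand-in, not an actual FLAMEGPU2 agent function):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for an agent function kernel.
__global__ void exampleAgentKernel(float *data, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Returns a block size achieving maximum potential occupancy,
    // assuming no dynamic shared memory and no block size limit.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, exampleAgentKernel, 0, 0);
    printf("Suggested block size: %d (min grid size: %d)\n", blockSize, minGridSize);
    return 0;
}
```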
Profiling of spatial messaging has, however, demonstrated that the smallest block size that achieves the best occupancy is the preferred choice (this is believed to be due to divergent behaviour).
We will need to re-implement the API function to handle this (it's implemented within cuda_runtime.h). Furthermore, other messaging types may have other preferences.
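One possible shape for that re-implementation, sketched under the assumptions that theoretical occupancy is compared in active warps per SM and that the warp size is 32 (the helper name is illustrative, not existing FLAMEGPU2 API):

```cuda
#include <cuda_runtime.h>

// Sketch: smallest block size (a multiple of the warp size) that matches
// the theoretical occupancy of the API's suggested (largest) block size.
template <class Kernel>
int smallestBlockSizeForMaxOccupancy(Kernel kernel) {
    int minGrid = 0, largest = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGrid, &largest, kernel, 0, 0);

    // Active warps per SM at the suggested block size.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, largest, 0);
    const int targetWarps = blocksPerSM * (largest / 32);

    // Scan upwards from a single warp; return the first size that matches.
    for (int bs = 32; bs < largest; bs += 32) {
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, bs, 0);
        if (blocksPerSM * (bs / 32) == targetWarps) return bs;
    }
    return largest;
}
```

A higher lower bound (e.g. the 64/128-thread floor mentioned later) could be substituted for the scan's starting point.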
It may be worth ensuring that at least one block is launched per SM to maximise performance. This may come as a side effect of choosing the smallest block size which achieves the highest theoretical occupancy.
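That property is cheap to check at launch time; a sketch, where `agentCount` and `blockSize` are placeholders for the actual launch configuration:

```cuda
#include <cuda_runtime.h>

// Sketch: does a launch of agentCount threads at blockSize put
// at least one block on every SM?
bool launchCoversAllSMs(int agentCount, int blockSize, int device = 0) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    const int gridSize = (agentCount + blockSize - 1) / blockSize;
    return gridSize >= prop.multiProcessorCount;
}
```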
When concurrency is involved, one block per SM might not hold on a per-agent-function basis. This also becomes much more complicated when concurrency between agent functions with variable runtimes is relevant. It probably just needs investigating manually and tuning accordingly. Ideally we should test on multiple scales of hardware, although in practice we probably care most about high-end GPUs.
As an extra note worth considering in the future:
Most GPU architectures can have 32 or 64 active warps per SM, and therefore 1024 or 2048 threads per SM. The maximum threads per block appears to be 1024 in all cases (although I'm sure 2048 worked in the past on Maxwell? but I could be very wrong).
Consumer Ampere (SM_86), however, can have 48 active warps per SM (1536 threads). This means that full device occupancy cannot be achieved at the maximum block size of 1024 (only 2/3rds is possible); the arithmetic is sketched below.
This shouldn't have an impact on our plan of generally using the lowest block size which achieves maximum occupancy (within reason, so probably 64/128 threads and above).
Correction to the OP: the cuda_runtime.h occupancy functions are publicly defined in the header. However, the driver API versions (used by RTC) from cuda.h are not; passing an RTC function to the runtime API version causes a fatal CUDA error.
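RTC kernels would instead need the driver API entry points; a sketch, assuming a CUfunction has already been obtained via NVRTC and cuModuleGetFunction:

```cuda
#include <cuda.h>

// Sketch: occupancy queries for an RTC-compiled kernel must use the
// driver API (CUfunction), not the runtime API templates.
int suggestedBlockSizeRTC(CUfunction func) {
    int minGridSize = 0, blockSize = 0;
    // No dynamic shared memory callback, no dynamic smem, no block size limit.
    cuOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, func, nullptr, 0, 0);
    return blockSize;
}
```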
During the CINECA hackathon, we found that a block size of 128 generally offered the best performance for spatial message iteration kernels (when this also achieved maximum occupancy). There may be cases where this is not true; e.g. we feel brute-force messaging might benefit from larger block sizes.
For the spatial case (RTC?), changing the block size led to a ~10% performance improvement, so this is worth pursuing.