Deciding optimal block size
The CUDA occupancy API (e.g. `cudaOccupancyMaxPotentialBlockSize`) provides the largest block size that achieves the best occupancy.
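As a minimal sketch of how that query looks (the kernel here is a hypothetical stand-in, not an actual FLAMEGPU2 agent function):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for an agent function kernel.
__global__ void exampleAgentKernel(float *data, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Returns a block size achieving maximum potential occupancy,
    // assuming no dynamic shared memory and no block size limit.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, exampleAgentKernel, 0, 0);
    printf("Suggested block size: %d (min grid size: %d)\n", blockSize, minGridSize);
    return 0;
}
```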
Profiling of spatial messaging has, however, demonstrated that the smallest block size that achieves the best occupancy is the preferred choice (this is believed to be due to divergent behaviour).
We will need to re-implement the API function to handle this (it's implemented within cuda_runtime.h). Furthermore, other messaging types may have other preferences.
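One possible shape for that re-implementation, sketched under the assumptions that theoretical occupancy is compared in active warps per SM and that the warp size is 32 (the helper name is illustrative, not existing FLAMEGPU2 API):

```cuda
#include <cuda_runtime.h>

// Sketch: smallest block size (a multiple of the warp size) that matches
// the theoretical occupancy of the API's suggested (largest) block size.
template <class Kernel>
int smallestBlockSizeForMaxOccupancy(Kernel kernel) {
    int minGrid = 0, largest = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGrid, &largest, kernel, 0, 0);

    // Active warps per SM at the suggested block size.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, largest, 0);
    const int targetWarps = blocksPerSM * (largest / 32);

    // Scan upwards from a single warp; return the first size that matches.
    for (int bs = 32; bs < largest; bs += 32) {
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, bs, 0);
        if (blocksPerSM * (bs / 32) == targetWarps) return bs;
    }
    return largest;
}
```

A higher lower bound (e.g. the 64/128-thread floor mentioned later) could be substituted for the scan's starting point.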
It may be worth ensuring that at least one block is launched per SM to maximise performance. This may come as a side effect of choosing the smallest block size which achieves the highest theoretical occupancy.
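That property is cheap to check at launch time; a sketch, where `agentCount` and `blockSize` are placeholders for the actual launch configuration:

```cuda
#include <cuda_runtime.h>

// Sketch: does a launch of agentCount threads at blockSize put
// at least one block on every SM?
bool launchCoversAllSMs(int agentCount, int blockSize, int device = 0) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    const int gridSize = (agentCount + blockSize - 1) / blockSize;
    return gridSize >= prop.multiProcessorCount;
}
```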
When concurrency is involved, one block per SM might not hold on a per-agent-function basis. This also becomes much more complicated when concurrency between agent functions with variable runtimes is relevant. It probably just needs investigating manually and tuning accordingly. Ideally we should test on multiple scales of hardware, although in practice we probably care most about high-end GPUs.
As an extra note worth considering in the future:
Most GPU architectures can have 32 or 64 active warps per SM, and therefore 1024 or 2048 threads per SM. The maximum threads per block appears to be 1024 in all cases (although I'm sure 2048 worked in the past on Maxwell? but I could be very wrong).
Consumer Ampere (SM_86), however, can have 48 active warps per SM (1536 threads). This means that full device occupancy cannot be achieved at the maximum block size of 1024 (only 2/3rds is possible); the arithmetic is sketched below.
This shouldn't have an impact on our plan of generally using the lowest block size which achieves maximum occupancy (within reason, so probably 64/128 threads and above).
Correction to the OP: the cuda_runtime.h occupancy functions are publicly defined in the header. However, the driver API versions (used by RTC) from cuda.h are not; passing an RTC function to the runtime API version causes a fatal CUDA error.
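RTC kernels would instead need the driver API entry points; a sketch, assuming a CUfunction has already been obtained via NVRTC and cuModuleGetFunction:

```cuda
#include <cuda.h>

// Sketch: occupancy queries for an RTC-compiled kernel must use the
// driver API (CUfunction), not the runtime API templates.
int suggestedBlockSizeRTC(CUfunction func) {
    int minGridSize = 0, blockSize = 0;
    // No dynamic shared memory callback, no dynamic smem, no block size limit.
    cuOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, func, nullptr, 0, 0);
    return blockSize;
}
```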
During the CINECA hackathon, we found that a block size of 128 generally offered the best performance for spatial message iteration kernels (when this also achieved maximum occupancy). There may be cases where this is not true; e.g. we feel brute-force messaging might benefit from larger block sizes.
For the spatial case (RTC?), changing the block size led to a ~10% performance improvement, so this is worth pursuing.