FLAMEGPU2
Reduce max Curve probes
The CINECA hackathon exposed that the probing loop inside Curve::getVariable(const VariableHash variable_hash)
has a significant impact on performance for the non-RTC brute force model.
Running ~163k agents on V100s, the default configuration ran a step of the input kernel in 387ms; however, when the loop was capped to 2 iterations (after checking there would be no more probes), the runtime dropped to 217ms (compared to 145ms for RTC).
There are a few ways to consider improving this:
- Rework Curve to use cuckoo/Robin Hood (or perfect) hashing to reduce the maximum number of probes required.
- Improve the compile-time hash; we are seeing too many collisions at low load factors (e.g. ~20/1024), which suggests a problem with our hashing scheme.
- Restructure the function to aid NVCC's branch prediction. E.g. would manually unrolling the first few iterations, in which we expect probing to complete, help? Would switching to a recursive approach help? (I did attempt two forms of recursive loop, one with a max depth and one without; both performed similarly to the default loop version.)
- Enforce a maximum number of probes (e.g. 10) by throwing an exception when it is exceeded. If Curve were moved to shared memory, this would reduce pressure by giving ensembles separate Curve hash tables, and would allow the max probe count to be changed at runtime.
Profiles which informed this (c1024i1024 is default configuration): https://drive.google.com/drive/folders/1M16Xhe0P9efcNb_3gRyW3TSZBiVru2F6?usp=sharing