FLAMEGPU2 Migrate to Jitify2

[x] RTC and Execution (Works with CUDA12.0, Windows/Linux)
[x] Serialisation/Deserialisation (jitify2 pre-processed serialised objects are 2-3x larger)
[x] CUDA 12.3 support
- [x] Works with https://github.com/NVIDIA/jitify/pull/128, hence wait for merge.
- [x] Same branch deprecates the launch() method (used in CUDASimulation), says to replace with launch_raw().
[x] Investigate access violation from cuda().ModuleUnload() during sim shutdown, when CUDAAgent map is cleared by CUDASimulation destructor.
- Has occurred on Windows & Linux, but it's inconsistent.
- gdb log: https://gist.github.com/Robadob/3cb0e93014d56f05587f4ea1ea581203 (with jitify2-misc-fixes2 branch)
[x] Reimplement jitify1 demangle used in curve_rtc.cpp
[x] Optimise serialisation load time
- Jitify2 serialises pre-NVRTC; loading this is 50x slower and the serialised blob is 2x larger
- Waiting for clarity: https://github.com/NVIDIA/jitify/issues/133
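
For the demangle item above, a minimal sketch of one way to turn `typeid` names into readable type names without Jitify1. It assumes GCC or Clang, where `abi::__cxa_demangle` is available; on MSVC `typeid(T).name()` is already human readable. The helper names `demangle` and `type_name` are hypothetical and this is not the code used in curve_rtc.cpp.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang only)
#include <cstdlib>    // std::free
#include <memory>
#include <string>
#include <typeinfo>

// Hypothetical helper: demangle an Itanium-ABI mangled name.
// Falls back to the input string if demangling fails.
inline std::string demangle(const char *mangled) {
    int status = 0;
    // __cxa_demangle returns a malloc'd buffer, so release it with std::free.
    std::unique_ptr<char, void (*)(void *)> buf(
        abi::__cxa_demangle(mangled, nullptr, nullptr, &status), std::free);
    return (status == 0 && buf) ? std::string(buf.get()) : std::string(mangled);
}

// Hypothetical convenience wrapper: readable name of a type T.
template <typename T>
std::string type_name() {
    return demangle(typeid(T).name());
}
```

Calling `type_name<unsigned int>()` would give a string suitable for pasting into generated RTC source, which is the kind of use the Jitify1 demangle had here.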

Optimise compile time
- [ ] Rework of old header hack? (would require loading headers from file)
  - Preloading fgpu headers only cuts agent fn time from 6.8s to 4.1s.
  - Loading CUDA headers too makes a big difference, but these may not be particularly stable between versions
- [ ] Offline pre-process FLAMEGPU2 include hierarchy into a single header file? (using jitify tools?)
- [x] Wait for this PR to be merged?

[ ] Visual Studio 2019 support (we may be able to drop this)
[x] ManyLinux2014 support

Nov 15 '23 16:11 Robadob

Update regarding header pre-loading with Jitify2/CUDA 12.3

Windows/CUDA 12.0

No preload
Millis: 6822.000000
Millis: 6853.000000

Preloading FLAMEGPU headers
Millis: 4045.000000
Millis: 4277.000000

Preload FLAMEGPU + CUDA headers
Millis: 1296.000000
Millis: 1667.000000

Linux/CUDA 12.3

Jitify 2 from scratch (Waimu)
Millis: 25318.000000
Millis: 24143.000000

Preload FLAMEGPU + CUDA headers
Millis: 1376.000000
Millis: 2218.000000

CUDA 12.0 has ~30 CUDA headers to preload. CUDA 12.3 has ~257 CUDA headers to preload. (List contains some dupes)

It is not clear whether we want to generalise this code to better handle different CUDA versions, because the header list could need updating with each CUDA release.
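
One possible way to avoid a hard-coded, per-version header list is to discover headers at runtime from an include directory. The sketch below is an assumption about how that could look, not the code in this branch: `preload_headers` and `HeaderMap` are hypothetical names, the extension filter is a guess, and how the resulting map is handed to Jitify2 is deliberately not shown. Scanning a whole CUDA include directory would also load far more than the headers actually required, so a real version would likely need a filter.

```cpp
#include <filesystem>
#include <fstream>
#include <map>
#include <sstream>
#include <string>

namespace fs = std::filesystem;

// Include name (path relative to the include root, '/' separated) -> source.
using HeaderMap = std::map<std::string, std::string>;

// Hypothetical helper: read every header under include_root into memory.
inline HeaderMap preload_headers(const fs::path &include_root) {
    HeaderMap headers;
    if (!fs::is_directory(include_root)) return headers;
    for (const auto &entry : fs::recursive_directory_iterator(include_root)) {
        if (!entry.is_regular_file()) continue;
        // Assumed set of header extensions; adjust as needed.
        const std::string ext = entry.path().extension().string();
        if (ext != ".h" && ext != ".hpp" && ext != ".cuh") continue;
        std::ifstream in(entry.path(), std::ios::binary);
        std::ostringstream contents;
        contents << in.rdbuf();
        // generic_string() keeps '/' separators on Windows too.
        headers.emplace(
            fs::relative(entry.path(), include_root).generic_string(),
            contents.str());
    }
    return headers;
}
```

Keying the map by the path relative to the include root means an entry such as `flamegpu/flamegpu.h` matches the name used in an `#include` directive, at the cost of reading files from disk on first use.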

Edit: Removed from-cache times, latest commit has these matching Jitify1.

Nov 20 '23 10:11 Robadob

The current issue holding back the Jitify2 preprocessor branch is that it expects our flamegpu headers to be included as system headers (`<>`) rather than with quotes (`""`). Waiting to hear back from the dev (Ben) before I try to correct that on our side.

Nov 21 '23 16:11 Robadob

Did three full test runs last night and all passed; however, in those cases the CMake Jitify dependency was pointing at the preprocess branch. Not currently using that here, as it causes all Windows CI to fail with WError.

- Linux/CUDA12.3/Seatbelts ON/GLM ON/Release
- Linux/CUDA12.3/Seatbelts OFF/GLM ON/Release
- Windows/CUDA12.0/Seatbelts ON/GLM OFF/Debug

In release builds kernels are taking ~1 second to compile each. As Jitify is now doing the pre-processing, this is closer to 2.5 seconds under Debug builds.

Nov 23 '23 10:11 Robadob
