FLAMEGPU2
Migrate to Jitify2
- [x] RTC and Execution (Works with CUDA12.0, Windows/Linux)
- [x] Serialisation/Deserialisation (jitify2 pre-processed serialised objects are 2-3x larger)
- [x] CUDA 12.3 support
- [x] Works with https://github.com/NVIDIA/jitify/pull/128, hence wait for merge.
- [x] Same branch deprecates the `launch()` method (used in `CUDASimulation`), says to replace with `launch_raw()`.
- [x] Investigate access violation from `cuda().ModuleUnload()` during sim shutdown, when `CUDAAgent` map is cleared by `CUDASimulation` destructor.
  - Has occurred on Windows & Linux, but it's inconsistent.
  - gdb log: https://gist.github.com/Robadob/3cb0e93014d56f05587f4ea1ea581203 (with `jitify2-misc-fixes2` branch)
- [x] Reimplement jitify1 demangle used in `curve_rtc.cpp`
- [x] Optimise serialisation load time
  - Jitify2 serialises pre-NVRTC, this is 50x slower and produces 2x larger serial blob
  - Waiting for clarity: https://github.com/NVIDIA/jitify/issues/133
- Optimise compile time
  - [ ] Rework of old header hack? (would require loading headers from file)
    - Preloading fgpu headers only cuts agent fn time from 6.8s to 4.1s.
    - Loading CUDA headers too makes a big difference, but these may not be particularly stable between versions
  - [ ] Offline pre-process FLAMEGPU2 include hierarchy into a single header file? (using jitify tools?)
  - [x] Wait for this PR to be merged?
- [ ] Visual Studio 2019 support (we may be able to drop this)
- [x] ManyLinux2014 support
Update regarding header pre-loading with Jitify2/CUDA 12.3
Windows/CUDA 12.0
No preload
Millis: 6822.000000
Millis: 6853.000000
Preloading FLAMEGPU headers
Millis: 4045.000000
Millis: 4277.000000
Preload FLAMEGPU + CUDA headers
Millis: 1296.000000
Millis: 1667.000000
Linux/CUDA 12.3
Jitify 2 from scratch (Waimu)
Millis: 25318.000000
Millis: 24143.000000
Preload FLAMEGPU + CUDA headers
Millis: 1376.000000
Millis: 2218.000000
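To make the figures above easier to compare, here is a minimal sketch that averages the runs for each configuration and expresses the gain as a ratio. The helper names `mean_ms` and `speedup` are illustrative only and are not part of FLAMEGPU2 or Jitify.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Mean of a set of timing samples, in milliseconds.
double mean_ms(const std::vector<double>& samples) {
    assert(!samples.empty());
    return std::accumulate(samples.begin(), samples.end(), 0.0) /
           static_cast<double>(samples.size());
}

// Ratio of mean(before) to mean(after); values above 1 mean `after` is faster.
double speedup(const std::vector<double>& before,
               const std::vector<double>& after) {
    return mean_ms(before) / mean_ms(after);
}
```

With the numbers reported here, `speedup({6822, 6853}, {4045, 4277})` is roughly 1.6x (Windows, FLAMEGPU headers only), `speedup({6822, 6853}, {1296, 1667})` is roughly 4.6x (Windows, FLAMEGPU + CUDA headers), and `speedup({25318, 24143}, {1376, 2218})` is roughly 13.8x (Linux). Each configuration only has two samples, so these ratios are rough.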
CUDA 12.0 has ~30 CUDA headers to preload. CUDA 12.3 has ~257 CUDA headers to preload. (List contains some dupes)
It's not clear whether we would want to generalise this code to better handle different CUDA versions, because we could potentially need to update the header list with each CUDA release.
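For illustration, a generalised preload could look something like the sketch below: read a list of header names from an include directory into a name-to-source map, skipping duplicates and any header that a given CUDA version does not ship. The function `preload_headers` and its parameters are hypothetical (C++17), not code from this branch, and how the resulting map is handed to Jitify2 is deliberately left out.

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: load each named header from include_dir into a
// name -> source map that could later be supplied to a runtime compiler.
std::unordered_map<std::string, std::string> preload_headers(
    const std::filesystem::path& include_dir,
    const std::vector<std::string>& header_names) {
    std::unordered_map<std::string, std::string> sources;
    for (const std::string& name : header_names) {
        if (sources.count(name) != 0) {
            continue;  // the header list may contain duplicates
        }
        std::ifstream in(include_dir / name, std::ios::binary);
        if (!in) {
            continue;  // header not present in this CUDA version
        }
        sources.emplace(name,
                        std::string(std::istreambuf_iterator<char>(in),
                                    std::istreambuf_iterator<char>()));
    }
    return sources;
}
```

Silently skipping missing headers makes one list usable across CUDA versions, at the cost of hiding the case where a header that is actually needed has been renamed or removed.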
Edit: Removed from-cache times, latest commit has these matching Jitify1.
The current issue holding back the Jitify2 preprocessor branch is that it expects our flamegpu headers to be included as system headers (`<>`) rather than with quotes (`""`). Waiting to hear back from the dev (Ben) before I try to correct that on our side.
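Purely as an illustration of the two include forms, the sketch below rewrites quoted includes of paths under `flamegpu/` into the angle-bracket form in a source string. The function name `to_system_includes` is hypothetical, and this is not necessarily how the issue would be resolved here; changing the includes in the headers themselves, or a fix on the Jitify side, are equally plausible.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: turn  #include "flamegpu/..."  into  #include <flamegpu/...>
// and leave every other include untouched.
std::string to_system_includes(const std::string& source) {
    // Group 1: the "#include" token plus any spacing; group 2: the header path.
    static const std::regex quoted_include(
        R"re((#\s*include\s*)"(flamegpu/[^"]+)")re");
    return std::regex_replace(source, quoted_include, "$1<$2>");
}
```

For example, a line `#include "flamegpu/example.h"` (a made-up header name) becomes `#include <flamegpu/example.h>`, while `#include "other.h"` is left as it is.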
Did three full test runs last night and all passed. However, in those cases the CMake jitify dependency was pointing at the preprocess branch. I'm not currently using that here, as it causes all Windows CI to fail with WError.
- Linux/CUDA12.3/Seatbelts ON/GLM ON/Release
- Linux/CUDA12.3/Seatbelts OFF/GLM ON/Release
- Windows/CUDA12.0/Seatbelts ON/GLM OFF/Debug
In Release builds, kernels are taking ~1 second each to compile. As Jitify is now doing the pre-processing, this is closer to 2.5 seconds each under Debug builds.