celeritas
celeritas copied to clipboard
Add support for NVHPC `-stdpar`
Explore auto-parallelization using Nvidia's PGI-derived NVHPC tool suite. We can track development issues on here.
Our initial path is just to modify the host code pathways so that they always run on device, and later we'll cleanly support both hose and device dispatch.
- [x] Install geant4
- [x] unsupported procedure
- [ ] references to stack variables
- [ ] ...
Issues (newest first)
memory access error
terminate called after throwing an instance of 'thrust::system::system_error'
what(): for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
### CAUGHT SIGNAL: 6 ### address: 0x3ea0003faa38, signal = SIGABRT, value = 6, description = abort program (formerly SIGIOT).
running through cuda-gdb:
CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x18c3a40 (ProcessPrimariesLauncher.hh:55)
Thread 1 "celeritas_globa" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x00000000018c3a50 in celeritas::detail::ProcessPrimariesLauncher<(celeritas::MemSpace)0>::operator() (tid=...)
at /home/users/s3j/.local/src/celeritas/src/celeritas/track/detail/ProcessPrimariesLauncher.hh:55
55 TrackInitializer& init = data_.initializers[ThreadId(
This is because data_ is a reference to memory on the host stack. We're going to have to change all our kernel calls to either:
- allocate all of our arguments with "global memory" (which might be more efficient for HIP anyway?)
- store launcher data by value and make sure we're not accidentally passing references into the stdpar launch calls
invalid validate
celeritas: internal assertion failed: CELER_VALIDATE cannot be called from device code" thrown in the test fixture's constructor.- Use
if targetmagic to conditionally compile for host
unreachable unreachable
nvlink error : Undefined reference to '__builtin_unreachable' in 'src/CMakeFiles/celeritas.dir/celeritas/em/generated/BetheHeitlerInteract.cc.o'- Just don't define it when building stdpar (of course now we get lots of invalid noreturn warnings)
atomics
- Since CUDA/HIP are disabled
size_typewas defaulting tosize_tinstead ofunsigned int - Then
src/libceleritas.so: undefined reference toatomicAdd(unsigned int*, unsigned int)'` due to host code also referencing it - Use
if targetmagic

demo interactor resize
"/home/users/s3j/.local/src/celeritas/app/demo-interactor/HostKNDemoRunner.cc", line 87: error: no instance of overloaded function "resize" matches the argument list
argument types are: (demo_interactor::DetectorStateData<celeritas::Ownership::value, celeritas::MemSpace::host> *, demo_interactor::DetectorParamsData, demo_interactor::KNDemoRunArgs::size_type)
resize(&detector_states, detector_params, args.max_steps);
just skip the demo interactor for now
unsupported procedure
- @paulromano got errors while trying to build InitTracks.cc:
NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unsupported procedure - @mcolg tracked this down to a
CELER_VALIDATE - I've updated
CELER_DEVICE_COMPILEto act as though we're in "device compile" mode when using-stdpar98122dc9952f3790a3ebb079b4732585a05a3ed5
Geant4 build
- Geant4 threads are incompatible (nvhpc doesn't like
static thread_localin template classes) - Recursive template instantiation depth is too small
- Patched spack with https://github.com/spack/spack/pull/32185
- Fixed upstream geant4 as
emdna-V11-00-25
Warnings
Fixed numerous warnings in https://github.com/celeritas-project/celeritas/pull/486
Test failures
@pcanal dug down on some slight floating point differences between vanilla GCC and stdpar: we're making incorrectly strict assumptions about floating point behavior in a couple of our unit tests: 2e04478ea9831b5222d6ac53374f333d1cfa7677
We've decided to suspend work on this for now: if AMD hints at having experimental support for automatic offloading (or something like it) then it will definitely be worth reopening to investigate it as a portability layer.
We've decided to suspend work on this for now: if AMD hints at having experimental support for automatic offloading (or something like it) then it will definitely be worth reopening to investigate it as a portability layer.
I'm not sure if this is still of interest, but if it is we've added support for fairly symmetric functionality, please see here and here. We'd definitely be interested in cooperating:)
Hey @AlexVlx that's great! Our team is a little overloaded at the moment, but this would be a great project for an intern to implement? We're going to try to bring in more people next year onto our team, and if you have any summer students (or heck, winter students!) we'd love to get in touch and help get this effort off the ground.
You should also explore using heterogenous memory management (HMM) since it allows the device to access static host memory, including stack objects. It's best used on systems with high-speed links, such as NVLink on Grace Hopper systems, but works, albeit slower, over PCIe connections. This article, which I co-authored, might help as well.
Thanks @mcolg ! Since the time that we first explored this, we did some substantial refactoring of how we launch kernels (see #743 and #783) to fix various odd behaviors we saw on multiple platforms due to passing too much data as a kernel launch argument. I think we'll encounter many fewer problems next time we try...
Supersedes #1067