
Add support for NVHPC `-stdpar`

Open sethrj opened this issue 3 years ago • 6 comments

Explore auto-parallelization using NVIDIA's PGI-derived NVHPC tool suite. We can track development issues here.

Our initial path is just to modify the host code pathways so that they always run on device; later we'll cleanly support both host and device dispatch.

  • [x] Install geant4
  • [x] unsupported procedure
  • [ ] references to stack variables
  • [ ] ...

Issues (newest first)

memory access error

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  for_each: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

### CAUGHT SIGNAL: 6 ### address: 0x3ea0003faa38,  signal =  SIGABRT, value =    6, description = abort program (formerly SIGIOT).

running through cuda-gdb:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x18c3a40 (ProcessPrimariesLauncher.hh:55)

Thread 1 "celeritas_globa" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x00000000018c3a50 in celeritas::detail::ProcessPrimariesLauncher<(celeritas::MemSpace)0>::operator() (tid=...)
    at /home/users/s3j/.local/src/celeritas/src/celeritas/track/detail/ProcessPrimariesLauncher.hh:55
55	    TrackInitializer& init    = data_.initializers[ThreadId(

This is because `data_` is a reference to memory on the host stack. We're going to have to change all our kernel calls to either:

  • allocate all of our arguments in "global memory" (which might be more efficient for HIP anyway?), or
  • store launcher data by value and make sure we're not accidentally passing references into the stdpar launch calls (see the sketch below).
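
A minimal sketch of the failure mode and the by-value fix, using illustrative types rather than the actual Celeritas launcher: under `-stdpar`, heap allocations are interposed as CUDA managed memory, but the host stack is not device-accessible.

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <numeric>
#include <vector>

// Stand-in for the collection-group data a launcher would hold; 'values'
// points into heap (managed under -stdpar) storage.
struct StateData
{
    int* values;
};

void launch_by_value(StateData data, std::size_t num_tracks)
{
    std::vector<std::size_t> tids(num_tracks);
    std::iota(tids.begin(), tids.end(), std::size_t{0});

    // OK: 'data' is copied into the lambda, so the offloaded region only
    // dereferences pointers into managed heap memory.
    std::for_each(std::execution::par_unseq, tids.begin(), tids.end(),
                  [data](std::size_t tid) { data.values[tid] = 1; });
}

void launch_by_reference(StateData const& data, std::size_t num_tracks)
{
    std::vector<std::size_t> tids(num_tracks);
    std::iota(tids.begin(), tids.end(), std::size_t{0});

    // BROKEN: the lambda captures a reference whose referent lives on the
    // host stack, so dereferencing it on device is an illegal memory access.
    std::for_each(std::execution::par_unseq, tids.begin(), tids.end(),
                  [&data](std::size_t tid) { data.values[tid] = 1; });
}
```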

invalid validate

  • "celeritas: internal assertion failed: CELER_VALIDATE cannot be called from device code" thrown in the test fixture's constructor
  • Use "if target" magic to conditionally compile for host (see the sketch below)
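
One possible form of that "if target" dispatch uses the `NV_IF_ELSE_TARGET` macro from the `<nv/target>` header that ships with NVHPC/libcu++. This is an illustrative failure helper, not the actual `CELER_VALIDATE` implementation:

```cpp
#include <nv/target>
#include <stdexcept>

// Illustrative only: select a host or device failure path per compilation target.
inline void validate_fail_example(char const* msg)
{
    NV_IF_ELSE_TARGET(NV_IS_DEVICE,
        (
            // Device code cannot throw; a real implementation might trap or
            // record an error flag for the host to check later.
            (void)msg;
        ),
        (
            throw std::runtime_error(msg);
        ))
}
```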

unreachable unreachable

  • nvlink error : Undefined reference to '__builtin_unreachable' in 'src/CMakeFiles/celeritas.dir/celeritas/em/generated/BetheHeitlerInteract.cc.o'
  • Just don't define it when building stdpar (of course now we get lots of invalid noreturn warnings); see the sketch below
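
A hedged sketch of that workaround: the macro name and the `EXAMPLE_USING_STDPAR` build flag are hypothetical stand-ins, not the actual Celeritas definitions.

```cpp
// EXAMPLE_USING_STDPAR is a hypothetical flag assumed to be set by the build
// system when compiling with nvc++ -stdpar.
#if defined(__NVCOMPILER) && defined(EXAMPLE_USING_STDPAR)
// No usable __builtin_unreachable in this mode: degrade to a no-op (which is
// what produces the spurious noreturn/missing-return warnings).
#    define EXAMPLE_UNREACHABLE() ((void)0)
#else
#    define EXAMPLE_UNREACHABLE() __builtin_unreachable()
#endif
```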

atomics

  • Since CUDA/HIP are disabled, `size_type` was defaulting to `size_t` instead of `unsigned int`
  • Then `src/libceleritas.so: undefined reference to 'atomicAdd(unsigned int*, unsigned int)'` due to host code also referencing it
  • Use "if target" magic (see the sketch below)
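
A sketch of what that could look like for an atomic add. The function name is illustrative, and it assumes NVHPC's stdpar GPU compilation (where `atomicAdd` is visible in device code) plus a host compiler that supports the GCC-style `__atomic` builtins:

```cpp
#include <nv/target>

using size_type = unsigned int;

// Illustrative only: dispatch to the CUDA intrinsic on device and a host
// atomic builtin otherwise, so neither pass references an undefined symbol.
inline size_type atomic_add_example(size_type* address, size_type value)
{
    NV_IF_ELSE_TARGET(NV_IS_DEVICE,
        (
            return atomicAdd(address, value);
        ),
        (
            return __atomic_fetch_add(address, value, __ATOMIC_SEQ_CST);
        ))
}
```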

demo interactor resize

"/home/users/s3j/.local/src/celeritas/app/demo-interactor/HostKNDemoRunner.cc", line 87: error: no instance of overloaded function "resize" matches the argument list
            argument types are: (demo_interactor::DetectorStateData<celeritas::Ownership::value, celeritas::MemSpace::host> *, demo_interactor::DetectorParamsData, demo_interactor::KNDemoRunArgs::size_type)
      resize(&detector_states, detector_params, args.max_steps);

  • Just skip the demo interactor for now

unsupported procedure

  • @paulromano got errors while trying to build InitTracks.cc: NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unsupported procedure
  • @mcolg tracked this down to a CELER_VALIDATE
  • I've updated CELER_DEVICE_COMPILE to act as though we're in "device compile" mode when using -stdpar: 98122dc9952f3790a3ebb079b4732585a05a3ed5 (see the sketch below)
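
The idea, sketched with illustrative names (the hypothetical `EXAMPLE_USING_STDPAR` flag stands in for whatever the build defines; this is not the actual `CELER_DEVICE_COMPILE` logic):

```cpp
// Treat an NVHPC stdpar GPU build like a device compile so that device-only
// restrictions (e.g. "CELER_VALIDATE cannot be called") kick in there too.
#if defined(__CUDA_ARCH__) || defined(__HIP_DEVICE_COMPILE__) \
    || (defined(__NVCOMPILER) && defined(EXAMPLE_USING_STDPAR))
#    define EXAMPLE_DEVICE_COMPILE 1
#endif
```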

Geant4 build

  • Geant4 threads are incompatible (nvhpc doesn't like `static thread_local` in template classes; see the illustration below)
  • Recursive template instantiation depth is too small
  • Patched spack with https://github.com/spack/spack/pull/32185
  • Fixed upstream geant4 as emdna-V11-00-25
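
An illustration (not actual Geant4 code) of the pattern that tripped up nvc++: a `static thread_local` data member of a class template, which requires a thread-local static definition for every instantiation.

```cpp
template<class T>
class PerThreadCache
{
  public:
    static T& local() { return instance_; }

  private:
    // The combination of a class template, a static member, and thread_local
    // storage is what the compiler had trouble with.
    static thread_local T instance_;
};

template<class T>
thread_local T PerThreadCache<T>::instance_{};
```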

Warnings

Fixed numerous warnings in https://github.com/celeritas-project/celeritas/pull/486

Test failures

@pcanal dug into some slight floating point differences between vanilla GCC and stdpar: we were making incorrectly strict assumptions about floating point behavior in a couple of our unit tests (fixed in 2e04478ea9831b5222d6ac53374f333d1cfa7677).
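
For example (plain GoogleTest macros here, not the Celeritas test helpers), comparing with a tolerance instead of exact equality avoids failures from legitimate compiler-to-compiler floating point differences:

```cpp
#include <cmath>
#include <gtest/gtest.h>

TEST(FloatBehavior, ToleranceBasedComparison)
{
    double expected = 0.1 + 0.2;
    double actual = 0.3;
    // EXPECT_EQ would demand bit-for-bit equality, which can differ between
    // compilers and optimization modes; compare with a relative tolerance.
    EXPECT_NEAR(expected, actual, 1e-12 * std::fabs(expected));
}
```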

sethrj avatar Sep 03 '22 16:09 sethrj

We've decided to suspend work on this for now: if AMD hints at having experimental support for automatic offloading (or something like it) then it will definitely be worth reopening to investigate it as a portability layer.

sethrj avatar Sep 29 '22 15:09 sethrj

I'm not sure if this is still of interest, but if it is, we've added support for fairly symmetric functionality; please see here and here. We'd definitely be interested in cooperating :)

AlexVlx avatar Aug 15 '24 23:08 AlexVlx

Hey @AlexVlx, that's great! Our team is a little overloaded at the moment, but this would be a great project for an intern to implement. We're going to try to bring more people onto our team next year, and if you have any summer students (or heck, winter students!) we'd love to get in touch and help get this effort off the ground.

sethrj avatar Aug 16 '24 02:08 sethrj

You should also explore using heterogeneous memory management (HMM), since it allows the device to access static host memory, including stack objects. It's best used on systems with high-speed links, such as NVLink on Grace Hopper systems, but it works, albeit more slowly, over PCIe connections. This article, which I co-authored, might help as well.

mcolg avatar Aug 16 '24 15:08 mcolg

Thanks @mcolg ! Since the time that we first explored this, we did some substantial refactoring of how we launch kernels (see #743 and #783) to fix various odd behaviors we saw on multiple platforms due to passing too much data as a kernel launch argument. I think we'll encounter many fewer problems next time we try...

sethrj avatar Aug 16 '24 15:08 sethrj

Supersedes #1067

sethrj avatar Apr 17 '25 12:04 sethrj