StatusChecker test crashes with a debug build of VecGeom
With a Release build of VecGeom, track_StatusChecker fails but is reported by ctest as succeeding (see details below).
With a Debug build of VecGeom, the output is similar, but this time it is reported as a failure (at least for me).
<details><summary>Details</summary>
$ /home/pcanal/geant/spack/2025-03-builds/celeritas/test/celeritas/track_StatusChecker
Celeritas version 0.6.0-rc.1.11+CudaRdcUtils_v4.a7ef40dfe
Testing on 1 process
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from StatusCheckerTest
[ RUN ] StatusCheckerTest.host
status: Loading VecGeom geometry from GDML at /home/pcanal/geant/sources/celeritas/test/geocel/data/two-boxes.gdml
status: Initializing tracking information
info: CUDA stack size was changed from 1024 to 8192; restoring to original values
/home/pcanal/geant/sources/celeritas/src/celeritas/phys/PhysicsParams.cc:398: warning: No processes are defined for particle 'electron'
status: Celeritas core setup complete
rank 0: status: Celeritas core state initialization complete
rank 0: /home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:87: error: status was improperly reverted: {"geo":{"dir":[0.0,0.0,1.0],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"particle":{"energy":[8.0,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"along-step-neutral","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"physics-discrete-select","status":"initializing","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
info: Exception during action 1 came from: kernel context: track slot 15 in 'status-check', track 7 of event 0
rank 0: /home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:138: error: along-step action cannot yet change: {"geo":{"dir":[0.0,0.0,1.0],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"particle":{"energy":[8.0,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"pre-step","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"scat-klein-nishina","status":"alive","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
info: Exception during action 4 came from: kernel context: track slot 15 in 'status-check', track 7 of event 0
rank 0: /home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:153: error: new post-step action is out of order: {"geo":{"dir":[-0.1808708751922138,-0.14253713853809177,0.9731232659044667],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"particle":{"energy":[5.594297051238843,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"along-step-neutral","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"physics-discrete-select","status":"alive","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
info: Exception during action 6 came from: kernel context: track slot 15 in 'status-check', track 7 of event 0
[ OK ] StatusCheckerTest.host (230 ms)
[ RUN ] StatusCheckerTest.device
info: Creating 1 device streams
/home/pcanal/geant/sources/celeritas/src/celeritas/phys/PhysicsParams.cc:398: warning: No processes are defined for particle 'electron'
status: Celeritas core setup complete
rank 0: status: Celeritas core state initialization complete
/home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:149:
celeritas: internal assertion failed: new post-step action is out of order: '!(prev_order == params.implicit_order || next_order == params.implicit_order || OrderedAction{prev_order, prev_post_step} < OrderedAction{next_order, next_post_step})'
[ OK ] StatusCheckerTest.device (23 ms)
[----------] 2 tests from StatusCheckerTest (253 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (253 ms total)
[ PASSED ] 2 tests.
/home/pcanal/geant/spack/2025-03-builds/celeritas/test/celeritas/track_StatusChecker: tests PASSED
Failed to destroy stream: CUDA error: unspecified launch failure
</details>
@pcanal I've seen this before; it's due to VecGeom 2 crashing during destruction :( If you run through gdb/lldb you'll see it.
@pcanal I'm working on cleaning up the issues. Did you confirm whether or not this is a crash during VecGeom teardown?
I have neither confirmed nor denied it. I'll attempt to reproduce it today.
I get this:
Celeritas version 0.6.0-rc.2.40+wip.df0b0c3bd
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from StatusCheckerTest
[ RUN ] StatusCheckerTest.host
status: Loading VecGeom geometry from GDML at /home/pcanal/geant/sources/celeritas/test/geocel/data/two-boxes.gdml
status: Initializing tracking information
info: CUDA stack size was changed from 1024 to 8192; restoring to original values
/home/pcanal/geant/sources/celeritas/src/celeritas/phys/PhysicsParams.cc:399: warning: No processes are defined for particle 'electron'
status: Celeritas core setup complete
status: Celeritas core state initialization complete
/home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:88: error: status was improperly reverted: {"geo":{"dir":[0.0,0.0,1.0],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"mat":"Al","particle":{"energy":[8.0,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"along-step-neutral","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"physics-discrete-select","status":"initializing","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
info: Exception during action 1 came from: track slot 15 in kernel 'status-check': {"geo":{"dir":[0.0,0.0,1.0],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"mat":"Al","particle":{"energy":[8.0,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"along-step-neutral","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"physics-discrete-select","status":"initializing","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
/home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:139: error: along-step action cannot yet change: {"geo":{"dir":[0.0,0.0,1.0],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"mat":"Al","particle":{"energy":[8.0,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"pre-step","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"scat-klein-nishina","status":"alive","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
info: Exception during action 4 came from: track slot 15 in kernel 'status-check': {"geo":{"dir":[0.0,0.0,1.0],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"mat":"Al","particle":{"energy":[8.0,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"pre-step","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"scat-klein-nishina","status":"alive","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
/home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:154: error: new post-step action is out of order: {"geo":{"dir":[-0.1808708751922138,-0.14253713853809177,0.9731232659044667],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"mat":"Al","particle":{"energy":[5.594297051238843,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"along-step-neutral","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"physics-discrete-select","status":"alive","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
info: Exception during action 6 came from: track slot 15 in kernel 'status-check': {"geo":{"dir":[-0.1808708751922138,-0.14253713853809177,0.9731232659044667],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"mat":"Al","particle":{"energy":[5.594297051238843,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"along-step-neutral","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"physics-discrete-select","status":"alive","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
[ OK ] StatusCheckerTest.host (91 ms)
[ RUN ] StatusCheckerTest.device
info: Creating 1 device streams
/home/pcanal/geant/sources/celeritas/src/celeritas/phys/PhysicsParams.cc:399: warning: No processes are defined for particle 'electron'
status: Celeritas core setup complete
status: Celeritas core state initialization complete
/home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:150:
celeritas: internal assertion failed: new post-step action is out of order: '!(prev_order == params.implicit_order || next_order == params.implicit_order || OrderedAction{prev_order, prev_post_step} < OrderedAction{next_order, next_post_step})'
[ OK ] StatusCheckerTest.device (5 ms)
[----------] 2 tests from StatusCheckerTest (97 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (97 ms total)
[ PASSED ] 2 tests.
/exp/projects/celeritas/spack/2025-03-builds/celeritas/test/celeritas/track_StatusChecker: tests PASSED
Failed to destroy stream: CUDA error: unspecified launch failure
[Thread 0x7fffe1fff000 (LWP 26764) exited]
[Thread 0x7fffedde9000 (LWP 26763) exited]
[Inferior 1 (process 26759) exited normally]
(gdb) print $_exitcode
$1 = 0
(gdb)
So several error messages, one 'assertion failed' ... and a 0 exit code.
Sooo a heisenbug?
No, it's the same result 100% of the time. It's just a bug hiding in plain sight?
There are two potentially separate issues:
- Why is the test issuing error messages and an assert?
- Why are those errors not seen/reported by ctest? (The first-order cause is that the executable returns 0, so really the question is: why does the executable return 0 with so many failures?)

And a third question:
- If this test fails silently, are there other failing tests that are not being reported?
> Why is the test issuing error messages and an assert?
The status checker test can only test the failure mechanisms: i.e., in the host test it makes sure we throw a DebugError, and on the device it should raise a device assert, which should be caught in our CELER_CUDA_CALL. For context, the test logs the details of the caught assertions to the screen, which I could see being confusing.
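As an illustration, here is a minimal sketch of the host-side pattern (not the actual Celeritas test code; `run_status_check` and the use of `std::logic_error` in place of `celeritas::DebugError` are stand-ins):

```cpp
#include <gtest/gtest.h>

#include <stdexcept>

namespace
{
void run_status_check()
{
    // Stand-in for the checker detecting an inconsistent track state
    throw std::logic_error("new post-step action is out of order");
}
}  // namespace

TEST(StatusCheckerSketch, host)
{
    // The "error" lines in the log above are printed while handling the
    // exception; the test still passes because throwing is the point.
    EXPECT_THROW(run_status_check(), std::logic_error);
}
```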
The weird thing about CUDA assertions is that, until a full device reset, all subsequent CUDA runtime calls will also raise an assertion. During cleanup we see (and catch, and simply log) a warning about the device streams.
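A minimal sketch of that sticky-error behavior, assuming a kernel that deliberately trips a device-side `assert()`:

```cpp
#include <cassert>
#include <cstdio>

#include <cuda_runtime.h>

__global__ void trip_assert()
{
    assert(false);  // device-side assertion failure
}

int main()
{
    trip_assert<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();
    std::printf("after kernel: %s\n", cudaGetErrorString(err));

    // The context is now poisoned: even unrelated calls report a failure,
    // which is why destroying the device streams logs a warning during
    // cleanup.
    void* ptr = nullptr;
    err = cudaMalloc(&ptr, 16);
    std::printf("after cudaMalloc: %s\n", cudaGetErrorString(err));

    cudaDeviceReset();  // only a full reset clears the sticky error
    return 0;
}
```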
So this is all 100% expected and should not lead to a test failure.
> Why are those errors not seen/reported by ctest? (The first-order cause is that the executable returns 0, so really the question is: why does the executable return 0 with so many failures?)
If the test is returning zero then the test is doing its job, everything is passing, and there is no problem.
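For reference, with a conventional GoogleTest main (sketched below; Celeritas's test harness adds its own setup, but presumably the exit-code logic is the same), the process exit code comes solely from `RUN_ALL_TESTS()`, which is what ctest checks:

```cpp
#include <gtest/gtest.h>

int main(int argc, char** argv)
{
    ::testing::InitGoogleTest(&argc, argv);
    // Returns 0 iff every test assertion passed, regardless of what was
    // printed to the log along the way; ctest treats 0 as "Passed".
    return RUN_ALL_TESTS();
}
```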
> If this test fails silently, are there other failing tests that are not being reported?
I think it's succeeding and should be succeeding...
I may be confused, because in January you reported on Slack:
> So on lxplus-gpu with asetup AthSimulation,local/simulation/main_AthSimulation_x86_64-el9-gcc13-opt,2024-01-22T1700, a local static build of Geant4 (commit f7118554a12 + my global static PR), VecGeom v1 (commit eaf16cd97, v1.2.10 + a few commits), and Celeritas (commit 4dbeff8d), with VecGeom and Celeritas built in Debug mode, I needed to increase the stack size to 32K (i.e., CUDA_STACK_SIZE=32768) in order to pass the following tests:
>
> 58 - geocel/vg/Vecgeom:SimpleCmsTest.* (Subprocess aborted)
> 59 - geocel/vg/Vecgeom:FourLevelsTest.* (Subprocess aborted)
> 61 - geocel/vg/Vecgeom:SolidsTest.* (Subprocess aborted)
> 62 - geocel/vg/Vecgeom:CmseTest.* (Subprocess aborted)
> 63 - geocel/vg/Vecgeom:FourLevelsGeantTest.* (Subprocess aborted)
> 65 - geocel/vg/Vecgeom:SolidsGeantTest.* (Subprocess aborted)
> 291 - app/celer-geo:gpu (Timeout)
>
> I still have the following 2 failures:
>
> 186 - celeritas/geo/Geometry (Failed)
> 264 - celeritas/track/StatusChecker (Subprocess aborted)
and I'd seen a similar failure on wildstyle: note that the failure there in StatusChecker is because it aborted.
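(As an aside, the CUDA_STACK_SIZE workaround quoted above amounts to raising the CUDA per-thread stack limit. A minimal sketch of that adjustment via the runtime API; how the environment variable is plumbed into this call is assumed rather than shown:)

```cpp
#include <cstdio>

#include <cuda_runtime.h>

int main()
{
    size_t stack_size = 0;
    cudaDeviceGetLimit(&stack_size, cudaLimitStackSize);
    std::printf("default per-thread stack: %zu bytes\n", stack_size);

    // Debug builds use much deeper device call stacks, so raise the
    // limit to 32 KiB per device thread.
    cudaDeviceSetLimit(cudaLimitStackSize, 32768);
    return 0;
}
```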
The subprocess shouldn't abort, which is what I thought your issue was??
> [... the purpose of the test is to check the error ...]
Silly me, I missed that part :(. So indeed the issue is actually that the test does not properly catch the error and/or fails in some other way in Debug builds. And it seems to come from actual VecGeom asserts:
(gdb) bt
#0 0x00007ffff2c5452c in __pthread_kill_implementation () from /usr/lib64/libc.so.6
#1 0x00007ffff2c07686 in raise () from /usr/lib64/libc.so.6
#2 0x00007ffff2bf1833 in abort () from /usr/lib64/libc.so.6
#3 0x00007ffff2bf175b in __assert_fail_base.cold () from /usr/lib64/libc.so.6
#4 0x00007ffff2c003c6 in __assert_fail () from /usr/lib64/libc.so.6
#5 0x00007ffff3679a6e in vecgeom::cxx::CudaAssertError (err=cudaErrorLaunchFailure)
at /exp/projects/celeritas/pcanal/sources/VecGeom/source/backend/cuda/Interface.cpp:53
#6 0x00007ffff36737dc in vecgeom::cxx::DevicePtrBase::Free (this=0x55555846da40)
OK, easy solution: VecGeom's free shouldn't assert. In fact it should throw, and then VecGeom destructors should catch it.
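Something like the following hypothetical pattern (a sketch, not the actual merge request's diff): report the `cudaFree` failure via an exception, and have the owning destructor catch it so teardown cannot abort:

```cpp
#include <cuda_runtime.h>

#include <iostream>
#include <stdexcept>

// Throw instead of assert()ing so the caller decides how to react
void check_cuda(cudaError_t err)
{
    if (err != cudaSuccess)
    {
        throw std::runtime_error(cudaGetErrorString(err));
    }
}

struct DevicePtr
{
    void* ptr{nullptr};

    ~DevicePtr()
    {
        try
        {
            check_cuda(cudaFree(ptr));  // may fail after a sticky device assert
        }
        catch (std::exception const& e)
        {
            // Destructors must not propagate exceptions: log and move on
            std::cerr << "warning: failed to free device memory: " << e.what()
                      << '\n';
        }
    }
};
```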
@pcanal This should be fixed by https://gitlab.cern.ch/VecGeom/VecGeom/-/merge_requests/1305.
That's since been merged into VecGeom 2, so I'll close this.