
StatusChecker test crashes with a debug build of VecGeom

Open pcanal opened this issue 9 months ago • 11 comments

With a Release build of VecGeom, track_StatusChecker fails but is reported by ctest as succeeding (see details below). With a Debug build of VecGeom, the output is similar, but this time it is reported as a failure (at least for me).

Details
$ /home/pcanal/geant/spack/2025-03-builds/celeritas/test/celeritas/track_StatusChecker
Celeritas version 0.6.0-rc.1.11+CudaRdcUtils_v4.a7ef40dfe
Testing on 1 process
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from StatusCheckerTest
[ RUN      ] StatusCheckerTest.host
status: Loading VecGeom geometry from GDML at /home/pcanal/geant/sources/celeritas/test/geocel/data/two-boxes.gdml
status: Initializing tracking information
info: CUDA stack size was changed from 1024 to 8192; restoring to original values
/home/pcanal/geant/sources/celeritas/src/celeritas/phys/PhysicsParams.cc:398: warning: No processes are defined for particle 'electron'
status: Celeritas core setup complete
rank 0: status: Celeritas core state initialization complete
rank 0: /home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:87: error: status was improperly reverted: {"geo":{"dir":[0.0,0.0,1.0],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"particle":{"energy":[8.0,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"along-step-neutral","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"physics-discrete-select","status":"initializing","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
info: Exception during action 1 came from: kernel context: track slot 15 in 'status-check', track 7 of event 0
rank 0: /home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:138: error: along-step action cannot yet change: {"geo":{"dir":[0.0,0.0,1.0],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"particle":{"energy":[8.0,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"pre-step","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"scat-klein-nishina","status":"alive","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
info: Exception during action 4 came from: kernel context: track slot 15 in 'status-check', track 7 of event 0
rank 0: /home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:153: error: new post-step action is out of order: {"geo":{"dir":[-0.1808708751922138,-0.14253713853809177,0.9731232659044667],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"particle":{"energy":[5.594297051238843,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"along-step-neutral","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"physics-discrete-select","status":"alive","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
info: Exception during action 6 came from: kernel context: track slot 15 in 'status-check', track 7 of event 0
[       OK ] StatusCheckerTest.host (230 ms)
[ RUN      ] StatusCheckerTest.device
info: Creating 1 device streams
/home/pcanal/geant/sources/celeritas/src/celeritas/phys/PhysicsParams.cc:398: warning: No processes are defined for particle 'electron'
status: Celeritas core setup complete
rank 0: status: Celeritas core state initialization complete
/home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:149:
celeritas: internal assertion failed: new post-step action is out of order: '!(prev_order == params.implicit_order || next_order == params.implicit_order || OrderedAction{prev_order, prev_post_step} < OrderedAction{next_order, next_post_step})'
[       OK ] StatusCheckerTest.device (23 ms)
[----------] 2 tests from StatusCheckerTest (253 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (253 ms total)
[  PASSED  ] 2 tests.
/home/pcanal/geant/spack/2025-03-builds/celeritas/test/celeritas/track_StatusChecker: tests PASSED
Failed to destroy stream: CUDA error: unspecified launch failure



pcanal avatar Mar 20 '25 22:03 pcanal

@pcanal I've seen this before; it's due to VecGeom 2 crashing during destruction :( If you run it through gdb/lldb you'll see it.

sethrj avatar Mar 28 '25 13:03 sethrj

@pcanal I'm working on cleaning up the issues. Did you confirm whether or not this is a crash during VecGeom teardown?

sethrj avatar Apr 17 '25 11:04 sethrj

I have neither confirmed nor denied it. I'll attempt to reproduce it today.

pcanal avatar Apr 17 '25 15:04 pcanal

I get this:

Celeritas version 0.6.0-rc.2.40+wip.df0b0c3bd
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from StatusCheckerTest
[ RUN      ] StatusCheckerTest.host
status: Loading VecGeom geometry from GDML at /home/pcanal/geant/sources/celeritas/test/geocel/data/two-boxes.gdml
status: Initializing tracking information
info: CUDA stack size was changed from 1024 to 8192; restoring to original values
/home/pcanal/geant/sources/celeritas/src/celeritas/phys/PhysicsParams.cc:399: warning: No processes are defined for particle 'electron'
status: Celeritas core setup complete
status: Celeritas core state initialization complete
/home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:88: error: status was improperly reverted: {"geo":{"dir":[0.0,0.0,1.0],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"mat":"Al","particle":{"energy":[8.0,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"along-step-neutral","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"physics-discrete-select","status":"initializing","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
info: Exception during action 1 came from: track slot 15 in kernel 'status-check': {"geo":{"dir":[0.0,0.0,1.0],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"mat":"Al","particle":{"energy":[8.0,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"along-step-neutral","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"physics-discrete-select","status":"initializing","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
/home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:139: error: along-step action cannot yet change: {"geo":{"dir":[0.0,0.0,1.0],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"mat":"Al","particle":{"energy":[8.0,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"pre-step","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"scat-klein-nishina","status":"alive","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
info: Exception during action 4 came from: track slot 15 in kernel 'status-check': {"geo":{"dir":[0.0,0.0,1.0],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"mat":"Al","particle":{"energy":[8.0,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"pre-step","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"scat-klein-nishina","status":"alive","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
/home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:154: error: new post-step action is out of order: {"geo":{"dir":[-0.1808708751922138,-0.14253713853809177,0.9731232659044667],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"mat":"Al","particle":{"energy":[5.594297051238843,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"along-step-neutral","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"physics-discrete-select","status":"alive","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
info: Exception during action 6 came from: track slot 15 in kernel 'status-check': {"geo":{"dir":[-0.1808708751922138,-0.14253713853809177,0.9731232659044667],"is_on_boundary":false,"is_outside":false,"pos":[[0.0,0.0,0.0],"cm"],"volume_id":"inner"},"mat":"Al","particle":{"energy":[5.594297051238843,"MeV"],"particle_id":"gamma"},"sim":{"along_step_action":"along-step-neutral","event_id":0,"num_steps":0,"parent_id":-1,"post_step_action":"physics-discrete-select","status":"alive","step_length":[8.012297598897574,"cm"],"time":[0.0,"s"],"track_id":7},"thread_id":15,"track_slot_id":15}
[       OK ] StatusCheckerTest.host (91 ms)
[ RUN      ] StatusCheckerTest.device
info: Creating 1 device streams
/home/pcanal/geant/sources/celeritas/src/celeritas/phys/PhysicsParams.cc:399: warning: No processes are defined for particle 'electron'
status: Celeritas core setup complete
status: Celeritas core state initialization complete
/home/pcanal/geant/sources/celeritas/src/celeritas/track/detail/StatusCheckExecutor.hh:150:
celeritas: internal assertion failed: new post-step action is out of order: '!(prev_order == params.implicit_order || next_order == params.implicit_order || OrderedAction{prev_order, prev_post_step} < OrderedAction{next_order, next_post_step})'
[       OK ] StatusCheckerTest.device (5 ms)
[----------] 2 tests from StatusCheckerTest (97 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (97 ms total)
[  PASSED  ] 2 tests.
/exp/projects/celeritas/spack/2025-03-builds/celeritas/test/celeritas/track_StatusChecker: tests PASSED
Failed to destroy stream: CUDA error: unspecified launch failure
[Thread 0x7fffe1fff000 (LWP 26764) exited]
[Thread 0x7fffedde9000 (LWP 26763) exited]
[Inferior 1 (process 26759) exited normally]
(gdb) print $_exitcode
$1 = 0
(gdb)

pcanal avatar Apr 17 '25 18:04 pcanal

So several error messages, one 'assertion failed' ... and a 0 exit code.

pcanal avatar Apr 17 '25 19:04 pcanal

Sooo a heisenbug?

sethrj avatar Apr 17 '25 19:04 sethrj

No, it's the same result 100% of the time. It's just a bug hiding in plain sight?

There are two separate potential issues:

  1. Why is the test issuing error messages and an assert?
  2. Why are those errors not seen/reported by ctest? (The first-order cause is that the executable returns 0, so the real question is: why does the executable return 0 despite so many failures?)

and a third question:

  3. If this test fails silently, are there any other failing tests that are not being reported?

pcanal avatar Apr 17 '25 19:04 pcanal

  1. Why is the test issuing error messages and an assert?

The status checker test can only test the failure mechanisms: i.e., in the host test it makes sure we throw a DebugError, and on the device it should raise a device-side assert, which should be caught in our CELER_CUDA_CALL. As context for the test, we're logging the details of the caught assertions to the screen, which I can see being confusing.
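As a concrete illustration of that host-side pattern, here is a minimal GoogleTest sketch (not the actual Celeritas test; DebugError and run_status_check_with_bad_state are hypothetical stand-ins): the test passes only when the failure path actually throws.

```cpp
// Minimal sketch, assuming GoogleTest; the types and function below are
// hypothetical stand-ins, not Celeritas code.
#include <gtest/gtest.h>
#include <stdexcept>

// Stand-in for the debug-assertion exception type.
struct DebugError : std::logic_error
{
    using std::logic_error::logic_error;
};

// Stand-in for driving the status checker into an invalid state.
void run_status_check_with_bad_state()
{
    // The real executor detects e.g. an improperly reverted track status
    // and raises a debug assertion; here we just throw directly.
    throw DebugError("status was improperly reverted");
}

TEST(StatusCheckerSketch, host)
{
    // The failure path *is* the thing under test: passing means the
    // expected exception was raised.
    EXPECT_THROW(run_status_check_with_bad_state(), DebugError);
}

int main(int argc, char** argv)
{
    ::testing::InitGoogleTest(&argc, argv);
    return RUN_ALL_TESTS();
}
```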

The weird thing about CUDA assertions is that all subsequent CUDA runtime calls before a full device reset will also raise an assertion. During cleanup we see (and catch and simply log) a warning about the device streams.
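A standalone CUDA sketch of that "sticky error" behavior (assuming nvcc and device asserts enabled; this is not Celeritas code): once a device-side assert fires, unrelated runtime calls keep reporting the error until the device is reset, which is why stream destruction during teardown also logs a failure.

```cuda
// Minimal sketch of sticky CUDA errors after a device-side assert.
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void always_assert()
{
    assert(false);  // device-side assert: the launch ends in cudaErrorAssert
}

int main()
{
    always_assert<<<1, 1>>>();
    cudaError_t err = cudaDeviceSynchronize();
    std::printf("after kernel: %s\n", cudaGetErrorString(err));

    // An unrelated runtime call now also reports the sticky assertion error...
    void* ptr = nullptr;
    err = cudaMalloc(&ptr, 16);
    std::printf("cudaMalloc:   %s\n", cudaGetErrorString(err));

    // ...and only a full device reset clears the state.
    cudaDeviceReset();
    return 0;
}
```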

So this is all 100% expected and should not lead to a test failure.

  2. Why are those errors not seen/reported by ctest? (The first-order cause is that the executable returns 0, so the real question is: why does the executable return 0 despite so many failures?)

If the test is returning zero then the test is doing its job, everything is passing, and there is no problem.

  3. If this test fails silently, are there any other failing tests that are not being reported?

I think it's succeeding and should be succeeding...


I may be confused because in January you reported on Slack:

So on lxplus-gpu with asetup AthSimulation,local/simulation/main_AthSimulation_x86_64-el9-gcc13-opt,2024-01-22T1700, a local static build of Geant4 (commit f7118554a12 + my global static PR), VecGeom v1 (commit eaf16cd97, v1.2.10 + a few commits), and Celeritas (commit 4dbeff8d), with VecGeom and Celeritas built in Debug mode, I needed to increase the stack size to 32K (i.e. CUDA_STACK_SIZE=32768) in order to pass the following tests:

         58 - geocel/vg/Vecgeom:SimpleCmsTest.* (Subprocess aborted)
         59 - geocel/vg/Vecgeom:FourLevelsTest.* (Subprocess aborted)
         61 - geocel/vg/Vecgeom:SolidsTest.* (Subprocess aborted)
         62 - geocel/vg/Vecgeom:CmseTest.* (Subprocess aborted)
         63 - geocel/vg/Vecgeom:FourLevelsGeantTest.* (Subprocess aborted)
         65 - geocel/vg/Vecgeom:SolidsGeantTest.* (Subprocess aborted)
        291 - app/celer-geo:gpu (Timeout)

I still have the following 2 failures:

        186 - celeritas/geo/Geometry (Failed)
        264 - celeritas/track/StatusChecker (Subprocess aborted)

and I'd seen a similar failure on wildstyle: note that the failure there in StatusChecker is because it aborted.

The subprocess shouldn't abort, which is what I thought your issue was??

sethrj avatar Apr 18 '25 01:04 sethrj

[ ... the purpose of the test is to check the error ... ]

Silly me, I missed that part :(. So indeed the issue is actually that the test does not properly catch the error and/or fails in some other way in Debug builds. And it seems to come from actual asserts in VecGeom:

(gdb) bt
#0  0x00007ffff2c5452c in __pthread_kill_implementation () from /usr/lib64/libc.so.6
#1  0x00007ffff2c07686 in raise () from /usr/lib64/libc.so.6
#2  0x00007ffff2bf1833 in abort () from /usr/lib64/libc.so.6
#3  0x00007ffff2bf175b in __assert_fail_base.cold () from /usr/lib64/libc.so.6
#4  0x00007ffff2c003c6 in __assert_fail () from /usr/lib64/libc.so.6
#5  0x00007ffff3679a6e in vecgeom::cxx::CudaAssertError (err=cudaErrorLaunchFailure)
    at /exp/projects/celeritas/pcanal/sources/VecGeom/source/backend/cuda/Interface.cpp:53
#6  0x00007ffff36737dc in vecgeom::cxx::DevicePtrBase::Free (this=0x55555846da40)

pcanal avatar Apr 18 '25 17:04 pcanal

OK, easy solution: VecGeom's free shouldn't assert. In fact it should throw, and then VecGeom's destructors should catch it.
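A minimal C++ sketch of that suggestion (hypothetical names throughout; this is not VecGeom's actual API): convert the CUDA error into an exception at the point of deallocation, and have the owning destructor catch and log it rather than aborting.

```cpp
// Sketch of "throw from free, catch in the destructor" under assumed names.
#include <cuda_runtime.h>
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical helper: report a CUDA error as an exception, not an assert.
void throw_on_cuda_error(cudaError_t err, char const* what)
{
    if (err != cudaSuccess)
    {
        throw std::runtime_error(std::string(what) + ": "
                                 + cudaGetErrorString(err));
    }
}

class DeviceBuffer
{
  public:
    explicit DeviceBuffer(std::size_t bytes)
    {
        throw_on_cuda_error(cudaMalloc(&ptr_, bytes), "cudaMalloc");
    }

    ~DeviceBuffer()
    {
        try
        {
            // cudaFree may report a *previous* sticky error (e.g. a launch
            // failure); a destructor must not let that escape or abort.
            throw_on_cuda_error(cudaFree(ptr_), "cudaFree");
        }
        catch (std::exception const& e)
        {
            std::cerr << "warning: " << e.what() << std::endl;
        }
    }

  private:
    void* ptr_ = nullptr;
};
```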

sethrj avatar Apr 18 '25 23:04 sethrj

@pcanal This should be fixed by https://gitlab.cern.ch/VecGeom/VecGeom/-/merge_requests/1305.

sethrj avatar Apr 20 '25 11:04 sethrj

That's since been merged into VecGeom 2, so I'll close this.

sethrj avatar Jun 19 '25 14:06 sethrj