Debug parallel crashes running with multiple streams on Frontier

Open: sethrj opened this issue 1 year ago • 1 comment

We discovered that multithreaded Geant4 runs hang with ROCm 5.7.1 and higher. The problem appears to be either a regression in async memory allocation that introduces a race condition or a bug in Thrust: in some cases, a kernel launch on one thread combined with an async malloc/free on another causes the app to lock up.
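For reference, the access pattern described above can be sketched as a small two-thread stress test. This is a hypothetical sketch, not the actual reproducer from Celeritas or the OLCF tickets: the names `touch_kernel`, `launch_loop`, `alloc_loop`, and `run_stress`, the buffer sizes, and the iteration counts are all assumptions, and nothing here is confirmed to trigger the hang. When built with `hipcc`, one thread launches kernels on its own stream while the other calls `hipMallocAsync`/`hipFreeAsync` on a second stream; with a plain host compiler the same structure falls back to host-only stand-ins so the threading logic can still be checked.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <thread>
#include <vector>

#if defined(__HIPCC__)
#    include <hip/hip_runtime.h>
// Abort on any HIP error so a failure is not mistaken for a hang.
#    define CHECK_HIP(call) \
        do { if ((call) != hipSuccess) std::abort(); } while (0)

// Trivial kernel (hypothetical): just keeps the device busy.
__global__ void touch_kernel(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { data[i] = i; }
}
#endif

constexpr int num_elements = 1 << 16;  // assumed size, chosen arbitrarily
constexpr std::size_t num_bytes = num_elements * sizeof(int);

// Thread A: repeatedly launch a kernel on its own stream.
void launch_loop(int iterations, std::atomic<int>& done)
{
#if defined(__HIPCC__)
    hipStream_t stream;
    CHECK_HIP(hipStreamCreate(&stream));
    int* data = nullptr;
    CHECK_HIP(hipMalloc(reinterpret_cast<void**>(&data), num_bytes));
    for (int i = 0; i < iterations; ++i)
    {
        touch_kernel<<<num_elements / 256, 256, 0, stream>>>(data, num_elements);
        CHECK_HIP(hipStreamSynchronize(stream));
        ++done;
    }
    CHECK_HIP(hipFree(data));
    CHECK_HIP(hipStreamDestroy(stream));
#else
    // Host stand-in: no GPU involved, only the threading structure is kept.
    std::vector<int> data(num_elements);
    for (int i = 0; i < iterations; ++i)
    {
        for (int j = 0; j < num_elements; ++j) { data[j] = j; }
        ++done;
    }
#endif
}

// Thread B: repeatedly allocate and free memory asynchronously.
void alloc_loop(int iterations, std::atomic<int>& done)
{
#if defined(__HIPCC__)
    hipStream_t stream;
    CHECK_HIP(hipStreamCreate(&stream));
    for (int i = 0; i < iterations; ++i)
    {
        void* buf = nullptr;
        CHECK_HIP(hipMallocAsync(&buf, num_bytes, stream));
        CHECK_HIP(hipFreeAsync(buf, stream));
        CHECK_HIP(hipStreamSynchronize(stream));
        ++done;
    }
    CHECK_HIP(hipStreamDestroy(stream));
#else
    for (int i = 0; i < iterations; ++i)
    {
        void* buf = std::malloc(num_bytes);
        if (!buf) { std::abort(); }
        std::free(buf);
        ++done;
    }
#endif
}

// Run both loops concurrently; returns the total number of completed steps.
int run_stress(int iterations)
{
    std::atomic<int> launches{0};
    std::atomic<int> allocs{0};
    std::thread a(launch_loop, iterations, std::ref(launches));
    std::thread b(alloc_loop, iterations, std::ref(allocs));
    a.join();
    b.join();
    return launches.load() + allocs.load();
}
```

If `run_stress(n)` returns, both loops completed all `n` iterations and the result is `2 * n`; a lock-up like the one described would show up as the call never returning. Since the underlying race is timing dependent, a run that completes does not rule the problem out.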

TODO: fill this in from OLCF help tickets

sethrj, Jul 09 '24 01:07

Worked around using #1318.

sethrj, Jul 29 '24 20:07