celeritas
Debug parallel crashes running with multiple streams on Frontier
We discovered that multithreaded Geant4 runs hang with ROCm 5.7.1 and higher. The problem appears to be a regression in the async memory allocation that results in a race condition, or possibly a bug in Thrust: we've seen cases where a kernel launch on one thread and an async malloc/free on another cause the app to lock up.
TODO: fill this in from OLCF help tickets
Worked around using #1318.