amrex Remove need to sync Gpu stream before deallocating memory

Summary

Functionality is added to Gpu::Device and CArena to wait until the next stream sync before deallocating memory and to avoid double syncs.

Additional background

Currently, there is a lot of mixing/confusion of what CArena, Device and StreamManager are each meant to do. If delay_memory_free_until_sync is true, sync_before_memory_free has no effect. In the future there could be a single-stream no sync mode for (non host-accessible) device memory.

Checklist

The proposed changes:

[ ] fix a bug or incorrect behavior in AMReX
[x] add new capabilities to AMReX
[ ] changes answers in the test suite to more than roundoff level
[ ] are likely to significantly affect the results of downstream AMReX users
[ ] include documentation in the code and/or rst files, if appropriate

Apr 27 '25 17:04 AlexanderSinn

I am worried about the complexity. It's always hard to reason about threading. So I might have missed something. Suppose there are two threads. Both call gpuStream() and obtain the same stream. Then both launch a gpu kernel and call Gpu::streamSynchronize(). To thread 1, let's suppose what it sees are (1) thread 1 modifies with m_stream_op_id; (2) thread 1 launches a gpu kernel; (3) thread 1 calls Gpu::streamSynchronize() that sees m_stream_op_id modified by thread 0 and eventually calls cudaStreamSynchronize and sets m_last_sync to the latest m_stream_op_id. To thread 0, let's suppose what it sees are (1) thread 1 modifies with m_stream_op_id; (2) thread 0 modifies m_stream_op_id; (3) thread 0 launches a gpu kernel that happens before cudaStreamSynchronize from thread 1; (4) thread 0 calls Gpu::streamSynchronize that does nothing because it sees m_last_sync == m_stream_op_id. In the end thread 0's kernel is not synced.

Jun 04 '25 16:06 WeiqunZhang

It is indeed super complicated when used with multiple threads. In that specific example, since the kernel launch from thread 0 happens before cudaStreamSynchronize is called from thread 1, the single stream sync would sync both kernels. However, I now notice a flaw if the thread 0 kernel launch happens after thread 1 calls cudaStreamSynchronize. This would be strange since thread 0 updated m_stream_op_id before thread 1 called Gpu::streamSynchronize() and thread 0 should not really be doing anything that takes time between updating m_stream_op_id and launching the kernel, however it is technically possible and would result in the kernel from thread 0 to not be synced.

Jun 04 '25 17:06 AlexanderSinn

since the kernel launch from thread 0 happens before cudaStreamSynchronize is called from thread 1

I meant after.

Jun 04 '25 17:06 WeiqunZhang

Hi @AlexanderSinn - do you want to experiment with a different approach, or can we close this?

Jun 11 '25 17:06 atmyers

Yes I am still working on this. Next I will try to give each stream an array with one bool per omp thread to store if the stream is synced.

Jun 11 '25 20:06 AlexanderSinn

Could you show some performance data comparing the development branch with this PR?

Aug 08 '25 00:08 WeiqunZhang