Remove need to sync Gpu stream before deallocating memory
Summary
Functionality is added to Gpu::Device and CArena to wait until the next stream sync before deallocating memory and to avoid double syncs.
Additional background
Currently, there is a lot of mixing/confusion of what CArena, Device and StreamManager are each meant to do. If delay_memory_free_until_sync is true, sync_before_memory_free has no effect. In the future there could be a single-stream no sync mode for (non host-accessible) device memory.
Checklist
The proposed changes:
- [ ] fix a bug or incorrect behavior in AMReX
- [x] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] include documentation in the code and/or rst files, if appropriate
I am worried about the complexity. It's always hard to reason about threading. So I might have missed something. Suppose there are two threads. Both call gpuStream() and obtain the same stream. Then both launch a gpu kernel and call Gpu::streamSynchronize(). To thread 1, let's suppose what it sees are (1) thread 1 modifies with m_stream_op_id; (2) thread 1 launches a gpu kernel; (3) thread 1 calls Gpu::streamSynchronize() that sees m_stream_op_id modified by thread 0 and eventually calls cudaStreamSynchronize and sets m_last_sync to the latest m_stream_op_id. To thread 0, let's suppose what it sees are (1) thread 1 modifies with m_stream_op_id; (2) thread 0 modifies m_stream_op_id; (3) thread 0 launches a gpu kernel that happens before cudaStreamSynchronize from thread 1; (4) thread 0 calls Gpu::streamSynchronize that does nothing because it sees m_last_sync == m_stream_op_id. In the end thread 0's kernel is not synced.
It is indeed super complicated when used with multiple threads. In that specific example, since the kernel launch from thread 0 happens before cudaStreamSynchronize is called from thread 1, the single stream sync would sync both kernels. However, I now notice a flaw if the thread 0 kernel launch happens after thread 1 calls cudaStreamSynchronize. This would be strange since thread 0 updated m_stream_op_id before thread 1 called Gpu::streamSynchronize() and thread 0 should not really be doing anything that takes time between updating m_stream_op_id and launching the kernel, however it is technically possible and would result in the kernel from thread 0 to not be synced.
since the kernel launch from thread 0 happens before cudaStreamSynchronize is called from thread 1
I meant after.
Hi @AlexanderSinn - do you want to experiment with a different approach, or can we close this?
Yes I am still working on this. Next I will try to give each stream an array with one bool per omp thread to store if the stream is synced.
Could you show some performance data comparing the development branch with this PR?