stream: implement stream_workq
Pull Request Description
Add a workq-based stream enqueue implementation.
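For readers unfamiliar with the pattern, here is a minimal sketch of a work-queue-based enqueue. It is not the code in this PR: every name (`workq_enqueue`, `workq_progress`, `WORKQ_CAPACITY`, `add_one`) is hypothetical. It is also single-threaded, whereas a real implementation would need synchronization between the enqueuing thread and the progress thread. The idea is that an enqueue call only records the operation and returns, and a separate progress step later runs the recorded operations in FIFO order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WORKQ_CAPACITY 64       /* hypothetical fixed capacity */

typedef void (*work_fn) (void *arg);

struct work_item {
    work_fn fn;                 /* operation to run later */
    void *arg;                  /* opaque argument passed to fn */
};

struct workq {
    struct work_item items[WORKQ_CAPACITY];
    size_t head;                /* index of the next item to run */
    size_t tail;                /* index of the next free slot */
    size_t count;               /* number of pending items */
};

static void workq_init(struct workq *q)
{
    q->head = 0;
    q->tail = 0;
    q->count = 0;
}

/* Record an operation and return immediately; nothing is executed here.
 * Returns false if the queue is full. */
static bool workq_enqueue(struct workq *q, work_fn fn, void *arg)
{
    assert(fn != NULL);
    if (q->count == WORKQ_CAPACITY)
        return false;
    q->items[q->tail].fn = fn;
    q->items[q->tail].arg = arg;
    q->tail = (q->tail + 1) % WORKQ_CAPACITY;
    q->count++;
    return true;
}

/* Run all pending operations in FIFO order.
 * Returns the number of operations executed. */
static int workq_progress(struct workq *q)
{
    int n = 0;
    while (q->count > 0) {
        struct work_item item = q->items[q->head];
        q->head = (q->head + 1) % WORKQ_CAPACITY;
        q->count--;
        item.fn(item.arg);
        n++;
    }
    return n;
}

/* Example operation: increment the int that arg points to. */
static void add_one(void *arg)
{
    (*(int *) arg)++;
}
```

In this sketch, enqueuing `add_one` twice leaves the counter untouched until `workq_progress` is called, which mirrors how enqueued operations are deferred to a progress thread.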
Caveat
The wait kernel blocks allocation and freeing of GPU-registered host buffers, which can result in a deadlock.
It turns out that, at least for CUDA, using an unregistered host buffer for staging is fine. I am not sure how cudaMemcpyAsync deals with unregistered host buffers, but there are no errors! Potentially the copy isn't truly run asynchronously, but non-optimal is better than not working at all.
Avoiding registered host buffers also covers the genq or yaksa pools, since the pools need to allocate slabs. Yaksa therefore needs an option to treat unregistered host buffers the same as registered buffers, and to make its pool use unregistered buffers.
EDIT: we also need to avoid yaksa's lazy stream creation, because stream creation is also locked out by the wait kernel.
[skip warnings]
Author Checklist
- [x] Provide Description Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
- [x] Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
- [ ] Passes All Tests Whitespace checker. Warnings test. Additional tests via comments.
- [x] Contribution Agreement For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.
test:mpich/ch4/most ✔️ test:mpich/ch3/most ✔️ test:mpich/ch4/gpu/ofi ❌ - (typo in testlist)
test:mpich/ch4/gpu/ofi
EDIT:
TIMED OUT. My local computer has cuda-11.2, which worked; Jenkins has cuda-11.1. I wonder whether that makes a difference.
test:mpich/ch4/gpu/ofi
test:mpich/ch4/most test:mpich/ch4/gpu/ofi
test:mpich/ch4/most
test:mpich/ch4/gpu/ofi
The wait kernel has too many issues. This PR passes my local testing but still times out on Jenkins. We should consider using stream memory operations (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEMOP.html). Nevertheless, this PR has already accumulated many commits, including ADI changes and progress thread management. Thus I am adding an xfail entry and pushing for review as is.
test:mpich/ch4/most test:mpich/ch4/gpu/ofi