mpich stream: implement stream

Pull Request Description

Add a workq-based stream enqueue implementation.
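For readers unfamiliar with the pattern, here is a minimal sketch of a generic work queue in plain C. It is not the code in this PR and involves no GPU or MPI calls. All names (`workq`, `workq_enqueue`, `workq_progress`, `add_one`) are hypothetical and chosen only for illustration. The assumption is simply that an enqueue call records work without running it, and a later progress call drains the queue in FIFO order.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is NOT the MPICH implementation. */
typedef void (*workq_fn)(void *arg);

struct workq_item {
    workq_fn fn;                /* work to run later */
    void *arg;                  /* argument passed to fn */
    struct workq_item *next;
};

struct workq {
    struct workq_item *head;    /* oldest pending item */
    struct workq_item *tail;    /* newest pending item */
};

static void workq_init(struct workq *q)
{
    q->head = NULL;
    q->tail = NULL;
}

/* Record the work without running it.
 * Returns 0 on success, -1 if allocation fails. */
static int workq_enqueue(struct workq *q, workq_fn fn, void *arg)
{
    assert(fn != NULL);
    struct workq_item *item = malloc(sizeof(*item));
    if (item == NULL)
        return -1;
    item->fn = fn;
    item->arg = arg;
    item->next = NULL;
    if (q->tail != NULL)
        q->tail->next = item;
    else
        q->head = item;
    q->tail = item;
    return 0;
}

/* Run all pending items in FIFO order.
 * Returns the number of items that were executed. */
static int workq_progress(struct workq *q)
{
    int count = 0;
    while (q->head != NULL) {
        struct workq_item *item = q->head;
        q->head = item->next;
        if (q->head == NULL)
            q->tail = NULL;
        item->fn(item->arg);
        free(item);
        count++;
    }
    return count;
}

/* Example work function: increments the int that arg points to. */
static void add_one(void *arg)
{
    (*(int *) arg)++;
}
```

In this sketch the caller would enqueue `add_one` with a pointer to a counter and see the counter change only after `workq_progress` runs. A real implementation would also need locking or an atomic queue if the enqueue and progress sides run on different threads, which this sketch leaves out.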

Caveat

The wait kernel blocks allocation and freeing of GPU-registered host buffers, which can result in a deadlock.

It turns out that, at least for CUDA, using an unregistered host buffer for staging is fine. I am not sure how cudaMemcpyAsync deals with an unregistered host buffer, but it reports no errors. The copy potentially did not run truly asynchronously, but non-optimal is better than not working at all.

Avoiding registered host buffers also covers the genq and yaksa pools, since the pools need to allocate slabs. Yaksa therefore needs an option to treat unregistered host buffers the same as registered buffers, and to make its pool use unregistered buffers.

EDIT: we also need to avoid yaksa's lazy stream creation because stream creation is also locked out by the wait kernel.

[skip warnings]

Author Checklist

[x] Provide Description: Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
[x] Commits Follow Good Practice: Commits are self-contained and do not do two things at once. Commit message is of the form "module: short description". Commit message explains what's in the commit.
[ ] Passes All Tests: Whitespace checker. Warnings test. Additional tests via comments.
[x] Contribution Agreement: For non-Argonne authors, check contribution agreement. If necessary, request an explicit comment from your company's PR approval manager.

Jun 18 '22 15:06 hzhou

test:mpich/ch4/most ✔️ test:mpich/ch3/most ✔️ test:mpich/ch4/gpu/ofi ❌ - (typo in testlist)

Jul 21 '22 22:07 hzhou

test:mpich/ch4/gpu/ofi

EDIT: TIMED OUT. My local computer has cuda-11.2, which worked; Jenkins has cuda-11.1. I wonder whether that makes a difference.

Jul 25 '22 17:07 hzhou

test:mpich/ch4/gpu/ofi

Aug 08 '22 22:08 hzhou

test:mpich/ch4/most test:mpich/ch4/gpu/ofi

Aug 11 '22 00:08 hzhou

test:mpich/ch4/most

Aug 11 '22 02:08 hzhou

test:mpich/ch4/gpu/ofi

Aug 11 '22 14:08 hzhou

The wait kernel has too many issues. This PR passes my local testing but still times out on Jenkins. We should consider using stream memory operations (https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEMOP.html). Nevertheless, this PR has already accumulated many commits, including ADI changes and progress thread management. Thus I am adding an xfail entry and pushing for review as is.

Aug 11 '22 18:08 hzhou

test:mpich/ch4/most test:mpich/ch4/gpu/ofi

Aug 25 '22 15:08 hzhou

stream: implement stream_workq
