oneMKL
add queue.wait() to sync interop stream
Signed-off-by: JackAKirk [email protected]
Description
This is a bug fix for test failures first seen after the multi-stream implementation landed in the CUDA backend of intel/llvm (failures reported here: https://github.com/oneapi-src/oneMKL/pull/209#issuecomment-1192001880):
The failed tests are due to a missing stream synchronisation after some cuSOLVER interop functions, such as cusolverDnSgetrf, are called from lapack::cusolver::getrf. Before the multi-stream implementation, all queues in the CUDA backend of intel/llvm were effectively in-order, so synchronising queues that did not have the in_order queue property was not necessary.
The fix is to call:
```cpp
cudaStream_t currentStreamId;
CUSOLVER_ERROR_FUNC(cusolverDnGetStream, err, handle, &currentStreamId);
cuStreamSynchronize(currentStreamId);
```
after the cuSOLVER functions in cases where depends_on is used. Since these cuSOLVER functions are apparently asynchronous, the missing cuStreamSynchronize meant that calls to depends_on were previously not waiting for the cuSOLVER functions to finish executing.
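For context, here is a minimal sketch of the problematic pattern (hypothetical names, not the literal oneMKL code); `handle` is assumed to be a cusolverDnHandle_t already bound to the queue's native CUDA stream, as oneMKL's handle helpers arrange, and error checking is omitted:

```cpp
#include <sycl/sycl.hpp>
#include <cusolverDn.h>

// Hypothetical free function, not part of oneMKL: submits a host_task that
// runs cusolverDnSgetrf via the cuSOLVER interop handle.
sycl::event getrf_sketch(sycl::queue &queue, cusolverDnHandle_t handle, int m, int n,
                         float *a, int lda, float *workspace, int *ipiv, int *dev_info) {
    return queue.submit([&](sycl::handler &cgh) {
        cgh.host_task([=]() {
            // Enqueues the factorisation on the handle's stream and may return
            // before the device work has finished, so the SYCL event for this
            // host_task can complete while getrf is still running on the GPU.
            // The fix is to synchronise that stream after this call.
            cusolverDnSgetrf(handle, m, n, a, lda, workspace, ipiv, dev_info);
        });
    });
}
```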
One solution would be to only sync the stream used by the interop onemkl_cusolver_host_task by calling:
```cpp
cudaStream_t currentStreamId;
CUSOLVER_ERROR_FUNC(cusolverDnGetStream, err, handle, &currentStreamId);
cuStreamSynchronize(currentStreamId);
```
at the end of the onemkl_cusolver_host_task.
Would this achieve asynchronous submissions?
However, I find that in all cases so far it is faster (or the same) to simply call queue.wait() (which will sync all streams associated with the queue) instead of using the above interop route to select the single offending stream. I have only fixed the cases that led to the failed tests so far. However, I think the test coverage is probably just not using large enough problems in some other cases to trigger the failure, so it may be a good idea to always call queue.wait() after onemkl_cusolver_host_task is used.
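Continuing the sketch above, this placement would look roughly like the following (illustrative names, not the exact oneMKL internals):

```cpp
// Submit the interop host_task as before, then block on the whole queue so no
// other stream can touch the same memory until the cuSOLVER work has finished.
auto done = queue.submit([&](sycl::handler &cgh) {
    cgh.depends_on(dependencies);
    cgh.host_task([=]() {
        cusolverDnSgetrf(handle, m, n, a, lda, workspace, ipiv, dev_info);
    });
});
queue.wait();  // synchronises all native streams associated with the queue
return done;
```

Note that queue.wait() blocks until every command submitted to the queue has completed, not just this host_task.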
If we go this route, it may be simplest to roll the wait call into onemkl_cusolver_host_task. I made a temporary PR (#217) illustrating this.
Note that it seems you can't rely on depends_on, since this will not sync the stream used in the interop.
Interesting. Can you provide a reproducer?
> Would this achieve asynchronous submissions?
Actually, just from the observation that it fixes the test failures, I think it must be effectively blocking future submissions (at least those that touch the same memory as the cuSOLVER function) until the native stream used by the cuSOLVER functions has finished its work. (We may observe this blocking behaviour because the context could be created with the CU_CTX_BLOCKING_SYNC flag: see https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html#group__CUDA__TYPES_1gg9f889e28a45a295b5c8ce13aa05f6cd462aebfe6432ade3feb32f1a409027852.) For the use cases we are interested in, where the stream touches memory that other streams could also touch, I think we actually want to call queue.wait() anyway, because we don't want any stream to touch that memory until the native stream is finished.
> Interesting. Can you provide a reproducer?
The reproducer is the getrf tests that call the getrf function using depends_on here: https://github.com/oneapi-src/oneMKL/blob/61312ed98b8208999f99474778d46919c30ef15b/src/lapack/backends/cusolver/cusolver_lapack.cpp#L1350
If depends_on were syncing the stream, the corresponding tests wouldn't fail.
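A conceptual sketch of that failure mode, using the illustrative names from above (this is not the actual test code, and consume_factorization is a hypothetical consumer):

```cpp
sycl::event e1 = queue.submit([&](sycl::handler &cgh) {
    cgh.host_task([=]() {
        // Returns as soon as the factorisation is *enqueued* on the handle's stream.
        cusolverDnSgetrf(handle, m, n, a, lda, workspace, ipiv, dev_info);
    });
});
queue.submit([&](sycl::handler &cgh) {
    cgh.depends_on(e1);  // satisfied when the host_task returns...
    cgh.host_task([=]() {
        // ...which may be before the factorisation of `a` has completed on the
        // device, so this hypothetical consumer can race with it.
        consume_factorization(a, ipiv);
    });
});
```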
I've now added blocking waits (using queue.wait()) to all places where they were missing. I think this is required for generally correct behaviour: we should block all streams in a queue from touching memory in a subsequent command group until an initially submitted command group that touches the same memory is complete.
In the cases where depends_on was used I think that the expectation is that this should have achieved a blocking wait, so I'll set up an internal issue to investigate further why depends_on is apparently not working when used with a host task.
I've found out that these cuSOLVER functions are apparently asynchronous, even though the Nvidia documentation implies that they are synchronous; therefore I think that depends_on is behaving correctly. Also, I've been told that the correct procedure is to always synchronize the native stream directly within the host task if it has been used for asynchronous execution, since it is up to a SYCL implementation to decide what it returns for a native stream, and therefore it may not be guaranteed that a call to queue.wait() will actually synchronize that native stream.
Therefore I will update the changes so that the native stream is synchronized within the host_task, and then use depends_on instead of queue.wait().
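A minimal sketch of that final shape, continuing the illustrative example from above (oneMKL's error-checking macros are omitted):

```cpp
sycl::event done = queue.submit([&](sycl::handler &cgh) {
    cgh.depends_on(dependencies);
    cgh.host_task([=]() {
        cusolverDnSgetrf(handle, m, n, a, lda, workspace, ipiv, dev_info);

        // Synchronize the native stream inside the host_task, so the returned
        // event is only complete once the device work has actually finished.
        cudaStream_t currentStreamId;
        cusolverDnGetStream(handle, &currentStreamId);
        cuStreamSynchronize(currentStreamId);
    });
});
// Downstream command groups can now rely on depends_on(done) instead of a
// blocking queue.wait(), leaving unrelated streams free to run concurrently.
```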
I've done this now.
@AidanBeltonS could you check this is all OK? Thanks
LGTM. I think this approach is the simplest for handling cuSolver function asynchronous behaviour.
@ericlars what do you think of this solution? I believe this is necessary: reading through the cuSolver docs, it seems these operations may execute asynchronously, although in most cases they don't. Docs: https://docs.nvidia.com/cuda/cusolver/index.html#asynchronous-execution
This means the host_task can finish before the cuSolver operation has completed, so adding a synchronising step after the cuSolver call keeps behaviour as expected. While we could synchronise after the host_task instead, that would cause issues if additional CUDA code that depends on the cuSolver output needs to be placed after the cuSolver call.
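For example, here is a sketch of that kind of follow-up work, using the illustrative names from the sketches above (the dev_info check is hypothetical, not existing oneMKL code):

```cpp
cgh.host_task([=]() {
    cusolverDnSgetrf(handle, m, n, a, lda, workspace, ipiv, dev_info);

    cudaStream_t stream;
    cusolverDnGetStream(handle, &stream);

    // Follow-up CUDA work that depends on getrf's output can be enqueued on
    // the same stream and is ordered after the factorisation...
    int host_info = 0;
    cudaMemcpyAsync(&host_info, dev_info, sizeof(int), cudaMemcpyDeviceToHost, stream);

    // ...but reading host_info requires the stream to be drained here, inside
    // the host_task, not only after the host_task has already returned.
    cuStreamSynchronize(stream);
    if (host_info != 0) {
        // report a failed factorisation
    }
});
```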
attaching log: log_llvm_cusolver_.txt