add queue.wait() to sync interop stream

Open JackAKirk opened this issue 3 years ago • 5 comments

Signed-off-by: JackAKirk [email protected]

Description

This is a bug fix for failures first identified after the multi-stream implementation of the CUDA backend in intel/llvm (failures identified here: https://github.com/oneapi-src/oneMKL/pull/209#issuecomment-1192001880). The failed tests are due to the lack of a stream synchronisation after some cuSolver interop functions, such as cusolverDnSgetrf, are called from lapack::cusolver::getrf. Before the multi-stream implementation, all queues were effectively in-order in the CUDA backend of intel/llvm, so syncing queues that did not have the in_order queue property was not necessary. The fix is to call:

cudaStream_t currentStreamId;
CUSOLVER_ERROR_FUNC(cusolverDnGetStream, err, handle, &currentStreamId);
cuStreamSynchronize(currentStreamId);

after the cuSolver functions in cases where depends_on is used. Since these cuSolver functions are apparently asynchronous, the missing cuStreamSynchronize meant that calls to depends_on were previously not waiting for the cuSolver functions to complete.
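
For reference, a minimal standalone sketch of this pattern with plain CUDA/cuSolver, outside of the oneMKL wrappers (the function and parameter names here are illustrative, not the library's code):

#include <cuda_runtime.h>
#include <cusolverDn.h>

// Enqueue an (asynchronous) cuSolver factorisation, then block until the stream
// it was enqueued on has drained, so later work touching the same memory is safe.
void sgetrf_and_sync(cusolverDnHandle_t handle, int m, int n, float *d_a, int lda,
                     float *d_work, int *d_ipiv, int *d_info) {
    cusolverDnSgetrf(handle, m, n, d_a, lda, d_work, d_ipiv, d_info);

    cudaStream_t stream;
    cusolverDnGetStream(handle, &stream);   // the stream the call above was enqueued on
    cudaStreamSynchronize(stream);          // same effect as cuStreamSynchronize(stream)
}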

JackAKirk avatar Jul 29 '22 17:07 JackAKirk

One solution would be to only sync the stream used by the interop onemkl_cusolver_host_task by calling:

cudaStream_t currentStreamId;
CUSOLVER_ERROR_FUNC(cusolverDnGetStream, err, handle, &currentStreamId);
cuStreamSynchronize(currentStreamId);

at the end of the onemkl_cusolver_host_task.

Would this achieve asynchronous submissions?

However, I find that in all cases so far it is faster (or the same) to simply call queue.wait() (which will sync all streams associated with the queue) instead of using the above interop route to select the single offending stream.

I have only fixed the cases that led to the failed tests so far. However, I think the test coverage probably just doesn't use large enough problems in some other cases to trigger the failure, so it may be a good idea to always call queue.wait() after onemkl_cusolver_host_task is used.

If we go this route, it may make sense to roll the wait call into onemkl_cusolver_host_task itself. I made a temporary PR (#217) illustrating this.

Note that it seems you can't rely on depends_on, since this will not sync the stream used in the interop.
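
For illustration, a rough sketch of the queue.wait() route from the caller's side (submit_cusolver_work here is a stand-in for the real wrapper, not oneMKL's actual API):

#include <sycl/sycl.hpp>
#include <vector>

// Stand-in for onemkl_cusolver_host_task: a host task that would launch
// asynchronous cuSolver work on a native stream via interop.
sycl::event submit_cusolver_work(sycl::queue &q,
                                 const std::vector<sycl::event> &deps) {
    return q.submit([&](sycl::handler &cgh) {
        cgh.depends_on(deps);   // orders command-group start, not native-stream completion
        cgh.host_task([=]() {
            // interop cuSolver call would go here (placeholder)
        });
    });
}

void run(sycl::queue &q) {
    submit_cusolver_work(q, {});
    q.wait();   // the blocking wait discussed above; syncs the queue's streams in intel/llvm,
                // though later comments note SYCL does not guarantee this for native streams
}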

Interesting. Can you provide a reproducer?

ericlars avatar Jul 30 '22 01:07 ericlars

Would this achieve asynchronous submissions?

Actually, just from the observation that it fixes the test failures, I think it must be effectively blocking future submissions (at least those that touch the same memory as the cuSolver function) until the native stream used by the cuSolver functions has finished working. It may be that we observe this blocking behaviour because the context could be created with the CU_CTX_BLOCKING_SYNC flag: see https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html#group__CUDA__TYPES_1gg9f889e28a45a295b5c8ce13aa05f6cd462aebfe6432ade3feb32f1a409027852. For the use cases we are interested in, where the stream touches memory that other streams could also touch, I think we actually want to call queue.wait() anyway, because we don't want any stream to touch that memory until the native stream is finished.

Interesting. Can you provide a reproducer?

The reproducer is the getrf tests that call the getrf function that uses depends_on here: https://github.com/oneapi-src/oneMKL/blob/61312ed98b8208999f99474778d46919c30ef15b/src/lapack/backends/cusolver/cusolver_lapack.cpp#L1350

If the depends_on was syncing the stream then the corresponding tests wouldn't fail.

JackAKirk avatar Aug 01 '22 17:08 JackAKirk

I've now added blocking waits (using queue.wait()) in all the places where they were missing. I think this is required for generally correct behaviour: we should block all streams in a queue from touching memory in a subsequent command group until an earlier submitted command group that touches the same memory is complete. In the cases where depends_on was used, I think the expectation is that it should have achieved a blocking wait, so I'll set up an internal issue to investigate further why depends_on is apparently not working when used with a host task.

JackAKirk avatar Aug 09 '22 10:08 JackAKirk

I've found out that these cuSolver functions are indeed asynchronous, even though the Nvidia documentation implies that they are synchronous; therefore I think that depends_on is behaving correctly. I've also been told that the correct procedure is to always synchronize the native stream directly within the host task if it has been used for asynchronous execution, since it is up to a SYCL implementation to decide what it returns for a native stream, and it is therefore not guaranteed that a call to queue.wait() will actually synchronize the native stream.

Therefore I will update the changes made to synchronize the native stream within the host_task and then use depends_on instead of queue.wait().
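
For clarity, a minimal sketch of the resulting pattern, keeping depends_on for the command-group dependency and synchronising the native stream inside the host task (passing the cusolverDnHandle_t straight into the lambda is an illustration, not the exact oneMKL plumbing around onemkl_cusolver_host_task):

#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <sycl/sycl.hpp>
#include <vector>

// depends_on expresses the dependency between command groups; the stream
// synchronisation inside the host task guarantees the asynchronous cuSolver
// work has finished before the host task (and its SYCL event) completes.
sycl::event sgetrf_sketch(sycl::queue &q, cusolverDnHandle_t handle,
                          const std::vector<sycl::event> &deps,
                          int m, int n, float *d_a, int lda,
                          float *d_work, int *d_ipiv, int *d_info) {
    return q.submit([&](sycl::handler &cgh) {
        cgh.depends_on(deps);
        cgh.host_task([=]() {
            cusolverDnSgetrf(handle, m, n, d_a, lda, d_work, d_ipiv, d_info);

            cudaStream_t stream;
            cusolverDnGetStream(handle, &stream);
            cudaStreamSynchronize(stream);   // drain the native stream before returning
        });
    });
}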

JackAKirk avatar Aug 09 '22 11:08 JackAKirk

I've done this now.

JackAKirk avatar Aug 09 '22 14:08 JackAKirk

@AidanBeltonS could you check this is all OK? Thanks

JackAKirk avatar Aug 19 '22 14:08 JackAKirk

@AidanBeltonS could you check this is all OK? Thanks

LGTM. I think this approach is the simplest way to handle the asynchronous behaviour of the cuSolver functions.

AidanBeltonS avatar Aug 22 '22 08:08 AidanBeltonS

@ericlars what do you think of this solution? I believe it is necessary, as reading through the cuSolver docs it seems these operations execute asynchronously (although in most cases they don't). Docs: https://docs.nvidia.com/cuda/cusolver/index.html#asynchronous-execution

This means the host_task can finish before the cuSolver operation has completed. Calling a synchronising step after the cuSolver call keeps the behaviour as expected. While we could synchronise after the host_task, that would cause issues if additional CUDA code that depends on the cuSolver output needs to be placed after the cuSolver call.
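
As a rough illustration of that last point (a sketch only, with illustrative names, not the library's code): synchronising inside the host task means the outputs can be consumed by follow-up native or host code in the same task.

#include <cuda_runtime.h>
#include <cusolverDn.h>

// Inside a single host task: the cuSolver call is asynchronous, so d_a, d_ipiv
// and d_info are only safe to read once its stream has been synchronised.
int sgetrf_status(cusolverDnHandle_t handle, int m, int n, float *d_a, int lda,
                  float *d_work, int *d_ipiv, int *d_info) {
    cusolverDnSgetrf(handle, m, n, d_a, lda, d_work, d_ipiv, d_info);

    cudaStream_t stream;
    cusolverDnGetStream(handle, &stream);
    cudaStreamSynchronize(stream);   // the factorisation has fully completed here

    int info = 0;   // host-side status check that depends on the cuSolver output
    cudaMemcpy(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost);
    return info;    // 0 on success, >0 if U is singular, <0 for an invalid argument
}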

AidanBeltonS avatar Aug 24 '22 14:08 AidanBeltonS

attaching log: log_llvm_cusolver_.txt

ericlars avatar Sep 02 '22 08:09 ericlars