Use CUDA's asynchronous copy calls in the CudaExecutor class
A way to obtain better performance out of CUDA is to make use of multiple streams in order to overlap computations with each other, or at least with copies (in/out) unrelated to the currently running operation.
Potential of this technique for the current solvers in Ginkgo
This problem arises in Ginkgo, as can be seen in the following screenshot taken from PR #168, where we see significant white space (idle time) surrounding the data exchange between CPU and device (both ways) for the stopping criterion and the memory copy for the identity matrix preconditioner application.
What follows are two points evaluating the benefits of such a technique with the current status of Ginkgo. Note that this is limited to the currently CUDA-enhanced solvers and there may be greater benefits at a later date.
- the preconditioner-related memory copy could be overlapped with computation in some cases. For BiCGSTAB, one of the preconditioner applications could be overlapped with the previous criterion check. It is to be noted though that currently, in many cases, such an overlap isn't readily available in other solvers due to strong consecutive data dependencies (read after write), including with operations from the previous iteration.
- for the white space related to the criterion check, some of it could be avoided: there are currently two synchronizations tied to the boolean copy, one for copying the data from the host to the device and one from the device to the host. At least one of them could be removed by using asynchronous copies, but a strong synchronization between the host and the device is still required in order to do the criterion check. This last synchronization could be removed if we had a loose checking of the stopping criterion (but that also requires backing up the solution vector and the residual for multiple iterations). Such a technique would allow processing all solvers fully asynchronously at the expense of a few extra computations once convergence has been reached, due to a lack of reactivity. A minimal sketch of the asynchronous flag copy is given after this list.
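To illustrate the boolean copy mentioned above, here is a minimal sketch, assuming a hypothetical device-side convergence flag (dev_converged) and a dedicated copy stream, of how the device-to-host synchronization could be deferred with an asynchronous copy:

#include <cuda_runtime.h>

// Hypothetical names: dev_converged lives in device memory, host_converged is
// pinned host memory (cudaMallocHost) so the copy can actually run asynchronously.
bool *host_converged;
bool *dev_converged;
cudaStream_t copy_stream;

void check_criterion()
{
    // enqueue the device-to-host copy of the flag without blocking the host
    cudaMemcpyAsync(host_converged, dev_converged, sizeof(bool),
                    cudaMemcpyDeviceToHost, copy_stream);
    // ... other work can still be enqueued on other streams here ...
    // the strong synchronization is only needed right before the host reads the flag
    cudaStreamSynchronize(copy_stream);
    if (*host_converged) {
        // stop the iteration
    }
}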
In general, because Ginkgo is very synchronous in its current way of working and has several strong synchronization points, simply switching to asynchronous copies may only have a limited impact on performance without partially redesigning the algorithms to expose some concurrency and/or making use of asynchronous convergence checking. It could therefore be argued that such a change to asynchronous memory transfers should happen as part of a more global effort to parallelize Ginkgo asynchronously rather than as a standalone improvement.
CUDA's APIs helping manage asynchronous computations and communications
There are two APIs available in CUDA for this:
- The legacy cudaStream and cudaEvent APIs.
- Since CUDA 10.0, the cudaGraph API.
Although the cudaGraph API looks easier to use and provides a clearer model, using this API would restrict Ginkgo to CUDA 10.0 or newer, which isn't feasible right now as this is the latest version of CUDA. In two or three CUDA releases this may become more viable.
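For reference, here is a minimal sketch, assuming a recent CUDA 10.x toolkit, of how a copy-kernel-copy sequence could be captured and replayed with the cudaGraph API; the kernel and the buffers are placeholders:

#include <cuda_runtime.h>

__global__ void stage_kernel(double *x) { /* placeholder computation */ }

void run_with_graph(double *dev_x, const double *host_x, double *host_y,
                    size_t n, cudaStream_t stream)
{
    cudaGraph_t graph;
    cudaGraphExec_t graph_exec;

    // record the copy-kernel-copy sequence into a graph instead of executing it
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cudaMemcpyAsync(dev_x, host_x, n * sizeof(double),
                    cudaMemcpyHostToDevice, stream);
    stage_kernel<<<1, 256, 0, stream>>>(dev_x);
    cudaMemcpyAsync(host_y, dev_x, n * sizeof(double),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamEndCapture(stream, &graph);

    // the instantiated graph can then be launched repeatedly with low overhead
    cudaGraphInstantiate(&graph_exec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(graph_exec, stream);
    cudaStreamSynchronize(stream);
}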
Models of asynchronous computations for Ginkgo
What follows are some implementation ideas. This is only a partial exposition and more thought and discussion may be required before practical use.
The heart of the problem is situated in the CudaExecutor, which implements the actual CUDA memory transfers. The model we could go with would be the following: one stream for transfers from host to device (input stream), one for kernel launches and one for transfers from device to host (output stream). This model gives the potential of overlapping either of the memory transfers with the kernel launches while keeping the kernel launches synchronous with respect to each other. Nonetheless, explicit synchronizations are necessary:
- The kernel needs to wait on its input transfers
- The output stream needs to wait on its respective kernel's execution
- Finally, the host (CPU thread) needs to wait on the output stream's results (such as for the synchronous stopping criterion)
Due to this, it is necessary to identify multiple cudaEvents, one after the submission of all the kernel's inputs, one after the submission of the kernel and one after the submission of the output transfers, and to call cudaStreamWaitEvent on them at the correct time (a minimal sketch of this pattern follows the list below). The burden of calling the interface functions for these algorithms would fall onto the user. Therefore, all the CUDA algorithms would need to be rewritten for:
- Using streams
- Adding the correct synchronization points by creating CUDA events, recording them and waiting on them
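A minimal sketch of this three-stream pattern, with placeholder buffer and kernel names, could look as follows (not tied to Ginkgo's current kernel interfaces):

#include <cuda_runtime.h>

__global__ void my_kernel(const double *in, double *out) { /* placeholder */ }

void launch_overlapped(const double *host_in, double *dev_in, double *dev_out,
                       double *host_out, size_t n,
                       cudaStream_t in_stream, cudaStream_t kernel_stream,
                       cudaStream_t out_stream)
{
    cudaEvent_t inputs_ready, kernel_done;
    cudaEventCreateWithFlags(&inputs_ready, cudaEventDisableTiming);
    cudaEventCreateWithFlags(&kernel_done, cudaEventDisableTiming);

    // host-to-device transfers go on the input stream
    cudaMemcpyAsync(dev_in, host_in, n * sizeof(double),
                    cudaMemcpyHostToDevice, in_stream);
    cudaEventRecord(inputs_ready, in_stream);

    // the kernel stream only waits for its own inputs, not for unrelated transfers
    cudaStreamWaitEvent(kernel_stream, inputs_ready, 0);
    const int block_size = 256;
    const int num_blocks = static_cast<int>((n + block_size - 1) / block_size);
    my_kernel<<<num_blocks, block_size, 0, kernel_stream>>>(dev_in, dev_out);
    cudaEventRecord(kernel_done, kernel_stream);

    // the output stream waits for the kernel before copying the results back
    cudaStreamWaitEvent(out_stream, kernel_done, 0);
    cudaMemcpyAsync(host_out, dev_out, n * sizeof(double),
                    cudaMemcpyDeviceToHost, out_stream);

    // the host only synchronizes when it actually needs the results
    cudaStreamSynchronize(out_stream);

    cudaEventDestroy(inputs_ready);
    cudaEventDestroy(kernel_done);
}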
Personal conclusion
My personal conclusion for the moment, unless I missed something obvious, is that this is overly complicated for little improvement due to the current limitations of Ginkgo. I think going in this direction would make more sense when we design a framework for multi-GPU execution.
@tcojean thanks for the comprehensive description - I mostly agree with everything said. Here are some comments and suggestions:
I agree that just focusing on replacing the synchronous API with the asynchronous one does not make a lot of sense, and probably won't give us noticeable improvements. There are several issues with the programming / machine model currently used in Ginkgo which should be fixed before or in conjunction with adding asynchronous behavior.
Designing the interface for asynchronous communication
> Due to this, it is necessary to identify multiple cudaEvents, one after the submission of all the kernel's inputs, one after the submission of the kernel and one after the submission of the output transfers
The implementation and interface where we have only 1 event after all the data has been transferred (and thus, 1 dependency between data and computation) relies heavily on the fact that the underlying implementation uses the stream API, and that there is implicit synchronization between consecutive memory copies since they are scheduled on the same stream. I am not sure that the stream API is the future, and we should probably develop an interface that can be implemented using either the stream or the graph API. That can be done by explicitly adding a dependency from every input to the kernel. Of course, we should benchmark this and see whether synchronizing multiple events has noticeable overhead as opposed to synchronizing only one event.
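With the stream API, expressing one explicit dependency per input could look like the following sketch (names are placeholders, assuming one event is recorded after each input transfer):

#include <vector>
#include <cuda_runtime.h>

// Make the kernel stream wait on one event per input transfer, so the
// dependency structure is explicit instead of relying on the implicit ordering
// of copies within a single stream. The same dependency list could be turned
// into edges with the graph API.
void wait_on_inputs(cudaStream_t kernel_stream,
                    const std::vector<cudaEvent_t> &input_events)
{
    for (auto event : input_events) {
        cudaStreamWaitEvent(kernel_stream, event, 0);
    }
}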
Unified memory
CUDA is closing the gap between unified memory and explicitly managed memory with every new GPU generation. According to this blog post, using unified memory is virtually as efficient as the use of explicit copies, as long as the runtime receives hints on where the next access is going to happen. If the performance is (roughly) the same, unified memory has quite a few advantages over explicit memory copies: it's easier to support large data sets that do not fit into GPU memory (an initial solution can just rely on unified memory and transfer data as needed, and later the transfers can be optimized by providing hints to the runtime); trying to access data that is not in the local memory space no longer results in a runtime error, instead it only has some performance implications.
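For reference, the kind of hints mentioned above are given through cudaMemAdvise and cudaMemPrefetchAsync; a minimal sketch (the buffer and sizes are placeholders):

#include <cuda_runtime.h>

void unified_memory_example(size_t n, int device_id, cudaStream_t stream)
{
    double *x;
    // unified memory: accessible from both the host and the device
    cudaMallocManaged(&x, n * sizeof(double));

    // hint: the data should preferably live on the device
    cudaMemAdvise(x, n * sizeof(double), cudaMemAdviseSetPreferredLocation,
                  device_id);
    // hint: migrate the data to the device before the kernels need it
    cudaMemPrefetchAsync(x, n * sizeof(double), device_id, stream);

    // ... launch kernels using x on `stream` ...

    // prefetch back to the host before the CPU reads the results
    cudaMemPrefetchAsync(x, n * sizeof(double), cudaCpuDeviceId, stream);
    cudaStreamSynchronize(stream);
    cudaFree(x);
}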
I think we will definitely have to keep in mind that unified memory may become the standard. For this, our executor model would have to change and not assume that the partitioning of memory spaces follows the partitioning of execution spaces (more on that in https://gitlab.com/ginkgo-project/ginkgo/issues/85). That would probably mean that we add a separate abstraction to represent memory spaces, not directly tied to the executor. The executor would then be constructed with references to memory spaces it can access when performing computations.
Technical details
Here I want to add some suggestions about the possible implementation of low-level primitives we'll need for asynchronous behavior.
Executors
The executor's memory copy and operation launch functions could use asynchronous copies and asynchronous kernel launches, and return a handle to the copy / kernel launch operation that can be used to wait for the copy to complete, or schedule an operation after the copy was completed:
// uses asynchronous copy and returns a handle that can be used to synchronize
template <typename T>
async_handle Executor::copy_from(
const Executor *src_exec, size_type num_elems,
const T *src_ptr, T *dest_ptr) const;
// schedules an asynchronous operation and returns a handle used to synchronize
async_handle Executor::run(const Operation &op) const;
// the current thread waits for the operation referenced by the handle to complete
void async_handle::wait();
// next is executed only after the current handle completes
async_handle async_handle::then(async_handle next);
// Groups multiple handles into one to allow multiple dependencies
template <typename... Handles>
async_handle when_all(Handles... handles);
// e.g. to wait on both h1 and h2 before executing h3: when_all(h1, h2).then(h3).wait()
Arrays
Arrays use the underlying executor to do the copies, so they have to save the handle and make it available to the user. I'm still not sure what's the best way to do it. async_handle get_handle() is the obvious solution, though it's probably not the optimal one to make writing kernels simple. The problem is that the copy of an array is triggered using the assignment operator. While we could return the async_handle as the result, this is not the expected behavior of the assignment operator, as it should return a reference to the left operand. Using get_handle() it would look something like this:
(a1 = a2).get_handle().then(exec->run(operation)).wait();
It might be better to just have a copy method, or a wrapper:
copy(a2, a1).then(exec->run(operation)).wait();
Or something similar...
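A rough sketch of such a wrapper, assuming the hypothetical async_handle-returning Executor::copy_from sketched above and Ginkgo's current Array member functions:

// Hypothetical free function: size the target, trigger the asynchronous copy
// through the target executor and hand the resulting handle back to the caller.
template <typename ValueType>
async_handle copy(const Array<ValueType> &from, Array<ValueType> &to)
{
    to.resize_and_reset(from.get_num_elems());
    return to.get_executor()->copy_from(
        from.get_executor().get(), from.get_num_elems(),
        from.get_const_data(), to.get_data());
}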
Standard C++ support and other libraries
The C++11 standard adds some limited support for expressing asynchronous operations. Namely, these are std::async, which allows launching an operation asynchronously, and std::future, which represents the handle to the asynchronous operation. The std::future::wait() method can be used to synchronize the operation with the running thread once it is completed.
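For illustration, a minimal standard C++11 example of this interface (the workload is a placeholder):

#include <future>
#include <iostream>

int expensive_computation() { return 42; }  // placeholder workload

int main()
{
    // launch the operation asynchronously and obtain a future as its handle
    std::future<int> result =
        std::async(std::launch::async, expensive_computation);

    // ... do other work on this thread ...

    result.wait();                      // synchronize with the operation
    std::cout << result.get() << '\n';  // retrieve its result
}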
The standard still does not support a way to queue multiple operations and build an execution graph, but that is part of the Concurrency TS, which will (hopefully) become part of a future standard. That one provides an extended future, std::experimental::future, which also contains the then method with the same meaning as above, and when_all and when_any functions to help build the graph. Also, HPX provides an improved implementation of futures with all of these features (and support for accelerators). I think they are the ones pushing the Concurrency TS effort in the standard.
The point of this section is not that we should use the standard C++ futures, or get the whole HPX as a dependency (though it's probably far more stable now than last year), but that whatever we design should probably have the same interface as (or a superset of) the one used in the standard. This would make Ginkgo easier to use, and simplify interoperability with other libraries.
Another interesting project is ReactiveX. It is very popular in web development (the whole Angular framework we use for GPE is basically built on top of it), and there is a C++ implementation. Its primary goal is not exactly HPC, runtime systems, or even asynchronous execution. But looking at the idea, I feel that it's just a slightly different view of execution graphs. I'm not saying we should do anything about it, but we should keep their approach to concurrency in mind when designing our interface; there might be some interesting ideas there.
CUDA kernel launches
This is not exactly the topic of this issue, but I feel it might be quite related. We currently launch CUDA kernels using the <<< >>> syntax. Unfortunately, we forget that the kernel launch can fail, and do not check the last error immediately after doing it. Also, kernels launched with <<< >>> cannot use the global device synchronization introduced in Volta (to do that, cudaLaunchCooperativeKernel has to be used, see the Runtime API). Any syntax highlighting / parsing tool that does not explicitly understand CUDA will have problems with the <<< >>> syntax. Handling different versions of kernels (depending on compute capabilities) is also problematic.
To solve these problems, we could provide wrappers for kernel launches, that use the appropriate version of the kernel, depending on the detected GPU, use the correct runtime API call if the kernel needs device synchronization or dynamic parallelism, check the errors on kernel launch, add some post-processing to figure out if there were any device-side errors, etc.
Related to asynchronous copies, this kernel wrapper could also be the place where synchronization with input copy events happens.
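A rough sketch of such a wrapper, limited to the error-checking part (the kernel version selection and cooperative launch handling are omitted; names are placeholders):

#include <stdexcept>
#include <string>
#include <cuda_runtime.h>

// Minimal sketch of a launch wrapper: forward the configuration and arguments
// to the kernel and immediately check whether the launch itself failed.
// Selecting between <<< >>> and cudaLaunchCooperativeKernel, or between kernel
// versions for different compute capabilities, would also go here.
template <typename Kernel, typename... Args>
void checked_launch(dim3 grid, dim3 block, size_t shared_mem,
                    cudaStream_t stream, Kernel kernel, Args... args)
{
    kernel<<<grid, block, shared_mem, stream>>>(args...);
    auto status = cudaGetLastError();
    if (status != cudaSuccess) {
        throw std::runtime_error(std::string("kernel launch failed: ") +
                                 cudaGetErrorString(status));
    }
}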
Related to #1206