
CUDA Graphs

Open ax3l opened this issue 6 years ago • 7 comments

From the latest information on CUDA Graphs, the following rules of thumb apply:

  • always use CUDA Graphs to launch kernels; it will always be at least as fast as launching without a graph, even if you only schedule one or a few kernels (see the sketch below)
  • when starting an app with a new GPU context, allocate and free a large task graph first (e.g. the largest you expect to occur at runtime), which will reduce initial latencies
  • when tracing the performance of graphs, tasks are internally batched again in groups of ~8; one does not need to optimize for that, it just becomes visible in traces
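
A minimal sketch of the first two bullets (illustrative only; the kernel, sizes, and launch configuration are placeholders, and error checking is omitted):

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for real work.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    int const n = 1 << 20;
    float* data;
    cudaMalloc(&data, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the kernel launch into a graph instead of launching directly.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(data, 2.0f, n);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once; doing this early with a large graph (second bullet)
    // absorbs the initial latencies.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    // Launching the instantiated graph is at least as fast as a plain
    // kernel launch, even for a single kernel (first bullet).
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```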

Official Docs: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs

ax3l (Aug 05 '19 14:08)

Currently I see two ways of using CUDA Graphs in combination with redGrapes:

  • For static, GPU-only tasks, create the graph manually with the CUDA API and submit it inside a redGrapes task that gets handled by the cuda-scheduler, which in turn allocates a cuda-stream at the time the graph can be submitted. Generally, this would be exactly the same as with any other async CUDA call.

  • Capture all CUDA calls with stream capture until a host task appears that creates a dependency. Then the graph capture is ended and the graph submitted (see the sketch below).
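
A rough sketch of the second way, assuming two consecutive device tasks followed by a host task that depends on their result (all kernel and function names are hypothetical):

```cpp
// Record consecutive device tasks into one capture...
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
deviceTaskA<<<gridA, blockA, 0, stream>>>(/* ... */);
deviceTaskB<<<gridB, blockB, 0, stream>>>(/* ... */);

// ...until a host task creates a dependency: end the capture and submit.
cudaGraph_t graph;
cudaGraphExec_t exec;
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
cudaGraphLaunch(exec, stream);

cudaStreamSynchronize(stream);  // the host task's dependency is now fulfilled
hostTask(/* ... */);
```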

michaelsippel (Jul 01 '20 23:07)

@michaelsippel do you prefer one way to the other, or what are your thoughts on this matter generally?

sbastrakov (Jul 02 '20 06:07)

From the CUDA docs it seems that it should be possible to call cudaGraphLaunch() while capturing a graph, so both approaches could be used simultaneously. The question is rather whether this is useful at all, because we would always create a new cuda-graph and never reuse it. I don't really believe that it would be faster to wrap every cuda call in a cuda-graph, because it would delay the submission of cuda operations and also introduce more runtime overhead. As I understand it, you only really benefit if you use the cuda-graph as a template and resubmit the same graph multiple times, because that saves the CPU time of rebuilding the graph (a sketch follows below). If you want to do that, it can be done manually, but I don't know how to extract a cuda-graph template out of a dynamic task graph with mixed host- and device-tasks.
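
For illustration, a hedged sketch of the template case: the graph is built and instantiated once, then resubmitted every iteration, so the CPU-side graph construction happens only on the first step (stepKernel and the loop structure are assumptions):

```cpp
cudaGraph_t graph = nullptr;
cudaGraphExec_t exec = nullptr;

for (int step = 0; step < nSteps; ++step)
{
    if (!exec)
    {
        // Record the per-step work once and keep the result as a template.
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        stepKernel<<<grid, block, 0, stream>>>(/* ... */);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    }
    // Resubmitting the same executable graph skips rebuilding it; this is
    // where the saved CPU time comes from.
    cudaGraphLaunch(exec, stream);
}
cudaStreamSynchronize(stream);
```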

michaelsippel (Jul 02 '20 21:07)

> From the CUDA docs it seems that it should be possible to call cudaGraphLaunch() while capturing a graph, so both approaches could be used simultaneously. The question is rather whether this is useful at all, because we would always create a new cuda-graph and never reuse it. I don't really believe that it would be faster to wrap every cuda call in a cuda-graph, because it would delay the submission of cuda operations and also introduce more runtime overhead. As I understand it, you only really benefit if you use the cuda-graph as a template and resubmit the same graph multiple times, because that saves the CPU time of rebuilding the graph. If you want to do that, it can be done manually, but I don't know how to extract a cuda-graph template out of a dynamic task graph with mixed host- and device-tasks.

It is definitely useful if, for example, we always enqueue an event behind the task to track the state of the operation (see the sketch below). IMO it will also be useful if you know that you have more than two tasks ready to enqueue.
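
A minimal sketch of this per-task event pattern (kernel, stream, and flag choice are placeholders):

```cpp
// One event per task, recorded directly behind the task's CUDA calls.
cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

taskKernel<<<grid, block, 0, stream>>>(/* ... */);
cudaEventRecord(done, stream);

// Later, from the scheduler's poll loop, without blocking the CPU:
if (cudaEventQuery(done) == cudaSuccess)
{
    // the task, and everything enqueued before it on this stream, is finished
}
```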

psychocoderHPC (Jul 03 '20 12:07)

> It is definitely useful if, for example, we always enqueue an event behind the task to track the state of the operation.

You mean: because we have to do this anyway, we always have at least two CUDA operations per task (the actual call and the eventRecord) which can be recorded into the graph.

> IMO it will also be useful if you know that you have more than two tasks ready to enqueue.

Yes, that's what I meant with the second bullet. I will post an image that explains how I imagine this working. But I don't yet believe that it will gain any performance.

michaelsippel (Jul 03 '20 13:07)

[Image: redGrapesCudaGraph, a diagram of the capture flow traced below]


 1. mgr.finish_task( Host Task 1 )
   1.1. scheduling_graph.task_end( Host Task 1 )
     1.1.1. scheduling_graph.reach_event( Event 2 )


   1.2. mgr.activate_followers( Host Task 1 )
     1.2.1. scheduler.activate_task( Host Task 2 )
         ...

     1.2.2. scheduler.activate_task( CUDA Task 3 )
       1.2.2.1. cuda_scheduler.activate_task( CUDA Task 3 )

         1.2.2.1.1. scheduling_graph.is_task_ready( CUDA Task 3 ) == true
         1.2.2.1.1.1. scheduling_graph.is_event_reached( Event 5 ) == true

         1.2.2.1.2. cuda_scheduler.is_recording == false
         1.2.2.1.2.1. cudaStreamBeginCapture()

         1.2.2.1.2.2. cuda_scheduler.dispatch_task( CUDA Task 3 )
           1.2.2.1.2.2.1. current_stream = select next cuda stream

           1.2.2.1.2.2.2. mgr.run_task( CUDA Task 3 )
           1.2.2.1.2.2.2.1. scheduling_graph.reach_event( 5 )

           1.2.2.1.2.2.3. cuda_event_1 = cudaEventRecord()

            1.2.2.1.2.2.4. Add the cuda event as a cuda-dependency to the following cuda-tasks.
           1.2.2.1.2.2.4.1. [CUDA Task 4].properties.cuda_dependencies[ current_stream ] = cuda_event_1


           1.2.2.1.2.2.5. mgr.activate_followers( CUDA Task 3 )
             1.2.2.1.2.2.5.1. scheduler.activate_task( CUDA Task 4 )
             1.2.2.1.2.2.5.1.1. cuda_scheduler.activate_task( CUDA Task 4 )
               1.2.2.1.2.2.5.1.1.1. scheduling_graph.is_task_ready( CUDA Task 4 ) == true
               1.2.2.1.2.2.5.1.1.1.1. scheduling_graph.is_event_reached( Event 7 ) == true

               1.2.2.1.2.2.5.1.1.2. cuda_scheduler.is_recording == true
                1.2.2.1.2.2.5.1.1.2.1. cuda_scheduler.dispatch_task( CUDA Task 4 )
                 1.2.2.1.2.2.5.1.1.2.1.1. current_stream = select next cuda stream

                  1.2.2.1.2.2.5.1.1.2.1.2. cudaStreamWaitEvent( current_stream, cuda_event_1 )

                  1.2.2.1.2.2.5.1.1.2.1.3. mgr.run_task( CUDA Task 4 )
                  1.2.2.1.2.2.5.1.1.2.1.3.1. scheduling_graph.reach_event( 7 )

                 1.2.2.1.2.2.5.1.1.2.1.4. cuda_event_2 = cudaEventRecord()

                  1.2.2.1.2.2.5.1.1.2.1.5. Add the cuda event as a cuda-dependency to the following cuda-tasks.
                 1.2.2.1.2.2.5.1.1.2.1.5.1. [CUDA Task 7].properties.cuda_dependencies[ current_stream ] = cuda_event_2

                 1.2.2.1.2.2.5.1.1.2.1.6. mgr.activate_followers( CUDA Task 4 )
                   1.2.2.1.2.2.5.1.1.2.1.6.1. scheduler.activate_task( Host Task 6 )
                   1.2.2.1.2.2.5.1.1.2.1.6.1.1. default_scheduler.activate_task( Host Task 6 )
                   1.2.2.1.2.2.5.1.1.2.1.6.1.1.1. scheduling_graph.is_task_ready( Host Task 6 ) == false
                   1.2.2.1.2.2.5.1.1.2.1.6.1.1.1.1. scheduling_graph.is_event_reached( Event 11 ) == false

                   1.2.2.1.2.2.5.1.1.2.1.6.2. scheduler.activate_task( CUDA Task 7 )
                   1.2.2.1.2.2.5.1.1.2.1.6.2.1. cuda_scheduler.activate_task( CUDA Task 7 )
                   1.2.2.1.2.2.5.1.1.2.1.6.2.1.1. scheduling_graph.is_task_ready( CUDA Task 7 ) == false
                   1.2.2.1.2.2.5.1.1.2.1.6.2.1.1.1. scheduling_graph.is_event_reached( Event 13 ) == false

             1.2.2.1.2.2.5.2. scheduler.activate_task( Host Task 5 )
             1.2.2.1.2.2.5.2.1. default_scheduler.activate_task( Host Task 5 )
               1.2.2.1.2.2.5.2.1.1. scheduling_graph.is_task_ready( Host Task 5 ) == false
               1.2.2.1.2.2.5.2.1.1.1. scheduling_graph.is_event_reached( Event 9 ) == false

         1.2.2.1.2.3. cudaStreamEndCapture()
         1.2.2.1.2.4. cudaGraphLaunch()

 2. cuda_scheduler.poll()
   2.1. cudaEventQuery() == cudaSuccess
    2.1.1. mgr.finish_task( CUDA Task 3 )
     2.1.1.1. scheduling_graph.task_end( CUDA Task 3 )
     2.1.1.1.1. scheduling_graph.reach_event( Event 6 )

     2.1.1.2. mgr.activate_followers( CUDA Task 3 )
       2.1.1.2.1. scheduler.activate_task( CUDA Task 4 )
       2.1.1.2.1.1. cuda_scheduler.activate_task( CUDA Task 4 )
         2.1.1.2.1.1.1. CUDA Task 4 is already submitted

       2.1.1.2.2. scheduler.activate_task( Host Task 5 )
       2.1.1.2.2.1. default_scheduler.activate_task( Host Task 5 )
         2.1.1.2.2.1.1. scheduling_graph.is_task_ready( Host Task 5 ) == true
         2.1.1.2.2.1.1.1. scheduling_graph.is_event_reached( Event 9 ) == true

         2.1.1.2.2.1.2. run Host Task 5 ...

michaelsippel (Jul 03 '20 16:07)

> You mean: because we have to do this anyway, we always have at least two CUDA operations per task (the actual call and the eventRecord) which can be recorded into the graph.

Yes, in that case we have two CUDA API calls per task.

> But I don't yet believe that it will gain any performance.

Not sure whether by performance you mean latency; we need to reduce the latency. In PIConGPU we have over 2k API calls per second, and each call takes around 10 µs, so we waste a lot of time blocking the CPU in the CUDA API. The actual slowdown is less than the full 20 ms per second, because most API calls are performed in parallel to the execution of tasks; the numbers are spelled out below.
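
Spelled out, the back-of-the-envelope numbers (treating the figures above as exact for illustration):

```cpp
constexpr double apiCallsPerSecond = 2000.0;  // "over 2k API calls per second"
constexpr double latencyPerCall    = 10e-6;   // "around 10 µs" per call
// CPU time blocked in the CUDA API, per second of runtime:
constexpr double blockedPerSecond  = apiCallsPerSecond * latencyPerCall;  // 0.02 s = 20 ms
```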

psychocoderHPC (Jul 03 '20 19:07)