CUDA Dynamic Parallelism Synchronization (CUDA > 11.6)
Hello, I need something like cudaDeviceSynchronize() to wait for a child kernel to finish, but no kind of synchronization can be called from device functions after CUDA 11.6.
May I request an example of how to handle this?
Here's the code that I'm trying to run:
__global__ void NNFeedForwardNormalMultiple(double* __restrict__ values, double* __restrict__ weights, double* result, int inputsize, int outputsize) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int inputidx = idx / outputsize;
    int outputidx = idx % outputsize;
    if (inputidx >= inputsize || outputidx >= outputsize) {
        return;
    }
    atomicAdd(&result[outputidx], values[inputidx] * weights[outputsize*inputidx + outputidx]);
}
__global__ void NNFeedForwardNormalActivate(double* __restrict__ biases, double* result, int size) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= size) {
        return;
    }
    result[idx] = 1.0 / (1.0 + exp(-(result[idx] + biases[idx])));
}
__global__ void NNFeedForwardNormal(double* __restrict__ values, double* __restrict__ weights, double* result, double* __restrict__ biases, int inputsize, int outputsize) {
    int blocksize = (inputsize * outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    NNFeedForwardNormalMultiple<<<blocksize, THREADS_PER_BLOCK>>>(values, weights, result, inputsize, outputsize);
    // normally I'd call something like cudaDeviceSynchronize() here to wait for the child kernel to finish
    NNFeedForwardNormalActivate<<<(outputsize + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(biases, result, outputsize);
}
Thanks!
I'm not sure what you're referring to; cudaDeviceSynchronize() is still a valid and supported API. Usually, though, you don't want to launch all of your work into the default stream but rather use streams explicitly, in which case you'd use cudaStreamSynchronize(). Also keep in mind that any work launched into a stream will still complete sequentially - what I mean by that is that if you launch kernel_a, kernel_b, and kernel_c into a stream they'll run in that order.
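For the host-side case, explicit stream usage might look like this minimal sketch (kernel_a, kernel_b, grid, and block are placeholders, not names from the original code):

```cuda
// Host-side sketch: launch dependent kernels into one explicit stream.
// Work issued into a single stream executes in issue order, so kernel_b
// only starts after kernel_a has finished.
cudaStream_t stream;
cudaStreamCreate(&stream);

kernel_a<<<grid, block, 0, stream>>>(/* args */);
kernel_b<<<grid, block, 0, stream>>>(/* args */);  // ordered after kernel_a

// Block the host thread until everything in this stream is done.
cudaStreamSynchronize(stream);
cudaStreamDestroy(stream);
```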
Hello @rwarmstr,
> I'm not sure what you're referring to; cudaDeviceSynchronize() is still a valid and supported API
No, it's not supported when called from a __device__ or __global__ function after CUDA 11.6. I also tried cudaStreamSynchronize() and other synchronization calls, but none of them can be called from a device function.
> Also keep in mind that any work launched into a stream will still complete sequentially
Also no: when you launch two kernels from a device function without synchronization, they run at almost the same time. In a case like mine, where the first kernel performs calculations that take a while, the second kernel will start activating neurons before the matrix multiplication has finished, and the first kernel will keep adding values after activation because it is still running.
cudaDeviceSynchronize() is not a device-side API, and it wouldn't make sense to call it from within a device function: by definition, if the function is waiting on a synchronization inline with an active warp, it is still running.
For the second point, there are some nuances in execution you can observe: CTAs (blocks) from different kernels can be pulled in to schedule on idle SMs while some CTAs from the prior launch are still active. What I meant in my response above is that CTAs are pulled from the queue essentially sequentially, but you may end up with CTAs from one kernel active simultaneously with CTAs from another. If you want to avoid that, use CUDA events to synchronize execution.
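If the goal is to order child kernels launched from a parent kernel without a device-side cudaDeviceSynchronize() (which the newer device runtime no longer provides), one option under CUDA 12.x dynamic parallelism (CDP2) is the named tail-launch stream. This is a sketch applied to the kernels from the original post, assuming a CUDA 12.x toolkit and compilation with -rdc=true:

```cuda
// Device-side sketch (CUDA 12.x CDP2): grids launched into
// cudaStreamTailLaunch begin only after the parent grid exits, and they
// execute in launch order relative to each other. That guarantees the
// activation kernel sees the completed accumulation from the first child
// kernel, with no device-side synchronization call needed.
__global__ void NNFeedForwardNormal(double* __restrict__ values,
                                    double* __restrict__ weights,
                                    double* result,
                                    double* __restrict__ biases,
                                    int inputsize, int outputsize) {
    int blocks = (inputsize * outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    NNFeedForwardNormalMultiple<<<blocks, THREADS_PER_BLOCK, 0, cudaStreamTailLaunch>>>(
        values, weights, result, inputsize, outputsize);
    NNFeedForwardNormalActivate<<<(outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK,
                                  THREADS_PER_BLOCK, 0, cudaStreamTailLaunch>>>(
        biases, result, outputsize);
}
```

Alternatively, since the parent kernel here does nothing besides launch the two children, the simplest restructure is to drop it entirely and launch both kernels from the host into the same stream, where ordinary stream ordering applies.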