
[QUESTION]: Kernel runs much slower on NVIDIA GPU than on Intel onboard GPU. Any ideas?

Open AmosEgel opened this issue 5 months ago • 0 comments

Question

My laptop has two GPUs:

  1. Intel(R) UHD Graphics 620
  2. NVIDIA GeForce MX250

I would expect the NVIDIA GPU to perform at least as well as the Intel GPU, but the opposite is the case: we measure 50 to 100 times better performance on the Intel GPU (see the benchmark below).

Note that this is just a toy program. With our real kernel, however, execution on every NVIDIA GPU we tested was painfully slow, much slower than a single-threaded serial run on the CPU.

One thing I would like to check is whether this is due to the hardware or to the CUDA versus OpenCL accelerator.

Is it possible to run the kernel on the NVIDIA GPU with OpenCL instead of CUDA in ILGPU?

Do you have any other advice on what I could try to improve performance on the NVIDIA GPU?

Environment

  • ILGPU version: 1.5.1
  • .NET version: Framework 4.7.2 (also tried .NET 8)
  • Operating system: Windows 10
  • Hardware (if GPU-related): NVIDIA GeForce MX250 and several other NVIDIA GPUs

Additional context

This is the kernel:

static void Kernel(Index1D i, ArrayView<float> outputData)
{
    XorShift64Star rng = new XorShift64Star((ulong)(i.X + 1));
    float result = 0;
    for (int iRay = 0; iRay < 10000; iRay++)
    {
        float rpX = rng.NextFloat() - 0.5F;
        float rpY = rng.NextFloat() - 0.5F;

        float rdX = rng.NextFloat() - 0.5F;
        float rdY = rng.NextFloat() - 0.5F;
        float rdZ = MathF.Sqrt(1.0F - rdX * rdX - rdY * rdY);

        float alpha = -MathF.Atan2(rpY, rpX);
        float sin = MathF.Sin(alpha);
        float cos = MathF.Cos(alpha);

        float r = cos * rpX - sin * rpY;
        float dx = cos * rdX - sin * rdY;
        float dy = sin * rdX + cos * rdY;

        float theta = MathF.Acos(rdZ);
        float phi = MathF.Atan2(dy, dx);
        result = result + phi + theta;
    }
    outputData[i] = result;
}

This is the method we use to launch the kernel and measure the run time:

public static void Launch(string deviceName)
{
    using (Context context = Context.Create(builder => builder.Default().EnableAlgorithms()))
    {
        Device device = GetDevice(deviceName, context);
        using (Accelerator accelerator = device.CreateAccelerator(context))
        {
            MemoryBuffer1D<float, Stride1D.Dense> deviceOutput = accelerator.Allocate1D<float>(1024);
            var loadedKernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>>(Kernel);
            Stopwatch sw = new Stopwatch();
            sw.Start();
            loadedKernel(1024, deviceOutput.View);
            accelerator.Synchronize();
            sw.Stop();
            Console.WriteLine($"Kernel time on {deviceName}: {sw.Elapsed}");
        }
    }
}
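
The method above times a single launch, so any one-time cost incurred by the first launch would be included in the reported number. A minimal sketch of a measurement that runs one untimed warm-up and then averages several timed runs, assuming repeated launches are representative of the workload; `TimingSketch` and `MeasureWarm` are hypothetical names and not part of ILGPU:

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Hypothetical helper (not part of ILGPU): runs the action once untimed,
    // then returns the average wall-clock time of `repetitions` timed runs.
    public static TimeSpan MeasureWarm(Action launchAndSync, int repetitions = 10)
    {
        if (launchAndSync == null) throw new ArgumentNullException(nameof(launchAndSync));
        if (repetitions < 1) throw new ArgumentOutOfRangeException(nameof(repetitions));

        // Warm-up run, excluded from the measurement.
        launchAndSync();

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < repetitions; i++)
        {
            launchAndSync();
        }
        sw.Stop();

        // Average duration of one timed run.
        return TimeSpan.FromTicks(sw.Elapsed.Ticks / repetitions);
    }
}
```

Inside `Launch`, this could be called as `TimingSketch.MeasureWarm(() => { loadedKernel(1024, deviceOutput.View); accelerator.Synchronize(); })`. The synchronize call stays inside the action so that each timed run covers the full kernel execution.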

Launch uses the following helper function:

private static Device GetDevice(string deviceName, Context context)
{
    foreach (Device dev in context)
    {
        if (dev.Name == deviceName) return dev;
    }
    throw new ArgumentException("Cannot find device " + deviceName);
}
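
Because `GetDevice` matches on the name only, it cannot distinguish a CUDA device from an OpenCL device if both report the same name. A minimal sketch that also matches on `Device.AcceleratorType`, which would allow CUDA and OpenCL to be compared on the same hardware. Whether the NVIDIA GPU is actually listed as an OpenCL device on a given system is an assumption here; `DeviceSelectionSketch` and its methods are hypothetical names:

```csharp
using System;
using ILGPU;
using ILGPU.Runtime;

public static class DeviceSelectionSketch
{
    // Prints every device the context knows about, with its back end,
    // so that duplicate names across back ends become visible.
    public static void ListDevices(Context context)
    {
        foreach (Device dev in context)
        {
            Console.WriteLine($"{dev.AcceleratorType}: {dev.Name}");
        }
    }

    // Hypothetical variant of GetDevice that requires both the name
    // and the accelerator type (CPU, Cuda or OpenCL) to match.
    public static Device GetDevice(string deviceName, AcceleratorType type, Context context)
    {
        foreach (Device dev in context)
        {
            if (dev.Name == deviceName && dev.AcceleratorType == type) return dev;
        }
        throw new ArgumentException(
            "Cannot find device " + deviceName + " with accelerator type " + type);
    }
}
```

For example, `GetDevice("NVIDIA GeForce MX250", AcceleratorType.OpenCL, context)` would either return the OpenCL variant of the device or throw if ILGPU does not list one.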

AmosEgel, Sep 16 '24 17:09