ILGPU
Add an example to show use cases of As2DView and As3DView
I'm trying to migrate my library to the new (beta) version of the ILGPU library. The ArrayView2D and related types are different now - we need to specify Stride2D.DenseX or DenseY. My data on the CPU side is float[,]. I understand the difference between DenseX and DenseY, but I'm still confused by the performance results:
22460 [ms] - DenseX - ~50% GPU utilization
14995 [ms] - DenseY - ~75% GPU utilization - 1.5x faster!
If I allocate the buffer with DenseY, the code is 1.5x faster and the GPU is more utilized - and I just copy from/to the CPU, nothing else. I expected DenseY to be slower because it performs a translation, according to the example here. What is interesting, however, is that a few lines below this example you say the opposite:
Stride2D.DenseX represents 2D strides that pack elements side by side in dimension X (transfers from a to views with this stride involve transpose operations).
Also, I'm not sure which stride is faster for 3D views.
Could you explain why I see such performance results?
Here is my example code:
Random rnd = new Random();
float[,] xInp = new float[Size, Size];   // Size = matrix dimension (defined elsewhere)
for (int nr = 0; nr < Size; ++nr)
{
    for (int nc = 0; nc < Size; ++nc)
        xInp[nr, nc] = (float)rnd.NextDouble();
}

using (var ctx = Context.CreateDefault())
using (var acc = ctx.CreateCudaAccelerator(0))
{
    // One buffer per stride flavor to compare DenseX vs. DenseY transfers.
    using (var buffX = acc.Allocate2DDenseX<float>(new Index2D(Size, Size)))
    using (var buffY = acc.Allocate2DDenseY<float>(new Index2D(Size, Size)))
    {
        for (int ndx = 0; ndx < Count; ++ndx)   // Count = number of copy iterations
        {
            buffX.View.CopyFromCPU(xInp);
            buffY.View.CopyFromCPU(xInp);
            float[,] xData = buffX.View.GetAsArray2D();
            float[,] yData = buffY.View.GetAsArray2D();
        }
    }
}
For ArrayView2D, DenseY matches how C# organizes 2D arrays, which removes the need for a transpose operation when GetAsArray2D is called. For 3D arrays I believe the ~~DenseYX~~ DenseZY stride is the most performant.
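For reference, a minimal sketch of that transpose-free round trip; the matrix size and CUDA device index below are illustrative assumptions, not values from this thread:

using ILGPU;
using ILGPU.Runtime;
using ILGPU.Runtime.Cuda;

const int N = 4096;                 // illustrative matrix dimension
float[,] host = new float[N, N];    // C# 2D array

using var context = Context.CreateDefault();
using var accelerator = context.CreateCudaAccelerator(0);

// DenseY matches the layout of float[,], so both transfers below
// can run as plain copies without a transpose step.
using var buffer = accelerator.Allocate2DDenseY<float>(new Index2D(N, N));
buffer.View.CopyFromCPU(host);                    // host -> device
float[,] roundTrip = buffer.View.GetAsArray2D();  // device -> host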
@NullandKale is right. DenseY corresponds to the "most efficient" data layout you can choose for the interop between .Net arrays and ILGPU kernels. Maybe this is also beneficial in terms of your memory access pattern within the kernel?
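To make that kernel-side point concrete, here is a rough sketch of an element-wise kernel consuming a DenseY view; the scaling operation is a placeholder of mine, and it reuses the accelerator and buffer from the snippet above:

// With DenseY, consecutive index.Y values map to adjacent memory locations.
static void ScaleKernel(
    Index2D index,
    ArrayView2D<float, Stride2D.DenseY> data,
    float factor)
{
    data[index] *= factor;
}

// Auto-grouped launcher: ILGPU picks the group size for us.
var kernel = accelerator.LoadAutoGroupedStreamKernel<
    Index2D, ArrayView2D<float, Stride2D.DenseY>, float>(ScaleKernel);
kernel(buffer.View.IntExtent, buffer.View, 2.0f);
accelerator.Synchronize();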
Thanks for the answer. I recommend fixing the comment in the example code because it is confusing:
// Simple 2D allocation of 1024 * 1024 longs using the array provided with TStride = Stride2D.DenseX
// (all elements in X dimension are accessed contiguously in memory)
// -> this will not transpose the input buffer as the memory layout will be identical on CPU and GPU
My new documentation, which should be merged in the next few days, covers this already!
If you want to view it before it gets merged, it is available here.
This is the memory tutorial.
@NullandKale I read your tutorial and it is awesome! :-)
Please update it with examples for shared static/dynamic memory and for the view transformations SubView and As2DView / As3DView. Currently I have no clue how to use the 1D to 2D/3D view transformation; the TOtherStride argument makes it really hard :-(.
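For anyone else stuck on the same step, here is a rough sketch of how I understand the 1D-to-2D reinterpretation is supposed to work. The As2DView overload, the SubView arguments, and the Stride2D.DenseY constructor argument below are assumptions on my part and should be checked against the current API docs; it reuses the accelerator from the snippets above:

const int Width = 16, Height = 8;   // illustrative extents

// A dense 1D buffer holding Width * Height elements.
using var buffer1D = accelerator.Allocate1D<float>(Width * Height);

// SubView selects a contiguous slice of a 1D view: (offset, length).
var slice = buffer1D.View.SubView(0, Width * Height);

// Assumption: As2DView reinterprets the dense 1D view as a 2D view, and the
// TOtherStride generic argument is inferred from the stride instance passed in.
// For DenseY (elements contiguous along Y), the X stride equals the Y extent.
var view2D = slice.As2DView(
    new LongIndex2D(Width, Height),
    new Stride2D.DenseY(Height));

// As3DView presumably follows the same pattern with a LongIndex3D extent
// and a Stride3D instance.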