
Add an example to show use cases of As2DView and As3DView

Open CsabaStupak opened this issue 2 years ago • 5 comments

I'm trying to migrate my library to the new (beta) version of ILGPU. ArrayView2D and related types have changed: we now need to specify Stride2D.DenseX or Stride2D.DenseY. My data on the CPU side is float[,]. I understand the difference between DenseX and DenseY, but I'm still confused by the performance results:

DenseX: 22460 [ms], ~50% GPU
DenseY: 14995 [ms], ~75% GPU (1.5x faster!)

If I allocate the buffer with DenseY, the code is 1.5x faster and GPU utilization is higher, even though I only copy from/to the CPU and nothing else. I expected DenseY to be slower because it performs a transposition, according to the example here. Interestingly, however, a few lines below this example you say the opposite:

Stride2D.DenseX represents 2D strides that pack elements side by side in dimension X (transfers from a to views with this stride involve transpose operations).

Also, I'm not sure which stride is faster for 3D views.

Could you explain why I see such performance results?

Here is my example code:

Random rnd = new Random();
float[,] xInp = new float[Size, Size];
for (int nr = 0; nr < Size; ++nr)
{
	for (int nc = 0; nc < Size; ++nc)
		xInp[nr, nc] = (float)rnd.NextDouble();
}

using (var ctx = Context.CreateDefault())
using (var acc = ctx.CreateCudaAccelerator(0))
{
	// Both buffers use the DenseX layout here; swap in Allocate2DDenseY to measure the DenseY case.
	using (var buffX = acc.Allocate2DDenseX<float>(new Index2D(Size, Size)))
	using (var buffY = acc.Allocate2DDenseX<float>(new Index2D(Size, Size)))
	{
		for (int ndx = 0; ndx < Count; ++ndx)
		{
			buffX.View.CopyFromCPU(xInp);
			buffY.View.CopyFromCPU(xInp);

			float[,] xData = buffX.View.GetAsArray2D();
			float[,] yData = buffY.View.GetAsArray2D();
		}
	}
}

CsabaStupak • Jul 18 '21 07:07

For ArrayView2D, DenseY matches how C# organizes 2D arrays, which removes the need for a transpose operation when GetAsArray2D is called. For 3D arrays, I believe the ~~DenseYX~~ DenseZY stride is the most performant.
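To make the layout argument concrete, here is a minimal sketch in plain C# that does not use ILGPU at all. It assumes that the first index of a `float[,]` corresponds to dimension X and the second to dimension Y, which is consistent with the explanation above. The `LayoutDemo` class and its methods are hypothetical helpers written for this illustration. The sketch shows that the CLR stores a 2D array with the last index contiguous (the "dense in Y" order), so a DenseY buffer can take the bytes as they are, while a DenseX buffer needs the elements rearranged.

```csharp
using System;

// Hypothetical helper for illustration only; not part of ILGPU.
public static class LayoutDemo
{
    // Copies a float[,] into a flat array in the exact order the CLR stores it.
    public static float[] Flatten(float[,] source)
    {
        var flat = new float[source.Length];
        Buffer.BlockCopy(source, 0, flat, 0, source.Length * sizeof(float));
        return flat;
    }

    // Linear offset when dimension Y (the second index) is contiguous.
    public static int DenseYOffset(int x, int y, int extentX, int extentY) =>
        x * extentY + y;

    // Linear offset when dimension X (the first index) is contiguous.
    public static int DenseXOffset(int x, int y, int extentX, int extentY) =>
        y * extentX + x;

    // Rearranges the data so that X becomes contiguous (an explicit transpose).
    public static float[] ToDenseX(float[,] source)
    {
        int extentX = source.GetLength(0);
        int extentY = source.GetLength(1);
        var result = new float[source.Length];
        for (int x = 0; x < extentX; ++x)
        {
            for (int y = 0; y < extentY; ++y)
                result[DenseXOffset(x, y, extentX, extentY)] = source[x, y];
        }
        return result;
    }

    public static void Main()
    {
        var data = new float[2, 3] { { 0, 1, 2 }, { 10, 11, 12 } };

        // Raw CLR order equals the dense-in-Y order: no rearrangement needed.
        Console.WriteLine(string.Join(", ", Flatten(data)));  // 0, 1, 2, 10, 11, 12

        // Dense-in-X order requires touching every element.
        Console.WriteLine(string.Join(", ", ToDenseX(data))); // 0, 10, 1, 11, 2, 12
    }
}
```

Under these assumptions, the extra pass in `ToDenseX` is the kind of work a DenseX transfer has to do and a DenseY transfer can skip, which would be in line with the timings reported in the question.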

NullandKale • Jul 18 '21 16:07

@NullandKale is right. DenseY corresponds to the "most efficient" data layout you can choose for the interop between .NET arrays and ILGPU kernels. Maybe this is also beneficial in terms of your memory access pattern within the kernel?

m4rs-mt • Jul 18 '21 19:07

Thanks for the answer. I recommend fixing the comment in the example code because it is confusing:

// Simple 2D allocation of 1024 * 1024 longs using the array provided with TStride = Stride2D.DenseX
// (all elements in X dimension are accessed contiguously in memory)
// -> this will not transpose the input buffer as the memory layout will be identical on CPU and GPU

CsabaStupak • Jul 19 '21 08:07

My new documentation, which should be merged in the next few days, already covers this!

If you want to view it before it gets merged it is available here.

This is the memory tutorial.

NullandKale • Jul 19 '21 17:07

@NullandKale I read your tutorial and it is awesome! :-)

Please extend it with examples for static/dynamic shared memory and for the view transformations SubView and As2DView / As3DView. Currently I have no clue how to use the 1D to 2D/3D view transformation. The TOtherStride argument makes it really hard to understand :-(.
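Until an official example exists, the idea behind a 1D to 2D/3D view transformation can be sketched in plain C#. This is a conceptual model only: it does not call ILGPU, and `StridedView2D`, `OffsetXFastest`, and `OffsetZFastest` are hypothetical names invented for this illustration, not the library's API. The assumption is that such a transformation keeps the same underlying 1D memory and only attaches an extent plus a stride that decides how an (x, y) or (x, y, z) index maps to a linear offset.

```csharp
using System;

// Hypothetical model of "a 1D buffer seen as 2D"; not part of ILGPU.
public sealed class StridedView2D
{
    private readonly float[] buffer;
    private readonly int strideX;
    private readonly int strideY;

    public int ExtentX { get; }
    public int ExtentY { get; }

    private StridedView2D(float[] buffer, int extentX, int extentY, int strideX, int strideY)
    {
        if (extentX < 0 || extentY < 0 || (long)extentX * extentY > buffer.Length)
            throw new ArgumentException("Extent does not fit into the 1D buffer.");
        this.buffer = buffer;
        ExtentX = extentX;
        ExtentY = extentY;
        this.strideX = strideX;
        this.strideY = strideY;
    }

    // X is contiguous: stepping in X moves by 1, stepping in Y moves by extentX.
    public static StridedView2D DenseX(float[] buffer, int extentX, int extentY) =>
        new StridedView2D(buffer, extentX, extentY, 1, extentX);

    // Y is contiguous: stepping in Y moves by 1, stepping in X moves by extentY.
    public static StridedView2D DenseY(float[] buffer, int extentX, int extentY) =>
        new StridedView2D(buffer, extentX, extentY, extentY, 1);

    // No data is copied; the index is only translated into a linear offset.
    public float this[int x, int y] => buffer[x * strideX + y * strideY];

    // 3D analogue with X varying fastest, then Y, then Z.
    public static int OffsetXFastest(int x, int y, int z, int extentX, int extentY, int extentZ) =>
        x + extentX * (y + extentY * z);

    // 3D analogue with Z varying fastest, then Y, then X
    // (the order of a C# float[,,] indexed as [x, y, z]).
    public static int OffsetZFastest(int x, int y, int z, int extentX, int extentY, int extentZ) =>
        z + extentZ * (y + extentY * x);

    public static void Main()
    {
        var linear = new float[] { 0, 1, 2, 3, 4, 5 };

        var asDenseX = DenseX(linear, 3, 2);
        var asDenseY = DenseY(linear, 3, 2);

        Console.WriteLine(asDenseX[1, 1]); // 4  (offset 1 + 1 * 3)
        Console.WriteLine(asDenseY[1, 1]); // 3  (offset 1 * 2 + 1)
    }
}
```

In this model the stride argument is what decides which of the two interpretations a view gets, which is presumably the role that TOtherStride plays in the real API; the ILGPU documentation should be checked for the actual method signatures.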

CsabaStupak • Jul 24 '21 09:07