
SRV Load on NVidia

Open zlnimda opened this issue 5 years ago • 6 comments

Hi, I was just testing and found the results for aligned loads on a Pascal GPU very strange.

Then I looked at the code and saw that you added operations that no longer guarantee the loads are actually aligned in your aligned test case.

The performance of ByteAddressBuffer can be similar to a structured buffer if you add the following, which ensures the load addresses are correctly aligned in the shader:

const uint _WIDTH = LOAD_WIDTH * 4;
address = (address / _WIDTH) * _WIDTH;
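
(Assuming _WIDTH is a power of two, i.e. Load/Load2/Load4 but not Load3, the same rounding can also be written as masking off the low address bits, which is likely what the compiler emits for the divide-then-multiply anyway:)

const uint _WIDTH = LOAD_WIDTH * 4;
// Round the byte address down to a multiple of _WIDTH (power-of-two case only)
address &= ~(_WIDTH - 1u);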

zlnimda • Jun 13 '19 12:06

Thanks for pointing this out. Originally the shader was properly aligned, but then I added the binary OR (with a runtime value) to ensure that compilers don't optimize the whole loop into a much smaller number of wide loads.

Definitely need to fix this issue. I should do the binary OR before the * (4 * LOAD_WIDTH) multiply.
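
For clarity, a sketch of the two orderings (the old form is reconstructed from the description above, so treat it as approximate; variable names follow the shader posted below):

// Old (reconstructed): the OR is applied to the byte address itself, so the
// runtime mask can set its low bits and break 4 * LOAD_WIDTH alignment
uint address = ((htid + i) * (4 * LOAD_WIDTH)) | loadConstants.elementsMask;

// Intended fix: OR the element index first, then multiply; the resulting
// byte address stays a multiple of 4 * LOAD_WIDTH
uint elemIdx = (htid + i) | loadConstants.elementsMask;
uint address = elemIdx * (4 * LOAD_WIDTH);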

sebbbi • Jul 30 '19 08:07

Tested on an RTX 2080 Ti (Turing) using 419.35 drivers. Neither multiply nor divide + multiply works. Still getting the same performance.

Now downloading new drivers to see whether Nvidia has finally implemented this optimization (I have been asking for it for a long time).

sebbbi • Jul 30 '19 13:07

Tested with 431.60 drivers too. Alignment doesn't seem to help Turing.

This is the new address calculation code:

		uint elemIdx = (htid + i) | loadConstants.elementsMask;
		uint address = elemIdx * (4 * LOAD_WIDTH);

Adding your alignment code before the load doesn't help either:

		const uint _WIDTH = LOAD_WIDTH * 4;
		address = (address / _WIDTH) * _WIDTH;

I don't currently have a Pascal GPU to test with. Only an RTX at work and a Vega at home right now.

sebbbi • Jul 30 '19 13:07

Could you run the test by replacing loadRawBody.hlsli with this:

#include "hash.hlsli"
#include "loadConstantsGPU.h"

RWBuffer<float> output : register(u0);
ByteAddressBuffer sourceData : register(t0);

cbuffer CB0 : register(b0)
{
	LoadConstants loadConstants;
};

#define THREAD_GROUP_SIZE 256

groupshared float dummyLDS[THREAD_GROUP_SIZE];

[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID, uint gix : SV_GroupIndex)
{
	float4 value = 0.0;
	
#if defined(LOAD_INVARIANT)
    // All threads load from same address. Index is wave invariant.
	uint htid = 0;
#elif defined(LOAD_LINEAR)
	// Linearly increasing starting address to allow memory coalescing
	uint htid = gix;
#elif defined(LOAD_RANDOM)
    // Randomize start address offset (0-15) to prevent memory coalescing
	uint htid = hash1(gix) & 0xf;
#endif

	[loop]
	for (int i = 0; i < 256; ++i)
	{
		// Mask with runtime constant to prevent unwanted compiler optimizations
		uint elemIdx = (htid + i) | loadConstants.elementsMask;
		uint address = elemIdx * (4 * LOAD_WIDTH);

#if LOAD_WIDTH == 1
		value += sourceData.Load(address).xxxx;
#elif LOAD_WIDTH == 2
		value += sourceData.Load2(address).xyxy;
#elif LOAD_WIDTH == 3
		value += sourceData.Load3(address).xyzx; 
#elif LOAD_WIDTH == 4
		value += sourceData.Load4(address).xyzw;
#endif
	}

	// Linear write to LDS (no bank conflicts). Significantly faster than memory loads.
	dummyLDS[gix] = value.x + value.y + value.z + value.w;

	GroupMemoryBarrierWithGroupSync();

	// This branch is never taken, but the compiler doesn't know it
	// Optimizer would remove all the memory loads if the data wouldn't be potentially used
	[branch]
	if (loadConstants.writeIndex != 0xffffffff)
	{
        output[tid.x + tid.y] = dummyLDS[loadConstants.writeIndex];
    }
}

sebbbi • Jul 30 '19 13:07

Sorry for the late reply. On my side there is an improvement (but only on float4). That seems strange; I thought the improvement applied to every typed load when I tried last time.

Tested with driver 436.30, on a Pascal GTX 1080:

Before your suggested modification:

ByteAddressBuffer.Load4 uniform: 54.473ms 0.516x
ByteAddressBuffer.Load4 linear: 91.940ms 0.306x
ByteAddressBuffer.Load4 random: 165.405ms 0.170x

ByteAddressBuffer.Load4 unaligned uniform: 54.514ms 0.516x
ByteAddressBuffer.Load4 unaligned linear: 92.463ms 0.304x
ByteAddressBuffer.Load4 unaligned random: 166.277ms 0.169x

StructuredBuffer<float4>.Load uniform: 1.477ms 19.050x
StructuredBuffer<float4>.Load linear: 53.750ms 0.523x
StructuredBuffer<float4>.Load random: 53.708ms 0.524x

After:

ByteAddressBuffer.Load4 uniform: 53.939ms 0.522x
ByteAddressBuffer.Load4 linear: 48.284ms 0.583x
ByteAddressBuffer.Load4 random: 55.305ms 0.509x

ByteAddressBuffer.Load4 unaligned uniform: 53.890ms 0.522x
ByteAddressBuffer.Load4 unaligned linear: 48.283ms 0.583x
ByteAddressBuffer.Load4 unaligned random: 55.304ms 0.509x

StructuredBuffer<float4>.Load uniform: 1.479ms 19.042x
StructuredBuffer<float4>.Load linear: 53.726ms 0.524x
StructuredBuffer<float4>.Load random: 53.751ms 0.524x

zlnimda • Oct 21 '19 12:10

Same results with Ampere. No improvement there either.

sebbbi • Nov 18 '20 14:11