perftest
SRV Load on NVidia
Hi, I was just testing and found the results for aligned loads on a Pascal GPU very strange.
Then I looked at the code and saw that you added several operations that don't guarantee that the load addresses are aligned in your aligned test case.
ByteAddressBuffer performance can be similar to structured buffer performance if you add the following, which ensures that the load addresses are correctly aligned in the shader:
const uint _WIDTH = LOAD_WIDTH * 4;
address = (address / _WIDTH) * _WIDTH;
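To show what the rounding does: it snaps the byte address down to the previous multiple of the load width. A minimal sketch with made-up example addresses (AlignToLoadWidth is just a hypothetical helper name, not something from the repo):
// Round a byte address down to the previous multiple of the load width.
// For LOAD_WIDTH == 4 (Load4), _WIDTH is 16 bytes.
uint AlignToLoadWidth(uint address)
{
    const uint _WIDTH = LOAD_WIDTH * 4;
    return (address / _WIDTH) * _WIDTH;
}
// Example with LOAD_WIDTH == 4: 37 -> 32, 48 -> 48 (already aligned, unchanged)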
Thanks for pointing this out. Originally the shader was properly aligned, but then I added the binary OR (with a runtime value) to ensure that compilers don't optimize the whole loop into a much smaller number of wide loads.
Definitely need to fix this issue. I should do the binary OR before the multiply by (4 * LOAD_WIDTH).
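To make the ordering concrete, a sketch (the commented-out "current" line is my reconstruction from the description above, not copied from the repo):
// Current ordering (reconstruction): the OR touches the low bits of the
// byte address itself, so it can break alignment:
//   uint address = (htid + i) * (4 * LOAD_WIDTH) | loadConstants.elementsMask;

// Intended ordering: OR the element index first, then multiply, so the
// resulting byte address is always a multiple of 4 * LOAD_WIDTH:
uint elemIdx = (htid + i) | loadConstants.elementsMask;
uint address = elemIdx * (4 * LOAD_WIDTH);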
Tested on RTX 2080 Ti (Turing) using 419.35 drivers. Neither multiply nor divide + multiply works. Still getting the same performance.
Now downloading new drivers to see whether Nvidia has finally implemented this optimization (I have been asking for it for a long time).
Tested with 431.60 drivers too. Alignment doesn't seem to help Turing.
This is the new address calc code:
uint elemIdx = (htid + i) | loadConstants.elementsMask;
uint address = elemIdx * (4 * LOAD_WIDTH);
Adding your alignment code before the load doesn't help either:
const uint _WIDTH = LOAD_WIDTH * 4;
address = (address / _WIDTH) * _WIDTH;
I don't currently have a Pascal GPU to test with. Only RTX at work and Vega at home right now.
Could you run the test by replacing loadRawBody.hlsli with this:
#include "hash.hlsli"
#include "loadConstantsGPU.h"
RWBuffer<float> output : register(u0);
ByteAddressBuffer sourceData : register(t0);
cbuffer CB0 : register(b0)
{
    LoadConstants loadConstants;
};

#define THREAD_GROUP_SIZE 256

groupshared float dummyLDS[THREAD_GROUP_SIZE];

[numthreads(THREAD_GROUP_SIZE, 1, 1)]
void main(uint3 tid : SV_DispatchThreadID, uint gix : SV_GroupIndex)
{
    float4 value = 0.0;

#if defined(LOAD_INVARIANT)
    // All threads load from same address. Index is wave invariant.
    uint htid = 0;
#elif defined(LOAD_LINEAR)
    // Linearly increasing starting address to allow memory coalescing
    uint htid = gix;
#elif defined(LOAD_RANDOM)
    // Randomize start address offset (0-15) to prevent memory coalescing
    uint htid = hash1(gix) & 0xf;
#endif

    [loop]
    for (int i = 0; i < 256; ++i)
    {
        // Mask with runtime constant to prevent unwanted compiler optimizations
        uint elemIdx = (htid + i) | loadConstants.elementsMask;
        uint address = elemIdx * (4 * LOAD_WIDTH);

#if LOAD_WIDTH == 1
        value += sourceData.Load(address).xxxx;
#elif LOAD_WIDTH == 2
        value += sourceData.Load2(address).xyxy;
#elif LOAD_WIDTH == 3
        value += sourceData.Load3(address).xyzx;
#elif LOAD_WIDTH == 4
        value += sourceData.Load4(address).xyzw;
#endif
    }

    // Linear write to LDS (no bank conflicts). Significantly faster than memory loads.
    dummyLDS[gix] = value.x + value.y + value.z + value.w;

    GroupMemoryBarrierWithGroupSync();

    // This branch is never taken, but the compiler doesn't know it
    // Optimizer would remove all the memory loads if the data wouldn't be potentially used
    [branch]
    if (loadConstants.writeIndex != 0xffffffff)
    {
        output[tid.x + tid.y] = dummyLDS[loadConstants.writeIndex];
    }
}
Sorry for the late reply.
On my side, there is an improvement (but only on float4).
It seems strange. I thought the improvement was on every typed load when I tried last time.
Tested with driver 436.30, on Pascal 1080 GTX:
Before your suggested modification:
ByteAddressBuffer.Load4 uniform: 54.473ms 0.516x
ByteAddressBuffer.Load4 linear: 91.940ms 0.306x
ByteAddressBuffer.Load4 random: 165.405ms 0.170x
ByteAddressBuffer.Load4 unaligned uniform: 54.514ms 0.516x
ByteAddressBuffer.Load4 unaligned linear: 92.463ms 0.304x
ByteAddressBuffer.Load4 unaligned random: 166.277ms 0.169x
StructuredBuffer<float4>.Load uniform: 1.477ms 19.050x
StructuredBuffer<float4>.Load linear: 53.750ms 0.523x
StructuredBuffer<float4>.Load random: 53.708ms 0.524x
After:
ByteAddressBuffer.Load4 uniform: 53.939ms 0.522x
ByteAddressBuffer.Load4 linear: 48.284ms 0.583x
ByteAddressBuffer.Load4 random: 55.305ms 0.509x
ByteAddressBuffer.Load4 unaligned uniform: 53.890ms 0.522x
ByteAddressBuffer.Load4 unaligned linear: 48.283ms 0.583x
ByteAddressBuffer.Load4 unaligned random: 55.304ms 0.509x
StructuredBuffer<float4>.Load uniform: 1.479ms 19.042x
StructuredBuffer<float4>.Load linear: 53.726ms 0.524x
StructuredBuffer<float4>.Load random: 53.751ms 0.524x
Same results with Ampere. No improvement there either.