ispc icon indicating copy to clipboard operation
ispc copied to clipboard

Short uniform vector real size

Open nurmukhametov opened this issue 1 year ago • 11 comments

It looks like the short vector size is actually rounded up to the next power of 2. It is probably done for small sizes due to some performance concerns (e.g., 3 to 4), despite the fact that it is error prone (easy to write an out-of-bound read/write). If short vector has bigger size then it also rounded up. For the example below (https://godbolt.org/z/1P7TE961o), the real size if 512 instead of expected 400.

uniform int foo() {
    return sizeof(uniform float<100>);
}
foo___:                                 # @foo___
        mov     eax, 512
        ret

Rounding is implemented here https://github.com/ispc/ispc/blob/80dccf9c2ff614cb5e144250757e80b1d2263f3e/src/type.cpp#L1992-L2004

nurmukhametov avatar Nov 12 '24 16:11 nurmukhametov

Note that we changes this part of short vectors definition several times forth and back - every time by users requests. And every time we broke something. It's quite sensitive nuance that affects real user code.

dbabokin avatar Nov 12 '24 19:11 dbabokin

The problem that I encountered that the deference of uniform float<N> *uniform is often out-of-bound read/write. Consider the following example (https://godbolt.org/z/nYbc5Pa3v):

#define N 10
uniform float<N> foo(uniform float<N> *uniform ptr) {
    return *ptr;
}

The generated code actually reads 16 floats instead of 10.

foo___un_3C_unf_3C_10_3E__3E_:          # @foo___un_3C_unf_3C_10_3E__3E_
        vmovaps ymm0, ymmword ptr [rdi]
        vmovaps ymm1, ymmword ptr [rdi + 32]
        ret

Here we have memory errors for all N values except powers of 2.

nurmukhametov avatar Nov 12 '24 19:11 nurmukhametov

I think there are two different things here to consider:

  • the size the data occupies when it's a part of a struct - i.e. how much padding it needs
  • how much data is actually read. Here it's a bit tricky, as reading smaller number of values (for correctness) may force code gen to use slower reads - you cannot use vector loads, even masked (I think), as in this case the full vector size needs to be loaded and masking is applied afterwards triggering all out of bound faults anyways. So this is done to enable vector loads/stores for short vectors.

dbabokin avatar Nov 12 '24 20:11 dbabokin

you cannot use vector loads, even masked (I think), as in this case the full vector size needs to be loaded and masking is applied afterwards triggering all out of bound faults anyways.

Masked movs should be fine.

Faults occur only due to mask-bit required memory accesses that caused the faults. Faults will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0. For example, no faults will be detected if the mask bits are all zero.

pbrubaker avatar Jan 30 '25 11:01 pbrubaker

Consider the following example:

uniform int<3> max(uniform int<3> a, uniform int<3> b)
{
    uniform int<3> r;
    foreach (i = 0 ... 3) {
        r[i] = max(a[i], b[i]);
    }
	return r;
}

compiler explorer link: https://ispc.godbolt.org/z/rdPcKd9dT

When the size of int<3> is 4, the generated code is as follows:

max___uni_3C_3_3E_uni_3C_3_3E_:         # @max___uni_3C_3_3E_uni_3C_3_3E_
        vpxor   xmm2, xmm2, xmm2
        vpblendd        xmm0, xmm0, xmm2, 8             # xmm0 = xmm0[0,1,2],xmm2[3]
        vpblendd        xmm1, xmm1, xmm2, 8             # xmm1 = xmm1[0,1,2],xmm2[3]
        vpmaxsd xmm0, xmm0, xmm1
        ret
define <4 x i32> @max___uni_3C_3_3E_uni_3C_3_3E_(<4 x i32> %a, <4 x i32> %b, <4 x i1> %__mask) local_unnamed_addr #0 !dbg !31 {
allocas:
  %0 = insertelement <4 x i32> %a, i32 0, i64 3
  %1 = insertelement <4 x i32> %b, i32 0, i64 3
  %2 = tail call <4 x i32> @llvm.smax.v4i32(<4 x i32> %0, <4 x i32> %1)
  %v1.i115 = insertelement <4 x i32> %2, i32 undef, i64 3
  ret <4 x i32> %v1.i115, !dbg !58
}

When the size of int<3> is 3, the generated code is simpler:

max___uni_3C_3_3E_uni_3C_3_3E_:         # @max___uni_3C_3_3E_uni_3C_3_3E_
        vpmaxsd xmm0, xmm0, xmm1
        ret
define <3 x i32> @max___uni_3C_3_3E_uni_3C_3_3E_(<3 x i32> %a, <3 x i32> %b, <4 x i1> %__mask) local_unnamed_addr #0 {
allocas:
  %0 = call <3 x i32> @llvm.smax.v3i32(<3 x i32> %a, <3 x i32> %b)
  ret <3 x i32> %0
}

nurmukhametov avatar Feb 13 '25 18:02 nurmukhametov

It seems that the compiler thinks it needs to zero the 4th component when sizeof(uniform int<3>) == 4 but not when sizeof(uniform int<3>) == 3..

pbrubaker avatar Feb 16 '25 00:02 pbrubaker

Note that we changes this part of short vectors definition several times forth and back - every time by users requests. And every time we broke something. It's quite sensitive nuance that affects real user code.

As I understand it, rounding up has been in place since the beginning. There was a fix related to an issue (https://github.com/ispc/ispc/issues/1323), but removing the rounding up doesn't cause this test to fail. Moreover, testing ISPC with the VectorType::getVectorMemoryCount returning just elementCount.fixedCount for uniform short vectors shows no noticeable performance difference on GameDev benchmarks.

nurmukhametov avatar Feb 17 '25 14:02 nurmukhametov

Moreover, the size of a uniform short vector depends on the target. For example, uniform int<4> has 4 elements for AVX512 x4/x8/x16, but 8 for x32 and 16 for x64 (https://ispc.godbolt.org/z/P64hs69z9).

nurmukhametov avatar Feb 21 '25 12:02 nurmukhametov

Moreover, the size of a uniform short vector depends on the target. For example, uniform int<4> has 4 elements for AVX512 x4/x8/x16, but 8 for x32 and 16 for x64 (https://ispc.godbolt.org/z/P64hs69z9).

This is how it was as long as I can remember. uniform int<N> was always aligned to the vector width specified by the target. That's why it seems odd to me that the compiler is zeroing the 4th component of uniform int<3> in either case.

It also doesn't error on the case where foreach( i = 0 ... 4 ) here but also produces better code. https://ispc.godbolt.org/z/7sjE5sqxd

uniform int<3> max(uniform int<3> a, uniform int<3> b)
{
    uniform int<3> r;
    foreach (i = 0 ... 4) {
        r[i] = max(a[i], b[i]);
    }
	return r;
}
max___uni_3C_3_3E_uni_3C_3_3E_:         # @max___uni_3C_3_3E_uni_3C_3_3E_
        vpmaxsd xmm0, xmm0, xmm1
        ret

This part of the issue is related to #2321.

pbrubaker avatar Feb 23 '25 10:02 pbrubaker

It also doesn't error on the case where foreach( i = 0 ... 4 ) here but also produces better code.

There is no error even with foreach( i = 0 ... 80)in your example.

nurmukhametov avatar Feb 24 '25 15:02 nurmukhametov

It also doesn't error on the case where foreach( i = 0 ... 4 ) here but also produces better code.

There is no error even with foreach( i = 0 ... 80)in your example.

Oh my. That's definitely worse than I thought.

pbrubaker avatar Feb 27 '25 14:02 pbrubaker