Short uniform vector real size
It looks like the short vector size is actually rounded up to the next power of 2. It is probably done for small sizes due to some performance concerns (e.g., 3 to 4), despite the fact that it is error prone (easy to write an out-of-bound read/write). If short vector has bigger size then it also rounded up. For the example below (https://godbolt.org/z/1P7TE961o), the real size if 512 instead of expected 400.
uniform int foo() {
return sizeof(uniform float<100>);
}
foo___: # @foo___
mov eax, 512
ret
Rounding is implemented here https://github.com/ispc/ispc/blob/80dccf9c2ff614cb5e144250757e80b1d2263f3e/src/type.cpp#L1992-L2004
Note that we changes this part of short vectors definition several times forth and back - every time by users requests. And every time we broke something. It's quite sensitive nuance that affects real user code.
The problem that I encountered that the deference of uniform float<N> *uniform is often out-of-bound read/write. Consider the following example (https://godbolt.org/z/nYbc5Pa3v):
#define N 10
uniform float<N> foo(uniform float<N> *uniform ptr) {
return *ptr;
}
The generated code actually reads 16 floats instead of 10.
foo___un_3C_unf_3C_10_3E__3E_: # @foo___un_3C_unf_3C_10_3E__3E_
vmovaps ymm0, ymmword ptr [rdi]
vmovaps ymm1, ymmword ptr [rdi + 32]
ret
Here we have memory errors for all N values except powers of 2.
I think there are two different things here to consider:
- the size the data occupies when it's a part of a
struct- i.e. how much padding it needs - how much data is actually read. Here it's a bit tricky, as reading smaller number of values (for correctness) may force code gen to use slower reads - you cannot use vector loads, even masked (I think), as in this case the full vector size needs to be loaded and masking is applied afterwards triggering all out of bound faults anyways. So this is done to enable vector loads/stores for short vectors.
you cannot use vector loads, even masked (I think), as in this case the full vector size needs to be loaded and masking is applied afterwards triggering all out of bound faults anyways.
Masked movs should be fine.
Consider the following example:
uniform int<3> max(uniform int<3> a, uniform int<3> b)
{
uniform int<3> r;
foreach (i = 0 ... 3) {
r[i] = max(a[i], b[i]);
}
return r;
}
compiler explorer link: https://ispc.godbolt.org/z/rdPcKd9dT
When the size of int<3> is 4, the generated code is as follows:
max___uni_3C_3_3E_uni_3C_3_3E_: # @max___uni_3C_3_3E_uni_3C_3_3E_
vpxor xmm2, xmm2, xmm2
vpblendd xmm0, xmm0, xmm2, 8 # xmm0 = xmm0[0,1,2],xmm2[3]
vpblendd xmm1, xmm1, xmm2, 8 # xmm1 = xmm1[0,1,2],xmm2[3]
vpmaxsd xmm0, xmm0, xmm1
ret
define <4 x i32> @max___uni_3C_3_3E_uni_3C_3_3E_(<4 x i32> %a, <4 x i32> %b, <4 x i1> %__mask) local_unnamed_addr #0 !dbg !31 {
allocas:
%0 = insertelement <4 x i32> %a, i32 0, i64 3
%1 = insertelement <4 x i32> %b, i32 0, i64 3
%2 = tail call <4 x i32> @llvm.smax.v4i32(<4 x i32> %0, <4 x i32> %1)
%v1.i115 = insertelement <4 x i32> %2, i32 undef, i64 3
ret <4 x i32> %v1.i115, !dbg !58
}
When the size of int<3> is 3, the generated code is simpler:
max___uni_3C_3_3E_uni_3C_3_3E_: # @max___uni_3C_3_3E_uni_3C_3_3E_
vpmaxsd xmm0, xmm0, xmm1
ret
define <3 x i32> @max___uni_3C_3_3E_uni_3C_3_3E_(<3 x i32> %a, <3 x i32> %b, <4 x i1> %__mask) local_unnamed_addr #0 {
allocas:
%0 = call <3 x i32> @llvm.smax.v3i32(<3 x i32> %a, <3 x i32> %b)
ret <3 x i32> %0
}
It seems that the compiler thinks it needs to zero the 4th component when sizeof(uniform int<3>) == 4 but not when sizeof(uniform int<3>) == 3..
Note that we changes this part of short vectors definition several times forth and back - every time by users requests. And every time we broke something. It's quite sensitive nuance that affects real user code.
As I understand it, rounding up has been in place since the beginning. There was a fix related to an issue (https://github.com/ispc/ispc/issues/1323), but removing the rounding up doesn't cause this test to fail. Moreover, testing ISPC with the VectorType::getVectorMemoryCount returning just elementCount.fixedCount for uniform short vectors shows no noticeable performance difference on GameDev benchmarks.
Moreover, the size of a uniform short vector depends on the target. For example, uniform int<4> has 4 elements for AVX512 x4/x8/x16, but 8 for x32 and 16 for x64 (https://ispc.godbolt.org/z/P64hs69z9).
Moreover, the size of a uniform short vector depends on the target. For example,
uniform int<4>has 4 elements for AVX512 x4/x8/x16, but 8 for x32 and 16 for x64 (https://ispc.godbolt.org/z/P64hs69z9).
This is how it was as long as I can remember. uniform int<N> was always aligned to the vector width specified by the target. That's why it seems odd to me that the compiler is zeroing the 4th component of uniform int<3> in either case.
It also doesn't error on the case where foreach( i = 0 ... 4 ) here but also produces better code. https://ispc.godbolt.org/z/7sjE5sqxd
uniform int<3> max(uniform int<3> a, uniform int<3> b)
{
uniform int<3> r;
foreach (i = 0 ... 4) {
r[i] = max(a[i], b[i]);
}
return r;
}
max___uni_3C_3_3E_uni_3C_3_3E_: # @max___uni_3C_3_3E_uni_3C_3_3E_
vpmaxsd xmm0, xmm0, xmm1
ret
This part of the issue is related to #2321.
It also doesn't error on the case where
foreach( i = 0 ... 4 )here but also produces better code.
There is no error even with foreach( i = 0 ... 80)in your example.
It also doesn't error on the case where
foreach( i = 0 ... 4 )here but also produces better code.There is no error even with
foreach( i = 0 ... 80)in your example.
Oh my. That's definitely worse than I thought.