Array ops not as optimized as they could be
Following https://salykova.github.io/gemm-cpu, I decided to have a play with array ops, specifically for this kernel:
void kernel4(ref float[16] o, ref const float[16] a, ref const float[16] b, ref const float[16] c) {
o[] = b[] * a[];
o[] += c[];
}
For all of these examples you need -mattr=+avx512vl.
Right now the codegen for this is abysmal: nearly 90 lines full of vmovss and vmulss.
Compare it to explicit SIMD types:
void kernel1(ref float16 o, ref const float16 a, ref const float16 b, ref const float16 c) {
o = (b * a) + c;
}
_D7example7kernel1FKNhG16fKxNhQiKxQgKxQkZv:
.Lfunc_begin6:
.cfi_startproc
.loc 1 34 5 prologue_end
vmovaps (%rdx), %ymm0
vmovaps 32(%rdx), %ymm1
vmovaps (%rsi), %ymm2
vmovaps 32(%rsi), %ymm3
vfmadd213ps 32(%rcx), %ymm1, %ymm3
vfmadd213ps (%rcx), %ymm0, %ymm2
vmovaps %ymm2, (%rdi)
vmovaps %ymm3, 32(%rdi)
.loc 1 35 1
vzeroupper
retq
Quite different.
With some modifications to core.internal.array.operations, we can get the codegen for kernel4 down to:
0000000000000000 <_D4main7kernel4FKG16fKxG16fKxQgKxQkZv>:
0: 62 d1 7c 48 28 00 vmovaps (%r8),%zmm0
6: 62 f1 7c 48 59 02 vmulps (%rdx),%zmm0,%zmm0
c: 62 f1 7c 48 29 01 vmovaps %zmm0,(%rcx)
12: 62 d1 7c 48 58 01 vaddps (%r9),%zmm0,%zmm0
18: 62 f1 7c 48 29 01 vmovaps %zmm0,(%rcx)
1e: c5 f8 77 vzeroupper
21: c3 ret
The modifications required are:
Turn on the SIMD helper code (gated by DigitalMars).
Switch regsz from 16 to 64.
Swap out store+load for their more generic pointer casts: *(cast(vec*)p) = val; and return *cast(vec*)p;
Keep dmd's vectorizable behavior.
With an additional forced inline of arrayOp:
void kernel3(ref float[16] o, ref const float[16] a, ref const float[16] b, ref const float[16] c) {
o[] = (b[] * a[]) + c[];
}
We then get:
0000000000000000 <_D4main7kernel3FKG16fKxG16fKxQgKxQkZv>:
0: 62 d1 7c 48 28 00 vmovaps (%r8),%zmm0
6: 62 f1 7c 48 59 02 vmulps (%rdx),%zmm0,%zmm0
c: 62 d1 7c 48 58 01 vaddps (%r9),%zmm0,%zmm0
12: 62 f1 7c 48 29 01 vmovaps %zmm0,(%rcx)
18: c5 f8 77 vzeroupper
1b: c3 ret
It looks like LLVM cannot combine the multiply and add here, but this is still significantly better than what we have currently.
Thoughts?
With -ffast-math:
0000000000000000 <_D4main7kernel3FKG16fKxG16fKxQgKxQkZv>:
0: 62 d1 7c 48 28 00 vmovaps (%r8),%zmm0
6: 62 f1 7c 48 28 0a vmovaps (%rdx),%zmm1
c: 62 d2 7d 48 a8 09 vfmadd213ps (%r9),%zmm0,%zmm1
12: 62 f1 7c 48 29 09 vmovaps %zmm1,(%rcx)
18: c5 f8 77 vzeroupper
1b: c3 ret
We then get the desired instruction.
Manual loop unrolling isn't enough:
static foreach(step; [1, 4, 8, 16]) {
for(; pos + step <= res.length;) {
static foreach(_; 0 .. step) {
mixin(scalarExp!Args ~ ";");
pos++;
}
}
}
That only gets us to vfmadd213ss, which operates on a single float. Not good enough.
Did you use -O3? -O is -O2; a lot of the more expensive vectorisation happens at -O3.
did you use -O3? -O is -O2, a lot of the more expensive vectorisation happens at -O3.
-O3
What does the equivalent C code do when compiled with clang?
C/C++ do not have array vector ops.
https://dlang.org/spec/arrays.html#array-operations
I guess what's missing for LLVM to auto-vectorize these ops is @restrict (well, at least in such an isolated case, without it being inlined and the restrict potentially being inferred from the call context). I'd never looked at core.internal.array.operations before; so it does a manual vectorization for DMD, assuming that the slices never overlap.
And indeed, with LDC v1.41:
import ldc.attributes;
void kernel4_restrict(@restrict ref float[16] o, @restrict ref const float[16] a, @restrict ref const float[16] b, ref const float[16] c) {
o[] = (b[] * a[]) + c[];
}
=> -O -mattr=+avx512vl -ffast-math (https://d.godbolt.org/z/z6xGz7Kn6):
void example.kernel4_restrict(ref float[16], ref const(float[16]), ref const(float[16]), ref const(float[16])):
vmovups zmm0, zmmword ptr [rdx]
vmovups zmm1, zmmword ptr [rsi]
vfmadd213ps zmm1, zmm0, zmmword ptr [rcx]
vmovups zmmword ptr [rdi], zmm1
vzeroupper
ret
So if the array-ops spec requires non-overlapping slices (haven't checked), then we should probably add @restrict in the druntime module.
I don't see a place in the array ops module to put @restrict.
The kernel function I used above is user code.
https://github.com/ldc-developers/ldc/blob/32f6b5ba62429335c92a260265ae4060ce8342da/runtime/druntime/src/core/internal/array/operations.d#L36
The args parameter is where @restrict would need to go, and it's not supported.
https://dlang.org/spec/arrays.html#array-operations, 13.12.6:
The slice on the left and any slices on the right must not overlap. All operands are evaluated exactly once, even if the array slice has zero elements in it.
So all good on the spec front.
The args parameter is where @restrict would need to go, and it's not supported.
I'm not sure if the problem is the tuple and/or the types being slices. @restrict is only supported for pointer/ref parameters, not for slices; we'd have to unpack slices into separate length+pointer params and apply @restrict to the pointers.
Edit: The tuple is fine, so the problem is the slice types. E.g., this is vectorized:
import ldc.attributes : restrict;
import core.internal.traits : Filter;
T arrayOp(T, Args...)(@restrict ref T res, @restrict ref Filter!(isType, Args) args) {
foreach (i; 0 .. res.length)
res[i] = args[0][i] * args[1][i];
return res;
}
enum isType(T) = true;
enum isType(alias a) = false;
void bla(ref float[16] o, ref const float[16] a, ref const float[16] b) {
arrayOp!(float[16], const float[16], const float[16])(o, a, b);
}
=>
void example.bla(ref float[16], ref const(float[16]), ref const(float[16])):
vmovups zmm0, zmmword ptr [rdx]
vmulps zmm0, zmm0, zmmword ptr [rsi]
vmovups zmmword ptr [rdi], zmm0
vzeroupper
ret