
Array ops not as optimized as they could be

Open rikkimax opened this issue 3 months ago • 9 comments

Following https://salykova.github.io/gemm-cpu, I decided to have a play with array ops, specifically for this kernel:

void kernel4(ref float[16] o, ref const float[16] a, ref const float[16] b, ref const float[16] c) {
    o[] = b[] * a[];
    o[] += c[];
}
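
For context, both of those array-op statements are lowered by the compiler to calls into druntime's arrayOp template in core.internal.array.operations. Roughly like this (a simplified sketch; the exact template arguments the compiler emits may differ):

// Hedged sketch: approximately what kernel4's two statements lower to.
// Operands and operators are passed in RPN order as template arguments;
// this instantiation is illustrative, not necessarily the exact one emitted.
import core.internal.array.operations : arrayOp;

void kernel4_lowered(ref float[16] o, ref const float[16] a, ref const float[16] b, ref const float[16] c) {
    arrayOp!(float[], const(float)[], const(float)[], "*", "=")(o[], b[], a[]); // o[] = b[] * a[];
    arrayOp!(float[], const(float)[], "+=")(o[], c[]);                          // o[] += c[];
}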

For all of these examples, you need -mattr=+avx512vl.

Right now the codegen for this is abysmal: it is nearly 90 lines full of vmovss and vmulss. Compare it to explicit SIMD types:

// float16 is a 512-bit SIMD vector type (__vector(float[16])), e.g. from core.simd
void kernel1(ref float16 o, ref const float16 a, ref const float16 b, ref const float16 c) {
    o = (b * a) + c;
}
_D7example7kernel1FKNhG16fKxNhQiKxQgKxQkZv:
.Lfunc_begin6:
        .cfi_startproc
        .loc    1 34 5 prologue_end
        vmovaps (%rdx), %ymm0
        vmovaps 32(%rdx), %ymm1
        vmovaps (%rsi), %ymm2
        vmovaps 32(%rsi), %ymm3
        vfmadd213ps     32(%rcx), %ymm1, %ymm3
        vfmadd213ps     (%rcx), %ymm0, %ymm2
        vmovaps %ymm2, (%rdi)
        vmovaps %ymm3, 32(%rdi)
        .loc    1 35 1
        vzeroupper
        retq

Quite different.

With some modifications to core.internal.array.operations, we can get the codegen for kernel4 down to:

0000000000000000 <_D4main7kernel4FKG16fKxG16fKxQgKxQkZv>:
   0:   62 d1 7c 48 28 00       vmovaps (%r8),%zmm0
   6:   62 f1 7c 48 59 02       vmulps (%rdx),%zmm0,%zmm0
   c:   62 f1 7c 48 29 01       vmovaps %zmm0,(%rcx)
  12:   62 d1 7c 48 58 01       vaddps (%r9),%zmm0,%zmm0
  18:   62 f1 7c 48 29 01       vmovaps %zmm0,(%rcx)
  1e:   c5 f8 77                vzeroupper
  21:   c3                      ret

The modifications required are:

Turn on the SIMD helper code (currently gated behind the DigitalMars version), and switch regsz from 16 to 64.

Swap out the store and load helpers for plain pointer casts: *(cast(vec*)p) = val; and return *cast(vec*)p; (sketched below).

Keep dmd's vectorizable behavior.
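
As a rough illustration of the store/load change (a sketch only; the actual helper names and signatures in operations.d differ), the helpers become plain pointer casts:

// Hedged sketch of pointer-cast load/store helpers; `vec` stands for the
// internal SIMD vector type (with regsz = 64 that would be a 512-bit vector).
vec load(vec, T)(const(T)* p) {
    return *cast(const(vec)*) p;
}

void store(vec, T)(T* p, const vec val) {
    *cast(vec*) p = val;
}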

With an additional forced inline of arrayOp (e.g. via pragma(inline, true)):

void kernel3(ref float[16] o, ref const float[16] a, ref const float[16] b, ref const float[16] c) {
    o[] = (b[] * a[]) + c[];
}

We then get:

0000000000000000 <_D4main7kernel3FKG16fKxG16fKxQgKxQkZv>:
   0:   62 d1 7c 48 28 00       vmovaps (%r8),%zmm0
   6:   62 f1 7c 48 59 02       vmulps (%rdx),%zmm0,%zmm0
   c:   62 d1 7c 48 58 01       vaddps (%r9),%zmm0,%zmm0
  12:   62 f1 7c 48 29 01       vmovaps %zmm0,(%rcx)
  18:   c5 f8 77                vzeroupper
  1b:   c3                      ret

It looks like LLVM cannot combine the multiply and add into an FMA here, but this is significantly better than the current codegen.

Thoughts?

rikkimax · Oct 05 '25 03:10

With --ffast-math:

0000000000000000 <_D4main7kernel3FKG16fKxG16fKxQgKxQkZv>:
   0:   62 d1 7c 48 28 00       vmovaps (%r8),%zmm0
   6:   62 f1 7c 48 28 0a       vmovaps (%rdx),%zmm1
   c:   62 d2 7d 48 a8 09       vfmadd213ps (%r9),%zmm0,%zmm1
  12:   62 f1 7c 48 29 09       vmovaps %zmm1,(%rcx)
  18:   c5 f8 77                vzeroupper
  1b:   c3                      ret

We then get the desired fused multiply-add instruction.

rikkimax · Oct 05 '25 03:10

Manual loop unrolling isn't enough.

    // Inside arrayOp, replacing the plain scalar loop: unroll the loop body
    // `step` times per iteration (res, pos, scalarExp and Args come from the
    // surrounding arrayOp implementation).
    static foreach(step; [1, 4, 8, 16]) {
        for(; pos + step <= res.length;) {
            static foreach(_; 0 .. step) {
                mixin(scalarExp!Args ~ ";");
                pos++;
            }
        }
    }

That only gets us to vfmadd213ss, i.e. a scalar (single-float) FMA, which is not good enough.

rikkimax · Oct 05 '25 04:10

Did you use -O3? -O is -O2; a lot of the more expensive vectorisation happens at -O3.

thewilsonator · Oct 05 '25 08:10

Did you use -O3? -O is -O2; a lot of the more expensive vectorisation happens at -O3.

-O3

rikkimax · Oct 05 '25 08:10

What does the equivalent C code do when compiled with clang?

thewilsonator · Oct 05 '25 08:10

C/C++ does not have array vector ops.

https://dlang.org/spec/arrays.html#array-operations

rikkimax · Oct 05 '25 08:10

I guess what's missing for LLVM to auto-vectorize these ops is @restrict (well, at least in such an isolated case, without the function being inlined and the restrict potentially being inferred from the call context). I had never looked at core.internal.array.operations before; so it does a manual vectorization for DMD, assuming that the slices never overlap.

And indeed, with LDC v1.41:

import ldc.attributes;
void kernel4_restrict(@restrict ref float[16] o, @restrict ref const float[16] a, @restrict ref const float[16] b, ref const float[16] c) {
    o[] = (b[] * a[]) + c[];
}

=> -O -mattr=+avx512vl -ffast-math (https://d.godbolt.org/z/z6xGz7Kn6):

void example.kernel4_restrict(ref float[16], ref const(float[16]), ref const(float[16]), ref const(float[16])):
        vmovups zmm0, zmmword ptr [rdx]
        vmovups zmm1, zmmword ptr [rsi]
        vfmadd213ps     zmm1, zmm0, zmmword ptr [rcx]
        vmovups zmmword ptr [rdi], zmm1
        vzeroupper
        ret

So if the array-ops spec requires non-overlapping slices (I haven't checked), then we should probably add @restrict in the druntime module.

kinke · Oct 05 '25 10:10

I don't see a place in the array ops module to put @restrict.

The kernel function I used above is user code.

https://github.com/ldc-developers/ldc/blob/32f6b5ba62429335c92a260265ae4060ce8342da/runtime/druntime/src/core/internal/array/operations.d#L36

The args parameter is where @restrict would need to go, and that's not supported.

rikkimax · Oct 05 '25 10:10

https://dlang.org/spec/arrays.html#array-operations, 13.12.6:

The slice on the left and any slices on the right must not overlap. All operands are evaluated exactly once, even if the array slice has zero elements in it.

So all good on the spec front.

The args parameter is where @restrict would need to go, and that's not supported.

I'm not sure if the problem is the tuple, and/or the types being slices. @restrict is only supported for pointer/ref parameters, not for slices; we'd have to unpack the slices into separate length+pointer params and apply @restrict to the pointers, as sketched below.
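
A minimal sketch of that unpacking idea, with hypothetical names (not actual druntime code):

import ldc.attributes : restrict;

// Hypothetical kernel: the slices are unpacked into raw pointers plus a
// length, so @restrict can be applied to the pointer parameters.
void mulKernel(T)(@restrict T* res, @restrict const(T)* a,
                  @restrict const(T)* b, size_t length) {
    foreach (i; 0 .. length)
        res[i] = a[i] * b[i];
}

// Slice-based wrapper forwarding to the pointer-based kernel.
void arrayMul(T)(T[] res, const(T)[] a, const(T)[] b) {
    mulKernel(res.ptr, a.ptr, b.ptr, res.length);
}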

Edit: The tuple is fine, so the problem is the slice types. E.g., this is vectorized:

import ldc.attributes : restrict;
import core.internal.traits : Filter;

T arrayOp(T, Args...)(@restrict ref T res, @restrict ref Filter!(isType, Args) args) {
    foreach (i; 0 .. res.length)
        res[i] = args[0][i] * args[1][i];
    return res;
}

enum isType(T) = true;
enum isType(alias a) = false;

void bla(ref float[16] o, ref const float[16] a, ref const float[16] b) {
    arrayOp!(float[16], const float[16], const float[16])(o, a, b);
}

=>

void example.bla(ref float[16], ref const(float[16]), ref const(float[16])):
        vmovups zmm0, zmmword ptr [rdx]
        vmulps  zmm0, zmm0, zmmword ptr [rsi]
        vmovups zmmword ptr [rdi], zmm0
        vzeroupper
        ret

kinke · Oct 05 '25 10:10