codegen issue for vectors
It seems that ldc2 has a bit of trouble generating IR for vectorized computations. I believe this could be connected to this issue in the intel-intrinsics library: https://github.com/AuburnSounds/intel-intrinsics/issues/86 https://godbolt.org/z/aKh4rWY83
These two functions compile to the same single vector sqrt CPU instruction (on both ARM and Intel targets), as expected:
import core.simd : double2, double4;
import ldc.intrinsics : llvm_sqrt;
import ldc.llvmasm : __ir_pure, __irEx_pure;

auto sqrt(double2 a)
{
    a.ptr[0] = llvm_sqrt(a.array[0]);
    a.ptr[1] = llvm_sqrt(a.array[1]);
    return a;
}

auto sqrt2(double2 a)
{
    return __irEx_pure!(
        `declare <2 x double> @llvm.sqrt.v2f64(<2 x double> %Val)`,
        `%r = call <2 x double> @llvm.sqrt.v2f64(<2 x double> %0)
         ret <2 x double> %r`, "", double2)(a);
}
But when I try to use them in another function that handles partial vectors, the first one causes problems:
auto get_low(double4 d4)
{
    return __ir_pure!(
        `%r = shufflevector <4 x double> %0, <4 x double> undef, <2 x i32> <i32 0, i32 1>
         ret <2 x double> %r`, double2)(d4);
}

auto sqrt(double4 d4)
{
    return sqrt(get_low(d4));
}

auto sqrt2(double4 d4)
{
    return sqrt2(get_low(d4));
}
Now we get:
pure nothrow @nogc __vector(double[2]) example.sqrt(__vector(double[4])):
    fsqrt d1, d0
    mov d0, v0.d[1]
    fsqrt d0, d0
    mov v1.d[1], v0.d[0]
    mov v0.16b, v1.16b
    ret

pure nothrow @nogc @safe __vector(double[2]) example.sqrt2(__vector(double[4])):
    fsqrt v0.2d, v0.2d
    ret
This issue seems quite problematic for aarch64 (or any target without 256-bit vectors), where a double4 is really two double2 registers. Yet it does not seem possible to take a function that operates on double2 and apply it twice to the two halves of a double4.
Note that llvm_sqrt supports vectors, so this should be as simple as:
import ldc.intrinsics : llvm_sqrt;
import core.simd;
auto sqrt2(double2 x) { return llvm_sqrt(x); }
auto sqrt4(double4 x) { return llvm_sqrt(x); }
https://run.dlang.io/is/A9OILJ
Oh, I was using sqrt just as an example. I also need log/exp, and they exist in nice SSE form. I was thinking about a generic template like:
auto promote(alias fun)(double4 d4)
{
    double2 low = get_low(d4);
    double2 high = get_high(d4);
    low = fun(low);
    high = fun(high);
    return combine(high, low);
}
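For completeness, get_high and combine are not shown above; in the same inline-IR style as get_low they might look like this (a sketch; the names and the argument order of combine are assumed from the call in promote):

auto get_high(double4 d4)
{
    return __ir_pure!(
        `%r = shufflevector <4 x double> %0, <4 x double> undef, <2 x i32> <i32 2, i32 3>
         ret <2 x double> %r`, double2)(d4);
}

// %0 is high, %1 is low; low lands in lanes 0-1, high in lanes 2-3
auto combine(double2 high, double2 low)
{
    return __ir_pure!(
        `%r = shufflevector <2 x double> %1, <2 x double> %0, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
         ret <4 x double> %r`, double4)(high, low);
}

With these in place, something like promote!(a => sqrt(a))(d4) would apply the double2 version to both halves.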
There should be no reason for LDC to break double2 vectors apart and switch to per-entry operations, and indeed this does not happen when inline IR is used.
All is good if there is a vectorized intrinsic for something, but if not, we can run into something like this: https://github.com/AuburnSounds/intel-intrinsics/issues/86 https://github.com/AuburnSounds/intel-intrinsics/blob/master/source/inteli/emmintrin.d#L1222 Somehow the optimizer forgets that there is already a vector and tries to dig inside the per-entry loop.
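To make the failure mode concrete, the pattern in question is a scalar fallback that loops over the lanes of a vector that already exists, along these lines (a sketch, not the actual intel-intrinsics code; sqrt_per_lane is a made-up name):

import core.simd : double2;
import ldc.intrinsics : llvm_sqrt;

// Scalar fallback: a per-lane loop over an existing vector.
// The optimizer has to rediscover the vector operation itself, and sometimes fails to.
double2 sqrt_per_lane(double2 a)
{
    foreach (i; 0 .. 2)
        a.ptr[i] = llvm_sqrt(a.array[i]);
    return a;
}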
I fail to see the problem in the generated IR for sqrt(double2). The scalar access goes through a reinterpret-as-array-then-index GEP, which might trip up the optimizer, but that makes it clearly an LLVM issue IMO.
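For reference, the shape being described is roughly the following (a hand-written sketch, not LDC's verbatim output): instead of extracting the lane directly, the vector is reinterpreted as an array and the element is loaded through a GEP:

; reinterpret-as-array-then-index (sketch)
%arr = bitcast <2 x double>* %a to [2 x double]*
%p = getelementptr inbounds [2 x double], [2 x double]* %arr, i64 0, i64 0
%e = load double, double* %p

; versus the direct lane access:
%e2 = extractelement <2 x double> %v, i64 0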
Ah, maybe this is an LLVM issue. This is quite far over my head, I am afraid. The end result for double4 is weird. How should I proceed? What would be a good LLVM IR reproducer to submit as an LLVM issue?
clang does not have a similar issue, according to this: https://github.com/AuburnSounds/intel-intrinsics/issues/86#issuecomment-997116210