ldc

Test failures on linux aarch64

Open the-horo opened this issue 1 year ago • 3 comments

Some tests fail on my Raspberry Pi 4 running Gentoo.

The first issue is with std.internal.math.gammafunction:

ctest -R 'std.internal.math.gammafunction'
Test project /home/ldc/build
    Start  377: std.internal.math.gammafunction
1/4 Test  #377: std.internal.math.gammafunction ................***Failed    0.02 sec
    Start  812: std.internal.math.gammafunction-debug
2/4 Test  #812: std.internal.math.gammafunction-debug ..........***Failed    5.70 sec
    Start 1247: std.internal.math.gammafunction-shared
3/4 Test #1247: std.internal.math.gammafunction-shared .........***Failed    0.04 sec
    Start 1682: std.internal.math.gammafunction-debug-shared
4/4 Test #1682: std.internal.math.gammafunction-debug-shared ...***Failed    0.07 sec

This is because https://github.com/dlang/phobos/blob/14b23633b762cfd7b03614dca4c6b0cafa1016e5/std/internal/math/gammafunction.d#L396 uses real.mant_dig, which is 64 on my PC but 113 on the Pi. I guess the solution is to replace it with a constant rather than tie it to real, since the precision of that type varies by platform, but I'm not 100% sure what to do.
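A quick way to see which `real` format a target uses is to print its properties. The 64 vs. 113 difference above would show up in `real.mant_dig`. This is a minimal sketch. The check `real.epsilon == 2^(1 - real.mant_dig)` is a general property of binary floating-point formats, not something taken from the linked Phobos code, and it shows that anything derived from `real.epsilon` also changes per platform.

```d
import std.math : ldexp;
import std.stdio : writefln;

void main()
{
    // Mantissa bits of `real` on this target, e.g. 64 for the x87
    // 80-bit extended format or 113 for IEEE quadruple precision.
    writefln("real.mant_dig = %s", real.mant_dig);
    writefln("real.sizeof   = %s bytes", real.sizeof);

    // epsilon is the gap between 1.0 and the next representable value,
    // i.e. 2^(1 - mant_dig), so constants built from it vary as well.
    assert(real.epsilon == ldexp(1.0L, 1 - real.mant_dig));
    writefln("1 / real.epsilon = %g", 1 / real.epsilon);
}
```

Running this on both machines should make the difference explicit without reading the Phobos source.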

Second, std.math.exponential:

ctest -R 'std.math.exponential'
Test project /home/ldc/build
    Start  397: std.math.exponential
1/4 Test  #397: std.math.exponential ................***Failed    0.02 sec
    Start  832: std.math.exponential-debug
2/4 Test  #832: std.math.exponential-debug ..........   Passed    0.03 sec
    Start 1267: std.math.exponential-shared
3/4 Test #1267: std.math.exponential-shared .........***Failed    0.04 sec
    Start 1702: std.math.exponential-debug-shared
4/4 Test #1702: std.math.exponential-debug-shared ...   Passed    0.07 sec

This is odd, since only the release builds fail. I've tried to reduce the test case to:

real pow(real x, real y) @trusted @nogc pure nothrow
{
        long iy = cast(long) y;
        //assert(iy != y);
        if (iy == y) {
                assert(false);
        }
        assert(false);
}

void main () {
        import std.math.traits : isNaN;
        assert(isNaN(pow(-1.0L, 1/real.epsilon - 0.5L)));
}

The problem with this one is that the assert fails on a different line depending on whether optimizations are enabled:

../bin/ldc2 -O -run repro.d
core.exception.AssertError@repro.d(6): Assertion failure
----------------
??:? [0x556070c0bf]
??:? [0x556070bd0b]
??:? [0x556072f28f]
??:? [0x5560711b6b]
??:? [0x556070a9d7]
??:? [0x5560709faf]
??:? [0x5560711837]
??:? [0x556071171b]
??:? [0x5560711587]
??:? [0x7fbd44738b]
??:? __libc_start_main [0x7fbd44745f]
??:? [0x5560709eaf]
Error: /tmp/repro-d8ac08 failed with status: 1

and

./bin/ldc2 -run repro.d
core.exception.AssertError@repro.d(8): Assertion failure
----------------
??:? [0x557b4cc1db]
??:? [0x557b4cbe27]
??:? [0x557b4ef3ab]
??:? [0x557b4d1c87]
??:? [0x557b4caaf3]
??:? [0x557b4ca027]
??:? [0x557b4ca043]
??:? [0x557b4d1953]
??:? [0x557b4d1837]
??:? [0x557b4d16a3]
??:? [0x557b4ca0cb]
??:? [0x7f9668738b]
??:? __libc_start_main [0x7f9668745f]
??:? [0x557b4c9eaf]
Error: /tmp/repro-1c99cf failed with status: 1

It's possible that bad code is being generated. One more observation, though I don't know how helpful it is: uncommenting the //assert line makes the optimized build skip the if (and fail normally).

The last problem I had was with core.thread.fiber:

ctest --timeout 10 -R 'core.thread.fiber'
Test project /home/ldc/build
    Start  130: core.thread.fiber
1/4 Test  #130: core.thread.fiber ................***Timeout  10.03 sec
    Start  565: core.thread.fiber-debug
2/4 Test  #565: core.thread.fiber-debug ..........   Passed    0.26 sec
    Start 1000: core.thread.fiber-shared
3/4 Test #1000: core.thread.fiber-shared .........***Timeout  10.02 sec
    Start 1435: core.thread.fiber-debug-shared
4/4 Test #1435: core.thread.fiber-debug-shared ...   Passed    0.29 sec

The release builds hang. Running the tests directly I get:

./runtime/druntime-test-runner core.thread.fiber
Not safe to migrate Fibers between Threads on your system. Consider setting version CheckFiberMigration for this system in thread.d
****** FAIL release64 core.thread.fiber
core.exception.AssertError@core/thread/fiber.d(1078): HOLD != EXEC
----------------
??:? [0x559245bb43]
??:? [0x559245b527]
??:? [0x55924ca597]
??:? [0x55924cbc3b]
??:? [0x55923af9e3]
??:? [0x559247070b]
??:? [0x559246c77f]
??:? [0x55924ef213]
^C
./runtime/druntime-test-runner-shared core.thread.fiber
Not safe to migrate Fibers between Threads on your system. Consider setting version CheckFiberMigration for this system in thread.d
****** FAIL release64 core.thread.fiber
core.exception.AssertError@core/thread/fiber.d(1077): Fiber.yield() called with no active fiber
----------------
??:? _d_assert_msg [0x7fbf10bbc3]
??:? void core.thread.fiber.TestFiber.run() [0x7fbf1cc31f]
??:? fiber_entryPoint [0x7fbf1c83bb]
??:? [0x7fbf24ca23]
^C

Enabling CheckFiberMigration doesn't solve it:

./runtime/druntime-test-runner-shared core.thread.fiber
****** FAIL release64 core.thread.fiber
core.exception.AssertError@core/thread/fiber.d(1078): Fiber.yield() called with no active fiber
----------------
??:? _d_assert_msg [0x7f8555bbe3]
??:? void core.thread.fiber.TestFiber.run() [0x7f8561c3cb]
??:? fiber_entryPoint [0x7f856183db]
??:? [0x7f8569cb43]
^C

Even better, I'm also sometimes getting segmentation faults:

./runtime/druntime-test-runner-shared core.thread.fiber
/home/ldc/build/lib/libdruntime-ldc-unittest-shared.so.108(_D4core7runtime18runModuleUnitTestsUZ19unittestSegvHandlerUNbNiiPSQCm3sys5posix6signal9siginfo_tPvZv+0x24)[0x7f81537500]
linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x7f8167d7a8]
/home/ldc/build/lib/libdruntime-ldc-unittest-shared.so.108(_D4core6thread5fiber5Fiber9switchOutMFNbNiZv+0x1c)[0x7f8154b458]
/home/ldc/build/lib/libdruntime-ldc-unittest-shared.so.108(_D4core6thread5fiber9TestFiber3runMFZv+0x5c)[0x7f8154c38c]
/home/ldc/build/lib/libdruntime-ldc-unittest-shared.so.108(fiber_entryPoint+0x68)[0x7f815483dc]
/home/ldc/build/lib/libdruntime-ldc-unittest-shared.so.108(+0x1dcb44)[0x7f815ccb44]
Segmentation fault (core dumped)

I have no idea how to approach fixing this one.

the-horo, Apr 01 '24 20:04

This is a subset of https://github.com/ldc-developers/ldc/blob/e170ca51e673c258437915a8aff3feb31ea6802b/.cirrus.yml#L56-L64 (it's been a while since I last checked whether anything has improved), which is more up to date than https://github.com/ldc-developers/ldc/issues/2153#issuecomment-626028446 from the AArch64 tracker issue.

IIRC, the math issues boil down to 2 problems: one is slightly incomplete support for 128-bit quadruple-precision real in upstream Phobos, the other is something special regarding optimized code and NaNs on AArch64 (not preserving the NaN payload or something like that).

The sporadic core.thread.fiber failures with optimizations enabled happen on macOS arm64 too, unlike the math issues (Apple uses a 64-bit real, on AArch64 too).

kinke, Apr 01 '24 23:04

Thanks for the new links; I should have looked a bit harder before opening the issue.

> IIRC, the math issues boil down to 2 problems - one being slightly incomplete 128-bit quadruple-precision real support in upstream Phobos, and something special wrt. optimized code and NaNs on AArch64 (not preserving the NaN payload or something like that).

The std.math.exponential bug doesn't look like it involves NaNs. The failure is caused by that if statement not being skipped, even though y is 5.1923e+33, which shouldn't be representable as a long. I don't know how floating-point numbers work in assembly though, let alone on AArch64, so I won't try to dissect this issue further.
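For context, the reduced repro evaluates `cast(long) y` with y = 1 / real.epsilon - 0.5L, which is 2^112 - 0.5 (about 5.1923e+33) when real has a 113-bit mantissa. That is far beyond long.max (about 9.22e+18). LLVM's LangRef specifies that `fptosi` yields a poison value when the result does not fit in the target type. Assuming LDC lowers this cast to `fptosi`, the optimizer would be free to fold `iy == y` either way, which might explain why -O and non-optimized builds diverge. The sketch below shows one way to test for a fractional part without the out-of-range cast. `hasNoFraction` is a hypothetical helper for illustration, not Phobos API and not the actual fix.

```d
import std.math.rounding : trunc;
import std.stdio : writeln;

/// Hypothetical helper (not part of Phobos): true if `y` has no fractional
/// part. It avoids `cast(long) y`, which (assuming LDC emits LLVM fptosi)
/// is poison for out-of-range y, so it stays well-defined for huge values.
bool hasNoFraction(real y) @safe pure nothrow @nogc
{
    // NaN compares unequal to everything, so NaN yields false;
    // +/-infinity truncates to itself and yields true.
    return trunc(y) == y;
}

void main()
{
    // Same value as in the reduced repro: 2^(mant_dig - 1) - 0.5.
    const y = 1 / real.epsilon - 0.5L;

    // With a 113-bit mantissa y does not fit in a long.
    writeln("y > long.max: ", y > long.max);

    // The 0.5 fraction is exactly representable, so this is false.
    writeln("hasNoFraction(y): ", hasNoFraction(y));
}
```

Comparing `trunc(y)` with `y` keeps the whole check in floating point, so its result does not depend on the range of any integer type.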

One last thing: should CheckFiberMigration be set for all AArch64 systems, not just Apple ones, since I was getting that warning? Or is it safe to ignore?

the-horo, Apr 02 '24 04:04

FYI, it is this unittest in core.thread.fiber that causes trouble on AArch64 (both macOS and linux-musl): https://github.com/ldc-developers/ldc/blob/40ad5f7583fe90a85fd675bfc0cfc287d565c94a/runtime/druntime/src/core/thread/fiber.d#L2303-L2350

    version(AArch64)
        return;

fixes things for me (on both macOS and Linux musl).

JohanEngelen, May 05 '24 22:05