highway icon indicating copy to clipboard operation
highway copied to clipboard

unexpected emu128 failures

Open mo271 opened this issue 2 years ago • 11 comments

After a recent highway upgrade https://github.com/libjxl/libjxl/pull/1452, we are observing some unexpected failures in tests with the highway target emu128 on 32 bit architectures, for example

[----------] 1 test from HwyMulTestGroup/HwyMulTest
[ RUN      ] HwyMulTestGroup/HwyMulTest.TestAllMulHigh/Emu128
i16x8 expect [0+ ->]:
  0x3FFF,0x0FFF,0x03FF,0x00FF,0x003F,0x000F,0x0003,
i16x8 actual [0+ ->]:
  0xBFFF,0x0FFF,0xE400,0x00FF,0xF840,0x000F,0xFE04,
Abort at ./third_party/highway/hwy/tests/mul_test.cc:131: Emu128, i16x8 lane 0 mismatch: expected '0x3FFF', got '0xBFFF'.

see here: https://github.com/libjxl/libjxl/runs/6916936967?check_suite_focus=true Also see the discussion on https://github.com/libjxl/libjxl/pull/1500.

Two questions:

  1. Shouldn't the tests still pass when setting the tolerance to 0, as done here: https://github.com/libjxl/libjxl/runs/6917084785?check_suite_focus=true?
  2. Is the HwyMulTestGroup/HwyMulTest.TestAllMulHigh/Emu128 above something that needs to be fixed upstream?

mo271 avatar Jun 16 '22 13:06 mo271

Interesting. It seems the libjxl pipelines succeeded at the time of upgrade. I'm curious what caused it to change - perhaps a compiler update?

When running on Linux x86, I see the same expected values, so that's good. Seems to be the implementation of MulHigh. And indeed it's missing, for the int16 case, a cast to int32. Will add that shortly.

jan-wassenberg avatar Jun 16 '22 14:06 jan-wassenberg

Interesting. It seems the libjxl pipelines succeeded at the time of upgrade. I'm curious what caused it to change - perhaps a compiler update?

Actually, "Build/Test / Windows MSYS2 / i686 (push)" has been failing ever since the upgrade.

Thanks for the quick fix!

mo271 avatar Jun 17 '22 06:06 mo271

You're welcome :)

Here's the "Windows MSYS2 / i686" I looked at: https://github.com/libjxl/libjxl/runs/6626839051?check_suite_focus=true

But there are indeed failing tests: https://github.com/libjxl/libjxl/runs/6919658022?check_suite_focus=true

It seems the "passing" tests were build-only, not actually running those tests?

jan-wassenberg avatar Jun 17 '22 07:06 jan-wassenberg

It seems the "passing" tests were build-only, not actually running those tests?

Yes, exactly.

mo271 avatar Jun 17 '22 07:06 mo271

There are also asan failures with emu128:

Note: Google Test filter = ConvolveTestGroup/ConvolveTest.TestConvolve/Emu128
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ConvolveTestGroup/ConvolveTest
[ RUN      ] ConvolveTestGroup/ConvolveTest.TestConvolve/Emu128
AddressSanitizer:DEADLYSIGNAL
=================================================================
==19929==ERROR: AddressSanitizer: SEGV on unknown address 0x7f04a7eb344c (pc 0x5603b6203895 bp 0x000000001200 sp 0x7f00ee634d60 T4)
==19929==The signal is caused by a READ memory access.
    #0 0x5603b6203894 in __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator64<__asan::AP64>, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64> >, __sanitizer::LargeMmapAllocator<__asan::AsanMapUnmapCallback, __sanitizer::LargeMmapAllocatorPtrArrayDynamic> >::Allocate(__sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64> >*, unsigned long, unsigned long) (/home/runner/work/libjxl/libjxl/build/lib/tests/convolve_test+0xbf894)
    #1 0x5603b61ff053 in __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) (/home/runner/work/libjxl/libjxl/build/lib/tests/convolve_test+0xbb053)
    #2 0x5603b61fedc9 in __asan::asan_malloc(unsigned long, __sanitizer::BufferedStackTrace*) (/home/runner/work/libjxl/libjxl/build/lib/tests/convolve_test+0xbadc9)
    #3 0x5603b62a5071 in __interceptor_malloc (/home/runner/work/libjxl/libjxl/build/lib/tests/convolve_test+0x161071)
    #4 0x5603b62fbc34 in jxl::CacheAligned::Allocate(unsigned long, unsigned long) /home/runner/work/libjxl/libjxl/lib/jxl/base/cache_aligned.cc:87:21
    #5 0x5603b6e93fcc in jxl::CacheAligned::Allocate(unsigned long) /home/runner/work/libjxl/libjxl/lib/jxl/base/cache_aligned.h:43:12
    #6 0x5603b6e93fcc in jxl::AllocateArray(unsigned long) /home/runner/work/libjxl/libjxl/lib/jxl/base/cache_aligned.h:61
    #7 0x5603b6e93fcc in jxl::PlaneBase::PlaneBase(unsigned long, unsigned long, unsigned long) /home/runner/work/libjxl/libjxl/lib/jxl/image.cc:93

Not entirely sure if this is related, but it also started happening consistently after the highway upgrade. https://github.com/libjxl/libjxl/runs/6932368369?check_suite_focus=true

mo271 avatar Jun 17 '22 09:06 mo271

Interesting. It seems the libjxl pipelines succeeded at the time of upgrade. I'm curious what caused it to change - perhaps a compiler update?

When running on Linux x86, I see the same expected values, so that's good. Seems to be the implementation of MulHigh. And indeed it's missing, for the int16 case, a cast to int32. Will add that shortly.

Not sure if the MulHigh implementation is really relevant to this failure.

At least it seems like it didn't fix the problem, with 7a906ec139625de5ea3866603269cde82bdc761e, we still get the same failure:

Running main() from C:/M/mingw-w64-gtest/src/googletest-release-1.11.0/googletest/src/gtest_main.cc
Note: Google Test filter = HwyMulTestGroup/HwyMulTest.TestAllMulHigh/Emu128
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from HwyMulTestGroup/HwyMulTest
[ RUN      ] HwyMulTestGroup/HwyMulTest.TestAllMulHigh/Emu128
i16x8 expect [0+ ->]:
  0x3FFF,0x0FFF,0x03FF,0x00FF,0x003F,0x000F,0x0003,
i16x8 actual [0+ ->]:
  0xBFFF,0x0FFF,0xE400,0x00FF,0xF840,0x000F,0xFE04,

mo271 avatar Jun 20 '22 14:06 mo271

hm, it might indeed be a compiler issue. I see both convolve_test and mul_test working with ci.sh asan (clang 13). What's running on the builder is clang-7. I see two options: updating the compiler on the builder, or opting out of EMU128 (and reverting to HWY_SCALAR) on certain platforms - any preference?

jan-wassenberg avatar Jun 20 '22 15:06 jan-wassenberg

hm, it might indeed be a compiler issue. I see both convolve_test and mul_test working with ci.sh asan (clang 13). What's running on the builder is clang-7. I see two options: updating the compiler on the builder, or opting out of EMU128 (and reverting to HWY_SCALAR) on certain platforms - any preference?

On the failing "Windows MSYS2 / i686" test, the compiler seems to be mingw-w64-i686-gcc-12.1.0-2. Perhaps it is a bug in gcc 12.1.0-2? I will try with clang instead and also try reverting to HWY_SCALAR on that platform.

mo271 avatar Jun 21 '22 08:06 mo271

SG. FYI I've filed a bug for an unrelated GCC 12 issue - build timeout for Arm NEON.

jan-wassenberg avatar Jun 21 '22 08:06 jan-wassenberg

I believe I can reproduce the issue on Linux/i686/gcc-12:

  • https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106322

malaterre avatar Jul 16 '22 12:07 malaterre

Something that is odd is that armel (32bits) is also affects (exact same symptoms), compare:

  • https://buildd.debian.org/status/fetch.php?pkg=highway&arch=i386&ver=0.17.1%7Egit20220711.f0a396a-1&stamp=1657558093&raw=0

and

  • https://buildd.debian.org/status/fetch.php?pkg=highway&arch=armel&ver=0.17.1%7Egit20220711.f0a396a-1&stamp=1657559820&raw=0

malaterre avatar Jul 16 '22 12:07 malaterre

This has been fixed on GCC 12.x branch. I suspect we can close this bug:

  • https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106322#c48

malaterre avatar Aug 25 '22 07:08 malaterre

Thanks for following up 👍 Should we mark EMU128 as broken on GCC<12?

jan-wassenberg avatar Aug 29 '22 10:08 jan-wassenberg

Thanks for following up 👍 Should we mark EMU128 as broken on GCC<12?

If you handle minor version, it should read as GCC<12.3

malaterre avatar Aug 29 '22 11:08 malaterre

Thanks, changing to GCC 12.3 :)

jan-wassenberg avatar Aug 29 '22 13:08 jan-wassenberg