highway
highway copied to clipboard
unexpected emu128 failures
After a recent highway upgrade https://github.com/libjxl/libjxl/pull/1452, we are observing some unexpected failures in tests with the highway target emu128 on 32 bit architectures, for example
[----------] 1 test from HwyMulTestGroup/HwyMulTest
[ RUN ] HwyMulTestGroup/HwyMulTest.TestAllMulHigh/Emu128
i16x8 expect [0+ ->]:
0x3FFF,0x0FFF,0x03FF,0x00FF,0x003F,0x000F,0x0003,
i16x8 actual [0+ ->]:
0xBFFF,0x0FFF,0xE400,0x00FF,0xF840,0x000F,0xFE04,
Abort at ./third_party/highway/hwy/tests/mul_test.cc:131: Emu128, i16x8 lane 0 mismatch: expected '0x3FFF', got '0xBFFF'.
see here: https://github.com/libjxl/libjxl/runs/6916936967?check_suite_focus=true Also see the discussion on https://github.com/libjxl/libjxl/pull/1500.
Two questions:
- Shouldn't the tests still pass when setting the tolerance to
0
, as done here: https://github.com/libjxl/libjxl/runs/6917084785?check_suite_focus=true? - Is the
HwyMulTestGroup/HwyMulTest.TestAllMulHigh/Emu128
above something that needs to be fixed upstream?
Interesting. It seems the libjxl pipelines succeeded at the time of upgrade. I'm curious what caused it to change - perhaps a compiler update?
When running on Linux x86, I see the same expected values, so that's good. Seems to be the implementation of MulHigh. And indeed it's missing, for the int16 case, a cast to int32. Will add that shortly.
Interesting. It seems the libjxl pipelines succeeded at the time of upgrade. I'm curious what caused it to change - perhaps a compiler update?
Actually, "Build/Test / Windows MSYS2 / i686 (push)" has been failing ever since the upgrade.
Thanks for the quick fix!
You're welcome :)
Here's the "Windows MSYS2 / i686" I looked at: https://github.com/libjxl/libjxl/runs/6626839051?check_suite_focus=true
But there are indeed failing tests: https://github.com/libjxl/libjxl/runs/6919658022?check_suite_focus=true
It seems the "passing" tests were build-only, not actually running those tests?
It seems the "passing" tests were build-only, not actually running those tests?
Yes, exactly.
There are also asan failures with emu128:
Note: Google Test filter = ConvolveTestGroup/ConvolveTest.TestConvolve/Emu128
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from ConvolveTestGroup/ConvolveTest
[ RUN ] ConvolveTestGroup/ConvolveTest.TestConvolve/Emu128
AddressSanitizer:DEADLYSIGNAL
=================================================================
==19929==ERROR: AddressSanitizer: SEGV on unknown address 0x7f04a7eb344c (pc 0x5603b6203895 bp 0x000000001200 sp 0x7f00ee634d60 T4)
==19929==The signal is caused by a READ memory access.
#0 0x5603b6203894 in __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator64<__asan::AP64>, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64> >, __sanitizer::LargeMmapAllocator<__asan::AsanMapUnmapCallback, __sanitizer::LargeMmapAllocatorPtrArrayDynamic> >::Allocate(__sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64> >*, unsigned long, unsigned long) (/home/runner/work/libjxl/libjxl/build/lib/tests/convolve_test+0xbf894)
#1 0x5603b61ff053 in __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) (/home/runner/work/libjxl/libjxl/build/lib/tests/convolve_test+0xbb053)
#2 0x5603b61fedc9 in __asan::asan_malloc(unsigned long, __sanitizer::BufferedStackTrace*) (/home/runner/work/libjxl/libjxl/build/lib/tests/convolve_test+0xbadc9)
#3 0x5603b62a5071 in __interceptor_malloc (/home/runner/work/libjxl/libjxl/build/lib/tests/convolve_test+0x161071)
#4 0x5603b62fbc34 in jxl::CacheAligned::Allocate(unsigned long, unsigned long) /home/runner/work/libjxl/libjxl/lib/jxl/base/cache_aligned.cc:87:21
#5 0x5603b6e93fcc in jxl::CacheAligned::Allocate(unsigned long) /home/runner/work/libjxl/libjxl/lib/jxl/base/cache_aligned.h:43:12
#6 0x5603b6e93fcc in jxl::AllocateArray(unsigned long) /home/runner/work/libjxl/libjxl/lib/jxl/base/cache_aligned.h:61
#7 0x5603b6e93fcc in jxl::PlaneBase::PlaneBase(unsigned long, unsigned long, unsigned long) /home/runner/work/libjxl/libjxl/lib/jxl/image.cc:93
Not entirely sure if this is related, but it also started happening consistently after the highway upgrade. https://github.com/libjxl/libjxl/runs/6932368369?check_suite_focus=true
Interesting. It seems the libjxl pipelines succeeded at the time of upgrade. I'm curious what caused it to change - perhaps a compiler update?
When running on Linux x86, I see the same expected values, so that's good. Seems to be the implementation of MulHigh. And indeed it's missing, for the int16 case, a cast to int32. Will add that shortly.
Not sure if the MulHigh implementation is really relevant to this failure.
At least it seems like it didn't fix the problem, with 7a906ec139625de5ea3866603269cde82bdc761e, we still get the same failure:
Running main() from C:/M/mingw-w64-gtest/src/googletest-release-1.11.0/googletest/src/gtest_main.cc
Note: Google Test filter = HwyMulTestGroup/HwyMulTest.TestAllMulHigh/Emu128
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from HwyMulTestGroup/HwyMulTest
[ RUN ] HwyMulTestGroup/HwyMulTest.TestAllMulHigh/Emu128
i16x8 expect [0+ ->]:
0x3FFF,0x0FFF,0x03FF,0x00FF,0x003F,0x000F,0x0003,
i16x8 actual [0+ ->]:
0xBFFF,0x0FFF,0xE400,0x00FF,0xF840,0x000F,0xFE04,
hm, it might indeed be a compiler issue. I see both convolve_test and mul_test working with ci.sh asan (clang 13). What's running on the builder is clang-7. I see two options: updating the compiler on the builder, or opting out of EMU128 (and reverting to HWY_SCALAR) on certain platforms - any preference?
hm, it might indeed be a compiler issue. I see both convolve_test and mul_test working with ci.sh asan (clang 13). What's running on the builder is clang-7. I see two options: updating the compiler on the builder, or opting out of EMU128 (and reverting to HWY_SCALAR) on certain platforms - any preference?
On the failing "Windows MSYS2 / i686" test, the compiler seems to be mingw-w64-i686-gcc-12.1.0-2. Perhaps it is a bug in gcc 12.1.0-2? I will try with clang instead and also try reverting to HWY_SCALAR on that platform.
SG. FYI I've filed a bug for an unrelated GCC 12 issue - build timeout for Arm NEON.
I believe I can reproduce the issue on Linux/i686/gcc-12:
- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106322
Something that is odd is that armel (32bits) is also affects (exact same symptoms), compare:
- https://buildd.debian.org/status/fetch.php?pkg=highway&arch=i386&ver=0.17.1%7Egit20220711.f0a396a-1&stamp=1657558093&raw=0
and
- https://buildd.debian.org/status/fetch.php?pkg=highway&arch=armel&ver=0.17.1%7Egit20220711.f0a396a-1&stamp=1657559820&raw=0
This has been fixed on GCC 12.x branch. I suspect we can close this bug:
- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106322#c48
Thanks for following up 👍 Should we mark EMU128 as broken on GCC<12?
Thanks for following up 👍 Should we mark EMU128 as broken on GCC<12?
If you handle minor version, it should read as GCC<12.3
Thanks, changing to GCC 12.3 :)