Always take BLOSC_STRICT_ALIGN path
Unaligned access is undefined behaviour, even on x86. UBSan will also detect this (-fsanitize=undefined or -fsanitize=alignment).
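For instance, a minimal reproducer (mine, not taken from the pandas trace), built with -fsanitize=alignment:

#include <stdint.h>
#include <stdio.h>

int main(void) {
  _Alignas(uint32_t) unsigned char buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  /* buf + 1 is misaligned for uint32_t, so this load is undefined
     behaviour: UBSan reports it at runtime, and strict-alignment
     hardware can raise SIGBUS. */
  uint32_t v = *(const uint32_t *)(buf + 1);
  printf("%u\n", v);
  return 0;
}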
On ARM, I hit SIGBUSes in pandas' test suite via pandas -> pytables -> c-blosc2 because of unaligned loads and stores.
Modern compilers are capable of optimising the "slow" path bitshifts and memcpy into faster alternatives where it is legal.
Bug: https://bugs.gentoo.org/911660
Bug: https://github.com/pandas-dev/pandas/issues/54391
Bug: https://github.com/pandas-dev/pandas/issues/54396
cc @mgorny @maskray @soapgentoo
Thanks! This gets rid of the conditions that are very difficult to get right:
/*
* Detect if the architecture is fine with unaligned access.
*/
#if !defined(BLOSC_STRICT_ALIGN)
#define BLOSC_STRICT_ALIGN
#if defined(__i386__) || defined(__386) || defined (__amd64) /* GNU C, Sun Studio */
#undef BLOSC_STRICT_ALIGN
#elif defined(__i486__) || defined(__i586__) || defined(__i686__) /* GNU C */
#undef BLOSC_STRICT_ALIGN
#elif defined(_M_IX86) || defined(_M_X64) /* Intel, MSVC */
#undef BLOSC_STRICT_ALIGN
#elif defined(__386)
#undef BLOSC_STRICT_ALIGN
#elif defined(_X86_) /* MinGW */
#undef BLOSC_STRICT_ALIGN
#elif defined(__I86__) /* Digital Mars */
#undef BLOSC_STRICT_ALIGN
/* Modern ARM systems (like ARM64) should support unaligned access
quite efficiently. */
#elif defined(__ARM_FEATURE_UNALIGNED) /* ARM, GNU C */
#undef BLOSC_STRICT_ALIGN
#elif defined(_ARCH_PPC) || defined(__PPC__)
/* Modern PowerPC systems (like POWER8) should support unaligned access
quite efficiently. */
#undef BLOSC_STRICT_ALIGN
#endif
#endif
https://github.com/riscv-non-isa/riscv-c-api-doc/pull/49 says "slow misaligned accesses may be very slow". The long list would also cause continued problems with new architectures.
This is the only right way; trying to optimise for unaligned access is fraught with peril.
Modern compilers are capable of optimising the "slow" path bitshifts and memcpy into faster alternatives where it is legal.
Apparently this is not the case, at least on macOS with M1 and clang 14.0.3 (and I remember similar results with Intel/Linux). Using the bench suite on the M1 with unaligned access (I did several runs and kept the best result):
$ bench/b2bench blosclz shuffle single 8 8388608 4 19
<snip>
Round-trip compr/decompr on 2.5 GB
Elapsed time: 0.3 s, 17036.6 MB/s
and with your PR (same, did several runs and kept the best result):
$ bench/b2bench blosclz shuffle single 8 8388608 4 19
<snip>
Round-trip compr/decompr on 2.5 GB
Elapsed time: 0.5 s, 12415.2 MB/s
So, unaligned access is still 1.37x faster.
Using this:
#!/bin/bash
set -e -u
sudo cpupower frequency-set --governor performance
build_dir() {
branch=$1
builddir=$2
git checkout ${branch}
rm -rf ${builddir}
cmake -B ${builddir} -G Ninja && ninja -C ${builddir}
}
build_dir main build-gcc-generic-vanilla
CFLAGS="-O3" CXXFLAGS="-O3" build_dir main build-gcc-O3-vanilla
CFLAGS="-O3 -march=znver3" CXXFLAGS="-O3 -march=znver3" build_dir main build-gcc-znver2-vanilla
CC=clang-16 CXX=clang++-16 build_dir main build-clang-generic-vanilla
CC=clang-16 CXX=clang++-16 CFLAGS="-O3" CXXFLAGS="-O3" build_dir main build-clang-O3-vanilla
CC=clang-16 CXX=clang++-16 CFLAGS="-O3 -march=znver3" CXXFLAGS="-O3 -march=znver3" build_dir main build-clang-znver2-vanilla
build_dir no-unaligned-access-ub build-gcc-generic-new
CFLAGS="-O3" CXXFLAGS="-O3" build_dir main build-gcc-O3-new
CFLAGS="-O3 -march=znver3" CXXFLAGS="-O3 -march=znver3" build_dir no-unaligned-access-ub build-gcc-znver2-new
CC=clang-16 CXX=clang++-16 build_dir no-unaligned-access-ub build-clang-generic-new
CC=clang-16 CXX=clang++-16 CFLAGS="-O3" CXXFLAGS="-O3" build_dir main build-clang-O3-new
CC=clang-16 CXX=clang++-16 CFLAGS="-O3 -march=znver3" CXXFLAGS="-O3 -march=znver3" build_dir no-unaligned-access-ub build-clang-znver2-new
# GCC (default)
hyperfine --warmup 5 --min-runs 100 \
--prepare 'sync ; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
"build-gcc-generic-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
"build-gcc-generic-new/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
# GCC (-O3)
hyperfine --warmup 5 --min-runs 100 \
--prepare 'sync ; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
"build-gcc-O3-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
"build-gcc-O3-new/bench/b2bench blosclz shuffle single 8 8388608 4 19"
# Clang (default)
hyperfine --warmup 5 --min-runs 100 \
--prepare 'sync ; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
"build-clang-generic-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
"build-clang-generic-new/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
# Clang (-O3)
hyperfine --warmup 5 --min-runs 100 \
--prepare 'sync ; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
"build-clang-O3-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
"build-clang-O3-new/bench/b2bench blosclz shuffle single 8 8388608 4 19"
# GCC (-O3 -march=znver2)
hyperfine --warmup 5 --min-runs 100 \
--prepare 'sync ; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
"build-gcc-znver2-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
"build-gcc-znver2-new/bench/b2bench blosclz shuffle single 8 8388608 4 19"
# Clang (-O3 -march=znver2)
hyperfine --warmup 5 --min-runs 100 \
--prepare 'sync ; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
"build-clang-znver2-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
"build-clang-znver2-new/bench/b2bench blosclz shuffle single 8 8388608 4 19"
sudo cpupower frequency-set --governor schedutil
On Linux 6.4.11 with AMD Zen 2 (3950X) on tmpfs, with gcc 13.2.1 20230819 and clang 16.0.6:
Benchmark 1: build-gcc-generic-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19
Time (mean ± σ): 421.0 ms ± 4.0 ms [User: 1124.5 ms, System: 240.3 ms]
Range (min … max): 412.7 ms … 430.3 ms 100 runs
Benchmark 2: build-gcc-generic-new/bench/b2bench blosclz shuffle single 8 8388608 4 19
Time (mean ± σ): 425.0 ms ± 3.7 ms [User: 1160.9 ms, System: 242.4 ms]
Range (min … max): 417.4 ms … 436.3 ms 100 runs
Summary
'build-gcc-generic-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19' ran
1.01 ± 0.01 times faster than 'build-gcc-generic-new/bench/b2bench blosclz shuffle single 8 8388608 4 19'
Benchmark 1: build-gcc-O3-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19
Time (mean ± σ): 421.7 ms ± 4.3 ms [User: 1121.6 ms, System: 243.5 ms]
Range (min … max): 413.0 ms … 430.0 ms 100 runs
Benchmark 2: build-gcc-O3-new/bench/b2bench blosclz shuffle single 8 8388608 4 19
Time (mean ± σ): 423.3 ms ± 4.0 ms [User: 1122.0 ms, System: 244.7 ms]
Range (min … max): 413.6 ms … 440.0 ms 100 runs
Summary
'build-gcc-O3-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19' ran
1.00 ± 0.01 times faster than 'build-gcc-O3-new/bench/b2bench blosclz shuffle single 8 8388608 4 19'
Benchmark 1: build-clang-generic-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19
Time (mean ± σ): 420.3 ms ± 3.9 ms [User: 1104.9 ms, System: 240.5 ms]
Range (min … max): 411.3 ms … 426.0 ms 100 runs
Benchmark 2: build-clang-generic-new/bench/b2bench blosclz shuffle single 8 8388608 4 19
Time (mean ± σ): 415.3 ms ± 4.3 ms [User: 1065.3 ms, System: 245.5 ms]
Range (min … max): 406.8 ms … 422.7 ms 100 runs
Summary
'build-clang-generic-new/bench/b2bench blosclz shuffle single 8 8388608 4 19' ran
1.01 ± 0.01 times faster than 'build-clang-generic-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19'
So, on Linux/AMD, it's within noise.
On macOS 13.5, Intel i9-9880H, Apple clang version 14.0.3 (clang-1403.0.22.14.1), the variation is terrible (the results are very inconsistent, swinging in both directions), and I suspect it's because of throttling. I've asked someone if I can try on their ARM Mac mini.
Using this: ... So, on Linux/AMD, it's within noise.
This is interesting. With Linux/Intel I still see a difference (see later).
On macOS 13.5, Intel i9-9880H, Apple clang version 14.0.3 (clang-1403.0.22.14.1), the variation is terrible (the results are very inconsistent, swinging in both directions), and I suspect it's because of throttling. I've asked someone if I can try on their ARM Mac mini.
What could be happening here is that, with threading, throttling affects the measurements a lot. For this benchmark, I think it is better to restrict oneself to a single thread. With that, on macOS 13.4.1, M1 and clang 14.0.3, with unaligned reads:
$ bench/b2bench blosclz shuffle single 1 8388608 4 19
<snip>
Round-trip compr/decompr on 2.5 GB
Elapsed time: 0.8 s, 6682.5 MB/s
Whereas with the current version of this PR:
$ bench/b2bench blosclz shuffle single 1 8388608 4 19
<snip>
Round-trip compr/decompr on 2.5 GB
Elapsed time: 1.3 s, 4463.2 MB/s
So in this case, unaligned reads make the compression operation 1.5x faster.
On an Intel i9-13900K with Linux 6.2.0 (Ubuntu 23.04) and gcc 12.3.0, with unaligned reads:
$ bench/b2bench blosclz shuffle single 1 8388608 4 19
<snip>
Round-trip compr/decompr on 2.5 GB
Elapsed time: 0.6 s, 9758.3 MB/s
With the current version of this PR:
$ bench/b2bench blosclz shuffle single 1 8388608 4 19
<snip>
Round-trip compr/decompr on 2.5 GB
Elapsed time: 0.6 s, 9064.7 MB/s
In this case the speed-up is smaller, but unaligned reads still produce a non-negligible improvement in compression speed (1.07x).
memcpy(dst, src, 4); has been optimized by GCC for 10+ years (check GCC 4.1 from 2007 on godbolt), if not 20+/30+ years.
(I think MSVC should have had the optimization for 10+ years as well.)
#define NDLZ_READU32(p) ((p)[0] | (p)[1]<<8 | (p)[2]<<16 | (p)[3]<<24) can be optimized as well.
void foo(const unsigned char *s, unsigned *d) {
*d = s[0]|s[1]<<8|s[2]<<16|s[3]<<24;
}
but it may not be optimized as well by GCC or Clang in some cases.
I can reproduce the bench/b2bench blosclz shuffle single 1 8388608 4 19 slowdown with this patch, but only with GCC, not Clang.
I created two build directories, for main (out/gcc0) and no-unaligned-access-ub (out/gcc1).
% cat =cmp-objdump
#!/bin/zsh
nvim -d =(/tmp/Rel/bin/llvm-objdump -d --symbolize-operands -M intel --no-leading-addr --no-show-raw-insn $1 --disassemble-symbols=$(</tmp/0)) =(/tmp/Rel/bin/llvm-objdump -d --symbolize-operands -M intel --no-leading-addr --no-show-raw-insn $2 --disassemble-symbols=$(</tmp/0))
% cmp-objdump out/gcc[01]/bench/b2bench
out/gcc1/bench/b2bench has a larger blosclz_compress because GCC does not optimize #define NDLZ_READU32(p) ((p)[0] | (p)[1]<<8 | (p)[2]<<16 | (p)[3]<<24):
movzx esi, byte ptr [rax - 0x2]
movzx edi, byte ptr [rax - 0x1]
sub rdx, rbx
add r9d, 0x1
xor r8d, r8d
shl edi, 0x10
shl esi, 0x8
or esi, edi
movzx edi, byte ptr [rax - 0x3]
or esi, edi
movzx edi, byte ptr [rax]
shl edi, 0x18
or esi, edi
imul esi, esi, 0x9e3779b1
shr esi, cl
I suggest that we replace NDLZ_READU32 with a function that loads the unaligned bytes with memcpy.
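For instance, a minimal sketch of such a reader (the name ndlz_readu32 is mine; note that NDLZ_READU32 yields a little-endian value regardless of host byte order, so a big-endian target would additionally need a byte swap):

#include <stdint.h>
#include <string.h>

/* Hypothetical memcpy-based replacement for the NDLZ_READU32 macro.
   On little-endian targets, GCC and Clang fold this into a single
   32-bit load (one mov on x86-64), instead of the byte-by-byte
   sequence shown above. */
static inline uint32_t ndlz_readu32(const void *p) {
  uint32_t v;
  memcpy(&v, p, sizeof(v));
  return v;
}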
I'll play around with this some more during the week and let you know if/when it needs re-review. Thanks!
If it helps, splitting this into two patches may be feasible.
One is to replace things like *(uint64_t*)out = *(uint64_t*)from; with memcpy. This is always fine for optimization and usually even leads to identical output files.
The other is about #define NDLZ_READU32(p) ((p)[0] | (p)[1]<<8 | (p)[2]<<16 | (p)[3]<<24) and/or removing the BLOSC_STRICT_ALIGN setting.
The idea is that the first patch is entirely safe and can be merged immediately, while the second needs performance evaluation.
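As an illustration of the first patch, the before/after looks like this (function and variable names are mine):

#include <stdint.h>
#include <string.h>

/* Before: undefined behaviour unless out and from are both
   8-byte aligned. */
static void copy8_cast(unsigned char *out, const unsigned char *from) {
  *(uint64_t *)out = *(const uint64_t *)from;
}

/* After: well-defined for any alignment; GCC and Clang emit the same
   single 8-byte load and store where the target permits it. */
static void copy8_memcpy(unsigned char *out, const unsigned char *from) {
  memcpy(out, from, sizeof(uint64_t));
}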
Any news here?
Closing due to inactivity
Yeah, sorry I haven't come back to this yet. I'll try to come back to it but I've not had a chance. Thanks.