
Always take BLOSC_STRICT_ALIGN path

Open thesamesam opened this issue 2 years ago • 11 comments

Unaligned access is undefined behaviour (UB), even on x86. UBSan will also detect this (-fsanitize=undefined or -fsanitize=alignment).

On ARM, I hit SIGBUSes in the pandas test suite via pandas -> PyTables -> c-blosc2 because of unaligned loads and stores.

Modern compilers are capable of optimising the "slow" path bitshifts and memcpy into faster alternatives where it is legal.
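For illustration only (this is not c-blosc2 code, and the function names are made up): the cast-based read below is the UB that -fsanitize=alignment flags, while the memcpy form is well-defined and compiles to the same single unaligned load on x86-64 and ARM64 with GCC/Clang at -O2:

#include <stdint.h>
#include <string.h>

/* UB if p is not 4-byte aligned; UBSan's alignment check flags this. */
uint32_t load_cast(const unsigned char *p) {
  return *(const uint32_t *)p;
}

/* Well-defined for any p; modern compilers emit a single unaligned load
   where the target supports it, and byte loads plus shifts otherwise. */
uint32_t load_memcpy(const unsigned char *p) {
  uint32_t v;
  memcpy(&v, p, sizeof(v));
  return v;
}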

Bug: https://bugs.gentoo.org/911660 Bug: https://github.com/pandas-dev/pandas/issues/54391 Bug: https://github.com/pandas-dev/pandas/issues/54396

thesamesam avatar Aug 21 '23 07:08 thesamesam

cc @mgorny @maskray @soapgentoo

thesamesam avatar Aug 21 '23 07:08 thesamesam

Thanks! This gets rid of the conditions that are very difficult to get right:

/*
 * Detect if the architecture is fine with unaligned access.
 */
#if !defined(BLOSC_STRICT_ALIGN)
#define BLOSC_STRICT_ALIGN
#if defined(__i386__) || defined(__386) || defined (__amd64)  /* GNU C, Sun Studio */
#undef BLOSC_STRICT_ALIGN
#elif defined(__i486__) || defined(__i586__) || defined(__i686__)  /* GNU C */
#undef BLOSC_STRICT_ALIGN
#elif defined(_M_IX86) || defined(_M_X64)   /* Intel, MSVC */
#undef BLOSC_STRICT_ALIGN
#elif defined(__386)
#undef BLOSC_STRICT_ALIGN
#elif defined(_X86_) /* MinGW */
#undef BLOSC_STRICT_ALIGN
#elif defined(__I86__) /* Digital Mars */
#undef BLOSC_STRICT_ALIGN
/* Modern ARM systems (like ARM64) should support unaligned access
   quite efficiently. */
#elif defined(__ARM_FEATURE_UNALIGNED)   /* ARM, GNU C */
#undef BLOSC_STRICT_ALIGN
#elif defined(_ARCH_PPC) || defined(__PPC__)
/* Modern PowerPC systems (like POWER8) should support unaligned access
   quite efficiently. */
#undef BLOSC_STRICT_ALIGN
#endif
#endif

https://github.com/riscv-non-isa/riscv-c-api-doc/pull/49 says "slow misaligned accesses may be very slow". The long list would also keep causing problems as new architectures appear.

MaskRay avatar Aug 21 '23 07:08 MaskRay

This is the only right way; trying to optimise for unaligned access is fraught with peril.

SoapGentoo avatar Aug 21 '23 07:08 SoapGentoo

Modern compilers are capable of optimising the "slow" path bitshifts and memcpy into faster alternatives where it is legal.

Apparently this is not the case, at least on macOS with an M1 and clang 14.0.3 (and I remember similar results on Intel/Linux). Using the bench suite on macOS/M1 with unaligned access (I did several runs and kept the best result):

$  bench/b2bench blosclz shuffle single 8 8388608 4 19
<snip>
Round-trip compr/decompr on 2.5 GB
Elapsed time:	    0.3 s, 17036.6 MB/s

and with your PR (same, did several runs and kept the best result):

$  bench/b2bench blosclz shuffle single 8 8388608 4 19
<snip>
Round-trip compr/decompr on 2.5 GB
Elapsed time:	    0.5 s, 12415.2 MB/s

So, unaligned access is still 1.37x faster.

FrancescAlted avatar Aug 21 '23 08:08 FrancescAlted

Using this:

#!/bin/bash
set -e -u

sudo cpupower frequency-set --governor performance

build_dir() {
	branch=$1
	builddir=$2

	git checkout ${branch}
	rm -rf ${builddir}
	cmake -B ${builddir} -G Ninja && ninja -C ${builddir}
}

build_dir main build-gcc-generic-vanilla
CFLAGS="-O3" CXXFLAGS="-O3" build_dir main build-gcc-O3-vanilla
CFLAGS="-O3 -march=znver3" CXXFLAGS="-O3 -march=znver3" build_dir main build-gcc-znver2-vanilla
CC=clang-16 CXX=clang++-16 build_dir main build-clang-generic-vanilla
CC=clang-16 CXX=clang++-16 CFLAGS="-O3" CXXFLAGS="-O3" build_dir main build-clang-O3-vanilla
CC=clang-16 CXX=clang++-16 CFLAGS="-O3 -march=znver3" CXXFLAGS="-O3 -march=znver3" build_dir main build-clang-znver2-vanilla

build_dir no-unaligned-access-ub build-gcc-generic-new
CFLAGS="-O3" CXXFLAGS="-O3" build_dir main build-gcc-O3-new
CFLAGS="-O3 -march=znver3" CXXFLAGS="-O3 -march=znver3" build_dir no-unaligned-access-ub build-gcc-znver2-new
CC=clang-16 CXX=clang++-16 build_dir no-unaligned-access-ub build-clang-generic-new
CC=clang-16 CXX=clang++-16 CFLAGS="-O3" CXXFLAGS="-O3" build_dir main build-clang-O3-new
CC=clang-16 CXX=clang++-16 CFLAGS="-O3 -march=znver3" CXXFLAGS="-O3 -march=znver3" build_dir no-unaligned-access-ub build-clang-znver2-new

# GCC (default)
hyperfine --warmup 5 --min-runs 100 \
	--prepare 'sync ; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
	"build-gcc-generic-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
	"build-gcc-generic-new/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
# GCC (-O3)
hyperfine --warmup 5 --min-runs 100 \
	--prepare 'sync ; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
	"build-gcc-O3-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
	"build-gcc-O3-new/bench/b2bench blosclz shuffle single 8 8388608 4 19"

# Clang (default)
hyperfine --warmup 5 --min-runs 100 \
	--prepare 'sync ; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
	"build-clang-generic-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
	"build-clang-generic-new/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
# Clang (-O3)
hyperfine --warmup 5 --min-runs 100 \
	--prepare 'sync ; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
	"build-clang-O3-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
	"build-clang-O3-new/bench/b2bench blosclz shuffle single 8 8388608 4 19"

# GCC (-O3 -march=znver2)
hyperfine --warmup 5 --min-runs 100 \
	--prepare 'sync ; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
	"build-gcc-znver2-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
	"build-gcc-znver2-new/bench/b2bench blosclz shuffle single 8 8388608 4 19"

# Clang (-O3 -march=znver2)
hyperfine --warmup 5 --min-runs 100 \
	--prepare 'sync ; echo 3 | sudo tee /proc/sys/vm/drop_caches' \
        "build-clang-znver2-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19" \
        "build-clang-znver2-new/bench/b2bench blosclz shuffle single 8 8388608 4 19"

sudo cpupower frequency-set --governor schedutil

On Linux 6.4.11 with AMD Zen 2 (3950x) on tmpfs with gcc 13.2.1 20230819 and clang 16.0.6:

Benchmark 1: build-gcc-generic-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19
  Time (mean ± σ):     421.0 ms ±   4.0 ms    [User: 1124.5 ms, System: 240.3 ms]
  Range (min … max):   412.7 ms … 430.3 ms    100 runs

Benchmark 2: build-gcc-generic-new/bench/b2bench blosclz shuffle single 8 8388608 4 19
  Time (mean ± σ):     425.0 ms ±   3.7 ms    [User: 1160.9 ms, System: 242.4 ms]
  Range (min … max):   417.4 ms … 436.3 ms    100 runs

Summary
  'build-gcc-generic-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19' ran
    1.01 ± 0.01 times faster than 'build-gcc-generic-new/bench/b2bench blosclz shuffle single 8 8388608 4 19'
Benchmark 1: build-gcc-O3-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19
  Time (mean ± σ):     421.7 ms ±   4.3 ms    [User: 1121.6 ms, System: 243.5 ms]
  Range (min … max):   413.0 ms … 430.0 ms    100 runs

Benchmark 2: build-gcc-O3-new/bench/b2bench blosclz shuffle single 8 8388608 4 19
  Time (mean ± σ):     423.3 ms ±   4.0 ms    [User: 1122.0 ms, System: 244.7 ms]
  Range (min … max):   413.6 ms … 440.0 ms    100 runs

Summary
  'build-gcc-O3-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19' ran
    1.00 ± 0.01 times faster than 'build-gcc-O3-new/bench/b2bench blosclz shuffle single 8 8388608 4 19'
Benchmark 1: build-clang-generic-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19
  Time (mean ± σ):     420.3 ms ±   3.9 ms    [User: 1104.9 ms, System: 240.5 ms]
  Range (min … max):   411.3 ms … 426.0 ms    100 runs

Benchmark 2: build-clang-generic-new/bench/b2bench blosclz shuffle single 8 8388608 4 19
  Time (mean ± σ):     415.3 ms ±   4.3 ms    [User: 1065.3 ms, System: 245.5 ms]
  Range (min … max):   406.8 ms … 422.7 ms    100 runs

Summary
  'build-clang-generic-new/bench/b2bench blosclz shuffle single 8 8388608 4 19' ran
    1.01 ± 0.01 times faster than 'build-clang-generic-vanilla/bench/b2bench blosclz shuffle single 8 8388608 4 19'

So, on Linux/AMD, it's within noise.

On macOS 13.5, Intel i9-9880H, Apple clang version 14.0.3 (clang-1403.0.22.14.1), the run-to-run variation is terrible (and the result is very inconsistent, swinging in both directions); I suspect it's because of throttling. I've asked someone if I can try on their ARM Mac mini.

thesamesam avatar Aug 21 '23 10:08 thesamesam

Using this: ... So, on Linux/AMD, it's within noise.

This is interesting. With Linux/Intel I still see a difference (see later).

On macOS 13.5, Intel i9-9880H, Apple clang version 14.0.3 (clang-1403.0.22.14.1), the variation is terrible (and the result is very inconsistent, swinging in both directions) and I suspect it's because of throttling. I've asked someone if I can try on their arm mac mini.

What could be happening here is that, with threading, throttling affects the measurements a lot. For this benchmark, I think it is better to restrict oneself to a single thread. With that, on macOS 13.4.1, M1 and clang 14.0.3, and unaligned reads:

$ bench/b2bench blosclz shuffle single 1 8388608 4 19
<snip>
Round-trip compr/decompr on 2.5 GB
Elapsed time:	    0.8 s, 6682.5 MB/s

Whereas with the current version of this PR:

$ bench/b2bench blosclz shuffle single 1 8388608 4 19
<snip>
Round-trip compr/decompr on 2.5 GB
Elapsed time:	    1.3 s, 4463.2 MB/s

So in this case, unaligned reads make the compression operation 1.5x faster.

For an Intel i9-13900K / Linux 6.2.0 (Ubuntu 23.04), gcc 12.3.0 and unaligned reads:

$ bench/b2bench blosclz shuffle single 1 8388608 4 19
<snip>
Round-trip compr/decompr on 2.5 GB
Elapsed time:	    0.6 s, 9758.3 MB/s

With the current version of this PR:

$ bench/b2bench blosclz shuffle single 1 8388608 4 19
<snip>
Round-trip compr/decompr on 2.5 GB
Elapsed time:	    0.6 s, 9064.7 MB/s

In this case the speed-up is smaller, but unaligned reads still produce a non-negligible improvement in the compression operation (1.07x).

FrancescAlted avatar Aug 21 '23 17:08 FrancescAlted

memcpy(dst, src, 4); has been optimized by GCC for 10+ years (check GCC 4.1 from 2007 on godbolt), if not 20+/30+ years. (I think MSVC has had this optimization for 10+ years as well.)

#define NDLZ_READU32(p) ((p)[0] | (p)[1]<<8 | (p)[2]<<16 | (p)[3]<<24) can be optimized as well.

void foo(const unsigned char *s, unsigned *d) {
    *d = s[0]|s[1]<<8|s[2]<<16|s[3]<<24;
}

but it may not be optimized so well by GCC or Clang in some cases.

MaskRay avatar Aug 21 '23 19:08 MaskRay

I can reproduce the bench/b2bench blosclz shuffle single 1 8388608 4 19 slowdown with this patch, but only with GCC, not Clang.

I create two build directories for main (out/gcc0) and no-unaligned-access-ub (out/gcc1).

% cat =cmp-objdump
#!/bin/zsh
nvim -d =(/tmp/Rel/bin/llvm-objdump -d --symbolize-operands -M intel --no-leading-addr --no-show-raw-insn $1 --disassemble-symbols=$(</tmp/0)) =(/tmp/Rel/bin/llvm-objdump -d --symbolize-operands -M intel --no-leading-addr --no-show-raw-insn $2 --disassemble-symbols=$(</tmp/0))
% cmp-objdump out/gcc[01]/bench/b2bench

out/gcc1/bench/b2bench has a larger blosclz_compress because GCC does not optimize #define NDLZ_READU32(p) ((p)[0] | (p)[1]<<8 | (p)[2]<<16 | (p)[3]<<24):

                movzx   esi, byte ptr [rax - 0x2]
                movzx   edi, byte ptr [rax - 0x1]
                sub     rdx, rbx
                add     r9d, 0x1
                xor     r8d, r8d
                shl     edi, 0x10
                shl     esi, 0x8
                or      esi, edi
                movzx   edi, byte ptr [rax - 0x3]
                or      esi, edi
                movzx   edi, byte ptr [rax]
                shl     edi, 0x18
                or      esi, edi
                imul    esi, esi, 0x9e3779b1
                shr     esi, cl

I suggest that we replace NDLZ_READU32 with a function that loads the unaligned bytes with memcpy.
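A minimal sketch of what such a helper could look like (the name ndlz_readu32 is hypothetical; memcpy yields a native-endian value, so on a big-endian target a byte swap would be needed to preserve the macro's little-endian semantics):

#include <stdint.h>
#include <string.h>

/* Load 4 bytes from a possibly unaligned pointer without UB; GCC, Clang
   and MSVC compile this down to a single 32-bit load on x86 and ARM64. */
static inline uint32_t ndlz_readu32(const void *p) {
  uint32_t v;
  memcpy(&v, p, sizeof(v));
  return v;
}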

MaskRay avatar Aug 21 '23 20:08 MaskRay

I'll play around with this some more during the week and let you know if/when it needs re-review. Thanks!

thesamesam avatar Aug 22 '23 20:08 thesamesam

If it helps, creating two patches may be feasible.

One is to replace things like *(uint64_t*)out = *(uint64_t*)from; with memcpy. This is always fine for optimizations and usually even leads to an identical output file.

The other is about #define NDLZ_READU32(p) ((p)[0] | (p)[1]<<8 | (p)[2]<<16 | (p)[3]<<24) and/or removing the BLOSC_STRICT_ALIGN setting.

The idea is that the first patch is entirely safe and can be merged immediately, while the second needs performance evaluation.
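For the first patch, the transformation is roughly this (a sketch, not the actual diff; copy8 is a made-up helper name):

#include <stdint.h>
#include <string.h>

/* Before: *(uint64_t*)out = *(uint64_t*)from;  -- a type-punned, possibly
   misaligned access (UB). After: same effect, well-defined, and typically
   identical generated code with GCC/Clang/MSVC. */
static inline void copy8(uint8_t *out, const uint8_t *from) {
  memcpy(out, from, sizeof(uint64_t));
}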

MaskRay avatar Aug 22 '23 20:08 MaskRay

Any news here?

FrancescAlted avatar May 15 '24 07:05 FrancescAlted

Closing due to inactivity

FrancescAlted avatar Jun 11 '24 06:06 FrancescAlted

Yeah, sorry I haven't come back to this yet. I'll try to come back to it but I've not had a chance. Thanks.

thesamesam avatar Jun 11 '24 22:06 thesamesam