zstd Add fast huf_dec with generic C and tuned aarch64 assembly

This includes implementations of a generic C version of fast decode and a tuned 4x1 assembly version for Arm. For silesia, observed 3.9% for sao, ~2% for mozilla/ooffice/osdb/x-ray. As the author of the original algorithm, could you pls help to review this, @terrelln ? Thanks a lot.

Jun 07 '22 07:06 JunHe77

Whats the difference between the C version and the ASM version?

It looks like the gains for the aarch64 version are smaller than the gains for the x86-64 version, which could make sense because aarch64 has way less register pressure, and that was the main constraint on x86-64.

For those smaller gains, I'd be a bit hesitant about merging an ASM implementation. But would be more open to a generic C version.

Jul 29 '22 17:07 terrelln

Thanks for the comments, @terrelln . I understand the maintenance effort for an assembly implementation. Following is the change comparison between C and ASM for silesia at L8. Pls check.

data file	ASM/C on N1	ASM/C on A72
dickens	-0.19%	1.22%
mozilla	1.17%	0.33%
mr	0.21%	-1.38%
nci	0.29%	0.76%
ooffice	1.01%	-0.12%
osdb	0.92%	0.54%
reymont	0.11%	1.27%
samba	0.19%	1.51%
sao	2.32%	1.87%
webster	0.17%	0.42%
x-ray	0.97%	1.42%
xml	-0.21%	0.16%

In 5 of 12 cases, ASm version achieves 1%+ better performance (~2% for sao on both N1 and A72).

Aug 08 '22 10:08 JunHe77

Is the C version the current zstd code, or the fast decode C you've written? If it is the latter, what is the difference between zstd's code and the tuned C?

Aug 16 '22 17:08 terrelln

The C version is a kind of "rewritten asm in C" of huf_decompress_amd64.S. I started the porting on Arm with C, then hand tuned asm version for both 4x1 and 4x2. For C version, both 4x1 and 4x2 showed uplift on Arm. While further boosts are observed on certain test data with 4x1_asm, no obvious boost is found with 4x2_asm. That's why there is only 4x1 asm version in this PR. 😄

Aug 17 '22 01:08 JunHe77

Hi @terrelln, anything I need to follow for this PR? Thanks.

Jan 11 '23 06:01 JunHe77

Well received, @terrelln . Will do the benchmark once your PR is ready. Thanks.

Jan 16 '23 06:01 JunHe77

Hi @JunHe77, I've just merged PR #3449.

I've found that on my M1 chip those loops perform at least as well as your C versions. But I would be happy to accept patches to the fast C loops if you can find a faster variant.

If the assembly implementation is still significantly faster than the fast C variant, I would be willing to accept the aarch64 assembly implementation. But it would have to be disabled by default, and require the definition of a macro in the build process ZSTD_ENABLE_AARCH64_ASM to use.

The reason it would have to be disabled by default, is that we have no way to continuously fuzz the code. All of our fuzzers run on x86-64 and i386. We want to minimize the amount of code that isn't fuzzed in zstd. I trust that your assembly is correct, but I don't feel comfortable including any non-trivial code in zstd that isn't fuzzed.

If we merge it disabled by default, and later oss-fuzz adds support for aarch64, we could start fuzzing that code and switch it to enabled by default.

Jan 25 '23 21:01 terrelln

zstd zstd copied to clipboard

Add fast huf_dec with generic C and tuned aarch64 assembly

zstd
zstd copied to clipboard