xxHash
Refactor to improve code size, fix GCC 12
Overview
Kinda a dump of (mostly) related changes I've been sitting on and never bothered to commit.
This is a set of changes to:
- Reduce code size on `-O3`
  - Especially on 32-bit, where `-O3` could get well over 100 kB of `.text`
  - I feel this could still be improved if `hashLong` constant propagation can be reduced
- Improve `-Os`/`XXH_SIZE_OPT` performance
- Make things more clean and modular
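The `-Os`/`XXH_SIZE_OPT` interaction can be sketched as a compile-time gate. `XXH_SIZE_OPT` and `__OPTIMIZE_SIZE__` are real macros; the `HOT_INLINE` name and `rotl32` helper below are illustrative, not xxHash's actual code:

```c
#include <assert.h>

/* XXH_SIZE_OPT is the real knob; everything else here is a sketch of
 * how a size-opt gate can downgrade inlining under -Os/-Oz. */
#ifndef XXH_SIZE_OPT
#  ifdef __OPTIMIZE_SIZE__          /* predefined by gcc/clang at -Os/-Oz */
#    define XXH_SIZE_OPT 1
#  else
#    define XXH_SIZE_OPT 0
#  endif
#endif

#if XXH_SIZE_OPT <= 0 && defined(__GNUC__)
#  define HOT_INLINE static __inline__ __attribute__((always_inline))
#else
#  define HOT_INLINE static          /* let the compiler decide */
#endif

HOT_INLINE unsigned rotl32(unsigned x, int r)
{
    return (x << r) | (x >> (32 - r));
}
```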
These haven't been micro-benchmarked yet. The "what is worth inlining" is a rough estimate, and I want to go through the various recent compilers to find what works best.
Major changes:
- GCC 12 will now be forced to `-O3` to work around the `-Og` bug. Fixes #800, #720
  - Only applies when `__OPTIMIZE__`, `!__NO_INLINE__` and `XXH_SIZE_OPT <= 0` (`!__OPTIMIZE_SIZE__`), so `-O2` and `-Og` both apply
  - I tried adjusting the inline hints, but that tanked performance.
- `XXH_INLINE`: Acts like the existing `XXH_FORCE_INLINE`, where it is disabled when `XXH_SIZE_OPT` is used.
- `XXH_FORCE_INLINE` now actually force-inlines unconditionally (when inline hints are on) and is used on XXH3's and XXH32's core loops and utilities.
  - Greatly improves GCC `-Os` performance with only a small code size overhead
- `XXH_32BIT_OPT`, `XXH_INLINE_64BIT`: These are used to reduce code bloat on 32-bit
  - This, with the other related optimizations, reduces `clang armv4t -O3` code size by about half, now being <100 kB
- `XXH32` and `XXH64` have been rewritten to reuse the algorithm steps
- `XXH3_update` has been rewritten to be much easier to understand and much smaller.
- `XXH3_128bits_update` now tail calls `XXH3_64bits_update`, greatly reducing code size at the cost of one branch
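The tail-call change in the last bullet can be sketched as follows. All names below are hypothetical stand-ins, not xxHash's real code; the point is that the large shared update body exists only once in the binary, with only finalization differing between the variants:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { XXH_OK_SKETCH = 0, XXH_ERROR_SKETCH = 1 } XXH_errorcode_sketch;

/* Hypothetical minimal streaming state: both variants consume input
 * identically, so the expensive update logic only needs to exist once. */
typedef struct { size_t totalLen; } XXH3_state_sketch;

static XXH_errorcode_sketch
XXH3_64bits_update_sketch(XXH3_state_sketch* s, const void* input, size_t len)
{
    if (input == NULL && len != 0) return XXH_ERROR_SKETCH;
    s->totalLen += len;          /* stand-in for the real accumulation */
    return XXH_OK_SKETCH;
}

/* The 128-bit update becomes a plain tail call: one extra branch at the
 * call site, but the update body is not duplicated in the binary. */
static XXH_errorcode_sketch
XXH3_128bits_update_sketch(XXH3_state_sketch* s, const void* input, size_t len)
{
    return XXH3_64bits_update_sketch(s, input, len);
}
```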
Minor changes:
- GCC 11 and higher no longer force `-O2` on AVX2; that bug has been fixed.
- `XXH32_state_t` and `XXH64_state_t` have some fields renamed and now use `unsigned char [N]` for their buffer
- `XXH_OLD_NAMES` has been removed; it's been long enough
- Fixed a few inline hints
- Some reused snippets in XXH3 have been extracted into functions.
- Fixed a theoretical integer overflow in the update code
Seems like some apt repos are down, which is causing some things to fail.
I also have an idea that seems to get full performance (at least at large scale, I'll have to test the latency) with only 2 copies of `XXH3_hashLong_internal_loop`.
This manages to get `clang aarch64 -O3` to 24 kB compared to 36 kB, and `clang armv4t -O3` from 92 kB to 61 kB (far less than half of the 160 kB before this PR):
```c
typedef struct {
    XXH_ALIGN_MEMBER(64, xxh_u64 acc[8]);
} XXH3_acc_t;

XXH_FORCE_INLINE XXH3_acc_t
XXH3_hashLong_internal_loop_impl(...)
{
    XXH3_acc_t acc = { XXH3_INIT_ACC };
    /* normal internal_loop stuff */
    return acc;
}

XXH_NO_INLINE XXH3_acc_t
XXH3_hashLong(const xxh_u8* XXH_RESTRICT input, size_t len,
              const xxh_u8* XXH_RESTRICT secret, size_t secretSize)
{
    return XXH3_hashLong_internal_loop_impl(input, len, secret, secretSize,
                                            XXH3_accumulate, XXH3_scrambleAcc);
}

#if XXH_SIZE_OPT <= 0
/* Constant folded for the default secret size */
XXH_NO_INLINE XXH3_acc_t
XXH3_hashLong_defaultSecretSize(const xxh_u8* XXH_RESTRICT input, size_t len,
                                const xxh_u8* XXH_RESTRICT secret, size_t secretSize)
{
    (void) secretSize;
    return XXH3_hashLong_internal_loop_impl(input, len, secret, sizeof(XXH3_kSecret),
                                            XXH3_accumulate, XXH3_scrambleAcc);
}
#else
/* Don't constant fold on size opt */
#  define XXH3_hashLong_defaultSecretSize XXH3_hashLong
#endif
```
Returning a struct like this is functionally identical to memcpying to an out pointer at the ABI level, but the compiler knows more about it and can optimize it better. (Just having an output pointer without inlining is 40% slower.)
I would have to clean it up, rename it, and test it some more though, so I would probably keep this for another PR.
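A minimal sketch of the struct-return pattern being described (illustrative types and names, not xxHash's):

```c
#include <assert.h>
#include <string.h>

/* 64 bytes of accumulator state as a value type. On common ABIs a
 * struct this large is returned through a hidden pointer (sret), so the
 * calling convention matches an out-parameter; the win is that the
 * compiler knows the destination cannot alias anything else. */
typedef struct { unsigned long long acc[8]; } acc_sketch_t;

static acc_sketch_t make_acc(unsigned long long seed)
{
    acc_sketch_t a;
    for (int i = 0; i < 8; i++)
        a.acc[i] = seed + (unsigned long long)i;
    return a;
}

/* The out-pointer equivalent: behaviorally identical, but the pointer
 * may alias, which restricts optimization when calls aren't inlined. */
static void make_acc_out(acc_sketch_t* out, unsigned long long seed)
{
    acc_sketch_t tmp = make_acc(seed);
    memcpy(out, &tmp, sizeof(tmp));
}
```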
Edit: This could be a step towards making dispatching on x86 standard as well, since the overridden functions are now only 2-3 (also, the feature test is 99% logging; it normally compiles to around 50 instructions).
How do the rest of these changes look aside from that (which, since it would affect dispatching, is a huge change that will need some more testing)? The changes shouldn't affect 64-bit codegen that much (although `XXH3_update` may be worth some benchmarking).
The biggest changes codegen-wise are 32-bit not being forced to inline and unroll ugly functions as much (which have more latency than a call frame/loop and aren't worth the code size increase) and `-Os` being faster.
I'm going to leave the hashLong rework to another PR because that would need reworking dispatching, and that would be a perfect time to redo that mess.
I'm trying to get a sense of the scope of the changes in this PR.
To rephrase it:
- It's mostly focused on reducing the binary size, though speed might also be affected as a side effect
- It mostly concerns the `-Os` compilation mode; performance modes `-O2` and `-O3` should not be negatively impacted
- It mostly concerns 32-bit compilation mode; 64-bit should not be negatively impacted
- Speed change is expected to be roughly neutral, except for 32-bit `-Os` where it's expected to be positive.
- Not sure if the `arm` target is impacted differently than the `x86`/`x64` ones.
If that's the correct scope, then there are a number of measurements to do and publish to document this PR.
Longer term, I wonder if there is something that could / should be done to ensure these properties are preserved in the future. Otherwise, it would be easy for a future PR to inflate the binary size without anyone noticing it.
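On preserving these properties over time: one low-tech guard is a CI step that fails when an artifact exceeds a byte budget. A sketch, assuming a POSIX shell; the helper name and budget value are placeholders, not an existing xxHash script:

```shell
# Fail if an artifact grows past a byte budget.
# Usage: check_size_budget <file> <max_bytes>
check_size_budget() {
    actual=$(wc -c < "$1")
    if [ "$actual" -gt "$2" ]; then
        echo "$1 is ${actual} bytes, over the ${2}-byte budget" >&2
        return 1
    fi
    return 0
}

# Example CI invocation (placeholder budget):
# check_size_budget libxxhash.a 40000 || exit 1
```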
I made a series of measurements, leading to surprising results (at least for me), when comparing this `pr802` to the `dev` branch.
Let's start with `x64` compilation mode. Compiler is `gcc v9.4.0`, cpu is `i7-9700k` (turbo off). Performance flag only, meaning the vector extension is the default `SSE2`. Let's measure the size of the produced `libxxhash`.
| compiler flags | dev lib size | pr802 lib size | difference |
|---|---|---|---|
| `-O3` | 76392 | 60080 | -16312 |
| `-O2` | 51936 | 43912 | -8024 |
| `-Os` | 32208 | 36600 | +4392 |
So that's the first surprise: the changes are beneficial to binary sizes for the performance modes, but when it comes to the size-optimized `-Os` mode, binary size is actually increased. It's not necessarily a bad deal, but since it seems to achieve the inverse of what's advertised, I figured it's worth underlining.
Now let's have a look at performance. The scenario measured is the default `-b`, aka 100 KB (long hash). Numbers measure `XXH32`, `XXH64` and `XXH128` successively.
| compiler flags | dev GB/s | pr802 GB/s | note |
|---|---|---|---|
| `-O3` | 6.9, 13.7, 20.1 | 6.9, 13.7, 20.1 | identical |
| `-O2` | 6.9, 13.7, 14.0 | 6.9, 13.7, 14.0 | identical |
| `-Os` | 6.9, 13.7, 12.6 | 6.9, 13.7, 14.9 | XXH128 is improved, even faster than `-O2` |
To begin with, at `-O3` and `-O2`, `pr802` features no speed change. So, combined with the lower binary size, this `pr802` is a win globally. Nicely done.
Note how performance of `XXH128` plummets seriously at `-O2`, which is probably because `gcc` needs `-O3` to generate efficient vector code. `XXH32` and `XXH64` in contrast remain unaffected.
Finally, at `-Os`, `XXH32` and `XXH64` remain performance neutral, while `XXH128` improves with `pr802`, to the point of running slightly faster than `-O2` (yet far from `-O3`). I wonder if there are some useful learnings there.
That makes `-Os` plainly better than `-O2` on `pr802`, since it wins on both speed and binary size.
So, some surprises, but all in all, this PR has positives.
One could still argue that, for `XXH32` and `XXH64`, `dev` + `-Os` is actually the better choice, since performance is unaffected, while binary size is the smallest.
But for `XXH128`, note that the larger binary size of `-Os` comes with better performance too, so it's a reasonable trade-off. In any case, `pr802` + `-Os` is superior in all metrics to `dev` + `-O2`, since it provides both smaller binary size and faster speed for `XXH128`. That's worth mentioning.
Now, it was also mentioned that this PR is supposed to be even more useful for 32-bit mode, so let's try that. Once again, compiler is `gcc v9.4.0`, only performance flags are changed, meaning that vector extensions are likely disabled by default.
| compiler flags | dev lib size | pr802 lib size | difference |
|---|---|---|---|
| `-m32 -O3` | 111860 | 73174 | -38686 |
| `-m32 -O2` | 70924 | 46968 | -23956 |
| `-m32 -Os` | 34876 | 34920 | +44 |
The first thing to note is how big these binary sizes are, compared to `x64` binaries. This might be because there is no vector extension by default. This could be tested in more detail by separating the `XXH3` variants from the rest.
The binary size decrease from `pr802` is huge for the performance modes, especially at `-O3`.
Once again, `-Os` goes against the trend, though the difference is pretty minimal in this case; that's essentially a tie.
Now, what about performance?

| compiler flags | dev GB/s | pr802 GB/s |
|---|---|---|
| `-m32 -O3` | 6.9, 3.1, 5.7 | 6.9, 2.9, 5.6 |
| `-m32 -O2` | 6.9, 2.8, 2.8 | 6.9, 2.7, 2.7 |
| `-m32 -Os` | 6.9, 2.7, 2.6 | 6.9, 2.6, 2.4 |
Once again, `XXH32` proves very resilient, offering the same performance at all compilation modes; note that it's the same performance as in `x64` mode, despite the lower number of registers.
We already knew that `XXH64` is a bad fit for 32-bit mode, and this is confirmed again in these measurements, no surprise. Note though that in this case, `-O3` does deliver some speed benefits (almost +10%), so it's no longer clear that `-Os` is the better choice for `XXH64`.
`XXH128` performance is also much lower, most likely due to the disappearance of vector extensions. If `SSE2` was still there, it would be a very different story. Anyway, in this case, `-O3` improves the performance of `XXH128` by a whopping +100%, so it really matters.
Finally, in these measurements, performance of `pr802` seems slightly lower than `dev`. Nothing extravagant, but worth mentioning.
It's difficult to draw conclusions from this compilation exercise. `pr802` delivers significant size benefits for `-O3` and `-O2`, and the minor performance losses are probably acceptable. What's more concerning is the `-Os` mode, for which `pr802` is both slightly larger and slightly slower. Nothing "serious", but still not in the right direction. And I initially expected this mode to be the most beneficial for `pr802`, so clearly, there was some misunderstanding.
There are other axes that could be analyzed. For example, add an `aarch64` test. And maybe a 32-bit `armv7` too. Measure the binary size impact of `XXH3` separately from the legacy `XXH32` and `XXH64`. Add a few different vectors. Add a few different compilers. The test matrix quickly becomes overwhelming, and it's likely going to be difficult to produce a change that's only beneficial.
One potential idea: is it possible to break this PR into smaller stages? Maybe some of these stages produce uncontroversial benefits that could be quickly merged? Maybe others require pondering some trade-offs (which can still be acceptable)? Finally, maybe there are useful learnings that could be used to improve other parts of the PR? Open for discussion.
Also, I would recommend rebasing this PR on top of `dev`, just to make sure that observed performance differences come solely from this PR, and are not confused with some parallel PR that would have been merged to `dev` but not `pr802`.
Since the focus of this PR is on binary size, I wanted to have a more in-depth look at these evolutions.
In the following tables, I'm separating the variants `XXH32`, `XXH64` and `XXH128`, by producing limited dynamic library binaries (note: maybe I should have used static library size instead).
For `XXH32`, the produced `libxxhash` only contains `XXH32` and some common functions.
For `XXH64`, the produced `libxxhash` has both the previous content and `XXH64`, but no `XXH3` variant.
Finally, for `XXH128`, it's the full library.
The hope is that it will make it possible to observe the binary size changes of `XXH32` and `XXH64` separately, without mixing in the (probably larger) changes due to vector extensions exploited by the `XXH3` variants.
In the same vein, it's going to be more natural to add more vector extensions for comparisons.
Let's start this exercise with `XXH32`:
| algo | target | compiler | flags | dev size | pr802 size | lib size diff |
|---|---|---|---|---|---|---|
| XXH32 | x64 | gcc v9.4.0 | `-O3` | 16576 | 16556 | -20 |
| XXH32 | x64 | gcc v9.4.0 | `-O2` | 16616 | 16616 | 0 |
| XXH32 | x64 | gcc v9.4.0 | `-Os` | 16600 | 16664 | +64 |
| XXH32 | x86 | gcc v9.4.0 | `-m32 -O3` | 15456 | 15456 | 0 |
| XXH32 | x86 | gcc v9.4.0 | `-m32 -O2` | 15488 | 15488 | 0 |
| XXH32 | x86 | gcc v9.4.0 | `-m32 -Os` | 15440 | 15532 | +92 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | `-O3` | 33934 | 33934 | 0 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | `-O2` | 33934 | 33934 | 0 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | `-Os` | 33966 | 34014 | +48 |
A few interesting details here.
To begin with, we know that performance remains largely unaffected for `XXH32`, staying at `6.9 GB/s` on the `i7-9700k` no-turbo target cpu, whatever the compilation mode and flags. The story is essentially the same on `M1`, with `XXH32` delivering a consistent `7.0 GB/s`, and a small (unimportant) degradation at `pr802` with the `-Os` setting.
We also note that the library size differences between modes and commits remain extremely small, which is consistent.
Two remarkable trends are worth mentioning though:
- Modes expected to produce smaller binary sizes actually produce larger binary sizes. This trend is clearer for `pr802`. At the end of the story, the differences are not large, but that still goes against expectations.
- `pr802` is especially worse at binary size for the `-Os` mode. This is also the reverse of my initial expectations.

It could be that binary size differences are in fact unrelated to `XXH32` proper, and due to minor differences in other generic symbols still compiled in the library, such as `XXH_versionNumber()` for example.
In conclusion, when it comes to `XXH32`, this `pr802` is roughly neutral.
Now, onto `XXH64`:
| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH64 | x64 | gcc v9.4.0 | `-O3` | 17072 | 17072 | 0 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | `-O2` | 17016 | 17016 | 0 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | `-Os` | 17040 | 17144 | +104 | 13.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | `-m32 -O3` | 19840 | 19872 | +32 | 3.1 -> 2.9 GB/s |
| XXH64 | x86 | gcc v9.4.0 | `-m32 -O2` | 19900 | 15924 | -3976 | 2.8 -> 2.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | `-m32 -Os` | 19852 | 15916 | -3936 | 2.7 -> 2.6 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | `-O3` | 34414 | 34414 | 0 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | `-O2` | 34414 | 34414 | 0 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | `-Os` | 34478 | 34718 | +304 | 14.0 -> 13.7 GB/s |
Some learnings:
In `x64` mode, the library size difference is very small compared to the `XXH32`-only library. Less than ~0.5 KB is added. It's a similar story for `aarch64`. This seems to show that `XXH64` is compact enough, at all compilation settings.
As we already know, speed remains unaffected, at `13.7 GB/s` on `i7-9700k`. It is only slightly degraded at `-Os` with `pr802` on `M1 Pro`, still offering roughly double the `XXH32` speed.
Unfortunately, `pr802` doesn't offer any binary size savings. Actually, `-Os` is slightly bigger than on `dev`.
In `x86` 32-bit mode, there is a more substantial binary size difference, and the `dev` branch requires > +4 KB to add `XXH64`. This suggests that the `XXH64` algorithm gets expanded significantly. Probably some loop unrolling.
Binary size is improved by `pr802`, showing that it's possible to add `XXH64` with a budget of roughly ~0.5 KB, similar to `x64` mode.
We also know that `XXH64` performance is much lower in 32-bit mode, due to the reliance on 64-bit multiply operations.
And unfortunately, `pr802` makes this situation slightly worse.
This would probably remain an acceptable trade-off when it saves almost ~4 KB of binary size, but strangely, at `-O3`, there is no such binary size saving, yet the performance difference is still very much present (checked several times).
It's unclear to me why there is a (small) performance difference between `dev` and `pr802` in 32-bit `x86` mode.
It's also unclear to me why binary size savings seem ineffective at `-O3` specifically, while `-O2` seems to benefit from them.
In conclusion, when it comes to `XXH64`, this `pr802` is slightly negative for performance. There is an interesting binary size saving on `x86` at the `-O2` and `-Os` settings, which could potentially compensate for the small speed degradation. But for 64-bit modes, there is no positive to report.
`XXH128`:
For this last one, we can simply re-employ results from the previous measurement exercise.
| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH128 | x64 + SSE2 | gcc v9.4.0 | `-O3` | 76392 | 60080 | -16312 | 20.1 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | `-O2` | 51936 | 43912 | -8024 | 14.0 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | `-Os` | 32208 | 36600 | +4392 | 12.6 -> 14.9 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | `-m32 -O3` | 111860 | 75436 | -36424 | 5.7 -> 5.6 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | `-m32 -O2` | 70924 | 46968 | -23956 | 2.8 -> 2.7 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | `-m32 -Os` | 34876 | 34920 | +44 | 2.2 -> 2.4 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | `-O3` | 69678 | 53246 | -16432 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | `-O2` | 69678 | 53246 | -16432 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | `-Os` | 37198 | 37198 | 0 | 10.9 GB/s |
To begin with, it's clear that library size increases significantly with the `XXH3` variants. Sure, there are 2 variants, divided into a number of range strategies, plus streaming capabilities, ok. Yet, I'm nonetheless surprised that it requires so much. The `-Os` compilation mode is the only one that seems in the "acceptable" range.
Here, `pr802` brings pretty big binary size savings, at `-O2` and especially `-O3`.
For `-Os` though, it doesn't. It actually increases size for the `x64` mode.
`-O3` remains an important compilation flag for best performance of `XXH128` on `gcc`: speed drops very significantly with `-O2` and `-Os`.
On `clang`, the picture is slightly different: speed is exactly the same at `-O3` and `-O2`. But at `-Os`, the performance drop is very steep.
The performance impact of `pr802` is all over the place, from neutral, to slightly negative, to positive.
The most interesting aspect is seeing the `-Os` mode receive a speed boost on `gcc` + intel cpu.
For `x64`, this speed boost is significant enough to beat `-O2`, which is really nice since the binary is also smaller.
For `x86`, the speed boost is not as important, but still a hefty +10% compared to `dev`.
So there's probably something worth investigating there.
To be continued: add more targets, more compilers, etc.
The previous analysis used dynamic library sizes to compare generated binary sizes. It was assumed that this wouldn't make much difference compared to static libraries, with maybe just a constant factor added. This is incorrect: static and dynamic library sizes are loosely correlated, but can evolve somewhat differently. This makes it difficult to decide which one is worth monitoring. It could be argued that the static library is closer to the concept of "generated binary size", which would justify preferring static library sizes. But this disregards the issue that these fragments of code, once integrated into a real executable (library or program), have to respect additional rules, such as code and data alignment. As a consequence, the impact of a change will differ depending on the target and on the linking process. Therefore, it could also be argued that the dynamic library size gives a better sense of the final impact of a change.
This is relatively difficult to make sense of, and difficult to decide, so I left both analyses in the thread. We can continue from the static or the dynamic library size analysis, depending on preference.
Since the focus of this PR is on binary size, I wanted to have a more in-depth look at these evolutions.
In the following tables, I'm separating the variants `XXH32`, `XXH64` and `XXH128`, by producing limited static library binaries.
For `XXH32`, the produced `libxxhash` only contains `XXH32` and some common functions.
For `XXH64`, the produced `libxxhash` has both the previous content and `XXH64`, but no `XXH3` variant.
Finally, for `XXH128`, it's the full library.
The hope is that it will make it possible to observe the binary size changes of `XXH32` and `XXH64` separately, without mixing in the (probably larger) changes due to vector extensions exploited by the `XXH3` variants.
In the same vein, it's going to be more natural to add more vector extensions for comparisons.
Let's start this exercise with `XXH32`:
| algo | target | compiler | flags | dev size | pr802 size | lib size diff |
|---|---|---|---|---|---|---|
| XXH32 | x64 | gcc v9.4.0 | `-O3` | 4268 | 4404 | +136 |
| XXH32 | x64 | gcc v9.4.0 | `-O2` | 4108 | 4228 | +120 |
| XXH32 | x64 | gcc v9.4.0 | `-Os` | 3780 | 3836 | +56 |
| XXH32 | x86 | gcc v9.4.0 | `-m32 -O3` | 4412 | 4460 | +48 |
| XXH32 | x86 | gcc v9.4.0 | `-m32 -O2` | 4200 | 4248 | +48 |
| XXH32 | x86 | gcc v9.4.0 | `-m32 -Os` | 3594 | 3590 | -4 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | `-O3` | 3416 | 3432 | +16 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | `-O2` | 3400 | 3416 | +16 |
| XXH32 | aarch64 M1 | Apple clang v14.0 | `-Os` | 3392 | 3432 | +40 |
We know that `XXH32` speed remains largely unaffected, staying at `6.9 GB/s` on the `i7-9700k` no-turbo target cpu, whatever the compilation mode and flags. The story is essentially the same on `M1`, with `XXH32` delivering a consistent `7.0 GB/s`, and a small (unimportant) degradation at `pr802` with the `-Os` setting.
The library size differences between modes and commits remain small, which is unsurprising.
But we can also notice that the `pr802` impact is generally negative for binary size.
Not by a large amount, therefore it's not a big deal.
Since there is no corresponding speed benefit, `pr802` looks globally slightly negative for the `XXH32`-only scenario.
But as mentioned, this is minor, and can probably be ignored if there is a much larger benefit for other scenarios.
Now, onto `XXH64`:
| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH64 | x64 | gcc v9.4.0 | `-O3` | 7526 | 7886 | +360 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | `-O2` | 7126 | 7446 | +320 | 13.7 GB/s |
| XXH64 | x64 | gcc v9.4.0 | `-Os` | 6326 | 6406 | +80 | 13.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | `-m32 -O3` | 10638 | 9866 | -772 | 3.1 -> 2.9 GB/s |
| XXH64 | x86 | gcc v9.4.0 | `-m32 -O2` | 9110 | 7814 | -1296 | 2.8 -> 2.7 GB/s |
| XXH64 | x86 | gcc v9.4.0 | `-m32 -Os` | 8048 | 6472 | -1576 | 2.7 -> 2.6 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | `-O3` | 5896 | 5912 | +16 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | `-O2` | 5832 | 5848 | +16 | 14.0 GB/s |
| XXH64 | aarch64 M1 | Apple clang v14.0 | `-Os` | 5848 | 6128 | +280 | 14.0 -> 13.7 GB/s |
Some learnings:
As we already know, in 64-bit modes (`x64`, `aarch64`), speed remains unaffected, at `13.7 GB/s` on `i7-9700k`. It is only slightly degraded at `-Os` with `pr802` on `M1 Pro`, still offering roughly double the `XXH32` speed.
Unfortunately, `pr802` doesn't offer any binary size savings for this scenario. It is actually slightly bigger than `dev`.
In `x86` 32-bit mode, there is a more substantial binary size difference, and the `dev` branch requires a 2x binary size budget to add `XXH64`. This suggests that the `XXH64` algorithm gets expanded significantly. Probably some loop unrolling.
Binary size is improved by `pr802`, showing that it's possible to add `XXH64` with a budget roughly similar to `x64` mode.
We also know that `XXH64` performance is much lower in 32-bit mode, due to the reliance on 64-bit multiply operations.
And unfortunately, `pr802` makes this situation slightly worse.
This would probably remain an acceptable trade-off when it saves binary size, but strangely, at `-O3`, binary size savings are not great, yet the performance difference is very much present (checked multiple times).
It's unclear to me why there is a (small) performance difference between `dev` and `pr802` in 32-bit `x86` mode.
It's also unclear to me why the binary size saving is less effective at `-O3` specifically, while `-O2` seems to benefit more from it.
In conclusion, when it comes to `XXH64`, this `pr802` seems slightly negative for performance. There is an interesting binary size saving on `x86`, especially at the `-O2` and `-Os` settings, which could potentially compensate for the small speed degradation. But for 64-bit modes, there is unfortunately no positive outcome to report.
`XXH128`:
Let's redo the binary size measurements, using static library size instead.
| algo | target | compiler | flags | dev size | pr802 size | lib size diff | perf evol |
|---|---|---|---|---|---|---|---|
| XXH128 | x64 + SSE2 | gcc v9.4.0 | `-O3` | 67232 | 53200 | -14032 | 20.1 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | `-O2` | 40880 | 36864 | -4016 | 14.0 GB/s |
| XXH128 | x64 + SSE2 | gcc v9.4.0 | `-Os` | 19752 | 20432 | +680 | 12.6 -> 14.9 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | `-m32 -O3` | 107280 | 73174 | -34106 | 5.7 -> 5.6 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | `-m32 -O2` | 60796 | 39188 | -21608 | 2.8 -> 2.7 GB/s |
| XXH128 | x86 + scalar | gcc v9.4.0 | `-m32 -Os` | 25152 | 23278 | -1878 | 2.2 -> 2.4 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | `-O3` | 50456 | 44408 | -6048 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | `-O2` | 48768 | 42248 | -6520 | 35.8 GB/s |
| XXH128 | aarch64 M1 + NEON | Apple clang v14.0 | `-Os` | 21600 | 24336 | +2736 | 10.9 GB/s |
To begin with, it's clear that library size increases significantly with the `XXH3` variants. Sure, there are 2 variants, divided into a number of range strategies, plus streaming capabilities, ok. Yet, I'm nonetheless surprised that it requires so much. The `-Os` compilation mode is the only one that lies in the "acceptable" range.
Here, `pr802` brings pretty big binary size savings, at `-O2` and especially `-O3`.
For `-Os` though, it's less clear, and can actually increase size.
`-O3` remains an important compilation flag for best performance of `XXH128` on `gcc`: speed drops very significantly with `-O2` and `-Os`.
On `clang`, the picture is slightly different: speed is exactly the same at `-O3` and `-O2`. But at `-Os`, the performance drop is very steep.
The performance impact of `pr802` is all over the place, from neutral, to slightly negative, to positive.
The most interesting aspect is seeing the `-Os` mode receive a speed boost on `gcc` + intel cpu.
For `x64`, this speed boost is significant enough to beat `-O2`, which is really nice since binary size is also smaller.
For `x86`, the speed boost is not as important, but still a hefty +10% compared to `dev`.
So there's probably something worth investigating there.
Note: some results are sometimes surprising. Ensuring that `pr802` is based on top of `dev` is also important so this comparison analysis doesn't produce wrong conclusions.
To be continued: add more targets, more compilers, etc.
> is it possible to break this PR into smaller stages ?

Sure, I can do that. Sorry I adhd'd again 😵‍💫
I think the isolated changes that are not going to be controversial are:
1. Force GCC 12 to `-O3`, because there doesn't seem to be a solution for `-Og` without ruining everything else
2. Kill `XXH_OLD_NAMES`
3. Rework the `XXH3_update` code to not dupe itself so much
4. Force inline the lowest level functions on `-Os`. However, perhaps don't do it for anything above `round` or `acc512`, and don't use noinline either.
5. Reduce the unrolling and inlining on 32-bit (but with some more benchmarking)
6. Outline (and perhaps label) the individual steps better

The future changes could be:
a. Implement the hashLong refactor
b. Improve selections 4 and 5 based on testing
c. If we are going to be reducing inlining, perhaps we could sneak in a `__attribute__((regparm(3)))` on x86 to make tail calls better?
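For (c), a sketch of how such a wrapper could look. `XXH_REGPARM` is a hypothetical name; `regparm` only applies on 32-bit x86 with GCC-compatible compilers, so it expands to nothing elsewhere:

```c
#include <assert.h>

/* __attribute__((regparm(3))) passes the first three integer arguments
 * in registers on 32-bit x86, shrinking call and tail-call overhead.
 * XXH_REGPARM is a hypothetical macro; on other targets/compilers it
 * is a no-op so the code stays portable. */
#if defined(__GNUC__) && defined(__i386__)
#  define XXH_REGPARM __attribute__((regparm(3)))
#else
#  define XXH_REGPARM
#endif

static XXH_REGPARM unsigned add3(unsigned a, unsigned b, unsigned c)
{
    return a + b + c;
}
```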
> the changes are beneficial to binary sizes for the performance mode, but when it comes to the size optimized -Os mode, binary size is actually increased.

Yes. That is expected.
On 64-bit targets, the size decrease at `-O3` comes from `XXH3_update`, which has been refactored and is no longer inlined twice. There haven't been that many changes otherwise.
The size increase at `-Os` is expected too, as I have changed it to inline the utilities and core functions. However, 4 kB is a little more than I want; perhaps that comes from the `XXH_NO_INLINE` clones.
This change is because `-Os` is a pretty common default optimization level in various projects, and the aggressive size optimizations should be more opt-in. This is actually the reason Clang added `-Oz`; `-Os` was originally an alias for `-O2` because people liked to overuse it.
The performance issues were more noticeable on GCC ARMv4, where it outlined `XXH_memcpy()` for unaligned reads.
Although I will have to look into fixing the GCC x86 performance issues. This might just be from `XXH_INLINE_64BIT` not inlining the right functions, since the performance drop seems to be consistent.
> To begin with, it's clear that library size increases significantly with XXH3 variants. Sure, there are 2 variants, divided into a number of range strategies, plus streaming capabilities, ok. Yet, I'm nonetheless surprised that it requires so much. The -Os compilation mode is the only one that seems in the "acceptable" range.

The way it is force-inlined, there are effectively 8.
You can easily see the size of each function if you compile with `-ffunction-sections` and use `size -A xxhash.o | sort -nk2`.
The short hashing code (`XXH3_[64,128]bits_internal`) is deceptively heavy on 32-bit. When compiled with `XXH_NO_INLINE_HINTS`, it is usually outlined and can be 1-2 kB by itself on 64-bit targets, and 5-10 kB on 32-bit.
However, arguably, there is some benefit to this, as the short hash is more sensitive to latency, and the constant propagation of `kSecret` and `seed==0` can be beneficial (especially on platforms with limited unaligned access, since all secret reads are aligned).
Additionally, as I mentioned before, the way hashLong is set up causes inline clones as well, which afaik doesn't seem to be beneficial now that the codepaths are identical.
Another note is that on x86, the 64->128-bit multiply is a lot of instructions because of how disruptive the `EAX:EDX` requirements are. Both GCC and Clang seem to be unaware of this code weight and inline/unroll 60+ bytes of code for each multiply. (I might make it `XXH_NO_INLINE` for all 32-bit but ARM with DSP.) Also, GCC 10+ on x86 emits a literal `imul reg, reg, 0`, which means I might need to hack `mult32to64` on that too.
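For context, the scalar 64->128-bit multiply decomposes into four 32x32->64 partial products plus carry handling, which is roughly the code weight that gets inlined on 32-bit x86. A sketch of the standard algorithm, not xxHash's exact `XXH_mult64to128`:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t low64, high64; } u128_sketch;

/* Portable 64x64->128 multiply: four 32x32->64 partial products plus a
 * carry-aware middle sum. On 32-bit x86 each 64-bit multiply already
 * expands into mul sequences tied to EDX:EAX, which is why inlining
 * this everywhere is so heavy. */
static u128_sketch mult64to128_sketch(uint64_t lhs, uint64_t rhs)
{
    uint64_t lo_lo = (lhs & 0xFFFFFFFF) * (rhs & 0xFFFFFFFF);
    uint64_t hi_lo = (lhs >> 32)        * (rhs & 0xFFFFFFFF);
    uint64_t lo_hi = (lhs & 0xFFFFFFFF) * (rhs >> 32);
    uint64_t hi_hi = (lhs >> 32)        * (rhs >> 32);

    /* the middle sum cannot overflow 64 bits */
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;

    u128_sketch r;
    r.high64 = hi_hi + (hi_lo >> 32) + (cross >> 32);
    r.low64  = (cross << 32) | (lo_lo & 0xFFFFFFFF);
    return r;
}
```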
The first thing I'll do is find the minimum amount of `#pragma GCC optimize("-O3")` or `__attribute__((optimize("-O3")))` to shut up GCC.
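A minimal form of that per-function override (GCC-only; the summing loop is just an illustrative stand-in for a hot path):

```c
#include <assert.h>
#include <stddef.h>

/* __attribute__((optimize("-O3"))) overrides the translation unit's
 * optimization level for one function on GCC. Clang does not honor it,
 * hence the guard. XXH_FORCE_O3 is a hypothetical macro name. */
#if defined(__GNUC__) && !defined(__clang__)
#  define XXH_FORCE_O3 __attribute__((optimize("-O3")))
#else
#  define XXH_FORCE_O3
#endif

static XXH_FORCE_O3 unsigned long long
sum_bytes(const unsigned char* p, size_t n)
{
    unsigned long long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += p[i];
    return acc;
}
```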
I have a rough draft of the hashLong refactor that is designed to be dispatched, but `update()` performance is pretty bad due to the function call latency (since `consumeStripes()` would need an `__attribute__((target))`). I might have to extract the entire second half of `XXH3_update()`. Changes start at ~line 5000.
https://gist.github.com/easyaspi314/795084ca1c04d4c884e6cff14cab80a4
I believe there were already several fixes merged for the gcc-12 `-Og` issue.
Therefore, is this PR still useful? Or can it be closed?
The "fix gcc-12 `-Og` compilation" issue seems fixed now, thanks to other contributions.
However, the PR also lists "improve code size" as an objective. Is there still something worth extracting from it?
I compared `dev` (258351b) and this PR (98be0b9).
Note that this report contains all XXH algorithms.
| algo | target | compiler | flags | dev size | pr802 size | lib size diff (pr802 - dev) |
|---|---|---|---|---|---|---|
| all | x64 + SSE2 | gcc v11.3.0 | `-O3` | 51176 | 52888 | +1712 |
| all | x64 + SSE2 | gcc v11.3.0 | `-O2` | 36752 | 37568 | +816 |
| all | x64 + SSE2 | gcc v11.3.0 | `-Os` | 19384 | 20432 | +1048 |
| all | x86 + scalar | gcc v11.3.0 | `-m32 -O3` | 89016 | 73770 | -15246 |
| all | x86 + scalar | gcc v11.3.0 | `-m32 -O2` | 52940 | 40494 | -12446 |
| all | x86 + scalar | gcc v11.3.0 | `-m32 -Os` | 24396 | 22734 | -1662 |
We can still see some size difference in x86 mode.
CSV

```
commit-hash,compiler,flags,filename,size
258351b,gcc-11,-O2,libxxhash.a,36752
258351b,gcc-11,-O3,libxxhash.a,51176
258351b,gcc-11,-Os,libxxhash.a,19384
258351b,gcc-11,-m32 -O2,libxxhash.a,52940
258351b,gcc-11,-m32 -O3,libxxhash.a,89016
258351b,gcc-11,-m32 -Os,libxxhash.a,24396
98be0b9,gcc-11,-O2,libxxhash.a,37568
98be0b9,gcc-11,-O3,libxxhash.a,52888
98be0b9,gcc-11,-Os,libxxhash.a,20432
98be0b9,gcc-11,-m32 -O2,libxxhash.a,40494
98be0b9,gcc-11,-m32 -O3,libxxhash.a,73770
98be0b9,gcc-11,-m32 -Os,libxxhash.a,22734
```
Procedure

```shell
cd
git clone https://github.com/Cyan4973/xxHash.git xxhash-pr-802
cd xxhash-pr-802
mkdir report
export CC="gcc-11"
git checkout dev
export CH=$(git --no-pager log -1 --format="%h")
export CFLAGS="-O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
git pull origin pull/802/head
git checkout FETCH_HEAD
git log -1
export CH=$(git --no-pager log -1 --format="%h")
export CFLAGS="-O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -O3" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -O2" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
export CFLAGS="-m32 -Os" && make clean all && stat -c "$CH,$CC,$CFLAGS,%n,%s" libxxhash.a > "report/$CH-$CC-$CFLAGS"
# Show CSV
cat report/*
```
As for code size, c6dc92f also reduces it and may implement more robust dispatch functionality.

> As for code size, c6dc92f also reduces it and may implement more robust dispatch functionality.

I should go back and finish that. I kinda got distracted by another project 😅
Ideally, I would like to publish release `v0.8.2` in the very near future.
I presume completing this objective is not trivial, and therefore should rather be expected as a potential contribution to a later version? (either `v0.8.3` or `v0.9.0`)
> I should go back and finish that. I kinda got distracted by another project 😅

> Ideally, I would like to publish release `v0.8.2` in the very near future. I presume completing this objective is not trivial, and therefore should rather be expected as a potential contribution to a later version? (either `v0.8.3` or `v0.9.0`)

Looking over the code, I think it just needs some clean-up, documentation, and benchmarking.