Add optimized slide_hash for Power processors
Hi,
During performance tests, we noticed that slide_hash consumes a considerable share of CPU time during compression on Power processors. This PR introduces an optimized version that uses VSX vector instructions. The main difference is that it slides 8 elements at a time, instead of one at a time as the standard code does.
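To illustrate the idea, here is a minimal portable sketch, not the code from this PR. It assumes 16-bit table entries and uses hypothetical names (`slide_table_scalar`, `slide_table_by8`). The chunked variant processes eight entries (128 bits) per outer iteration in plain C, where the actual change would use VSX vector instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: one table entry per iteration. An entry that would
 * point before the start of the slid window is reset to 0. */
static void slide_table_scalar(uint16_t *table, size_t n, uint16_t wsize) {
    for (size_t i = 0; i < n; i++) {
        uint16_t v = table[i];
        table[i] = (uint16_t)(v >= wsize ? v - wsize : 0);
    }
}

/* Chunked variant: eight 16-bit entries per outer iteration. The inner
 * loop is a saturating subtraction over one 128-bit group, which is the
 * shape a vector unit can execute as a single operation. */
static void slide_table_by8(uint16_t *table, size_t n, uint16_t wsize) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        for (size_t j = 0; j < 8; j++) {
            uint16_t v = table[i + j];
            table[i + j] = (uint16_t)(v >= wsize ? v - wsize : 0);
        }
    }
    /* Tail for table sizes that are not a multiple of 8. */
    for (; i < n; i++) {
        uint16_t v = table[i];
        table[i] = (uint16_t)(v >= wsize ? v - wsize : 0);
    }
}
```

Both functions produce the same table contents. The second only changes how the work is grouped.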
The implementation uses the GNU indirect function (ifunc) feature to select the appropriate function version at runtime, on the first call. Later calls all go directly to the selected function. This way, the same binary can be used across all Power processor versions. The ifunc helper code is not limited to Power, however, and can be reused by other architectures if desired, so it was placed under `contrib/gcc`.
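The resolve-on-first-call behavior can be sketched with a plain function pointer. This is a portable stand-in for ifunc, not the ifunc mechanism itself and not the helper from this PR. All names here (`slide`, `slide_resolve`, `cpu_has_wide_vectors`) are hypothetical, and the CPU probe is a stub.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*slide_fn)(uint16_t *table, size_t n, uint16_t wsize);

/* Baseline implementation that works on every CPU. */
static void slide_generic(uint16_t *table, size_t n, uint16_t wsize) {
    for (size_t i = 0; i < n; i++) {
        uint16_t v = table[i];
        table[i] = (uint16_t)(v >= wsize ? v - wsize : 0);
    }
}

/* Placeholder for an optimized version; a real build would use vector
 * instructions here. It reuses the generic code to stay portable. */
static void slide_wide(uint16_t *table, size_t n, uint16_t wsize) {
    slide_generic(table, n, wsize);
}

/* Hypothetical CPU feature probe, stubbed out for this sketch. */
static int cpu_has_wide_vectors(void) {
    return 0;
}

static int resolve_calls = 0; /* counts how often the resolver ran */

static void slide_resolve(uint16_t *table, size_t n, uint16_t wsize);

/* Starts out pointing at the resolver, then at the chosen version. */
static slide_fn slide_impl = slide_resolve;

/* Runs on the first call only: picks a version, stores it, forwards. */
static void slide_resolve(uint16_t *table, size_t n, uint16_t wsize) {
    resolve_calls++;
    slide_impl = cpu_has_wide_vectors() ? slide_wide : slide_generic;
    slide_impl(table, n, wsize);
}

/* Public entry point: every call goes through the pointer. */
static void slide(uint16_t *table, size_t n, uint16_t wsize) {
    slide_impl(table, n, wsize);
}
```

Unlike this sketch, ifunc lets the dynamic linker bind the symbol, so callers need no explicit pointer indirection in their source. The sketch is also not thread-safe on the first call.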
I tried to make as few changes as possible to top-level files (`deflate.c`), and instead place most Power-specific code under `contrib/power`.
To measure the performance improvement, we used Chromium's zlib_bench.cc with input files from jsnell/zlib-bench.
The results below show compression throughput in MB/s using RAW deflate, for all compression levels:
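- jpeg

  | comp lvl | default | optimized | gain |
  |----------|---------|-----------|------|
  | 1 | 20.4 | 27.4 | +34.31% |
  | 2 | 20.2 | 26.4 | +30.69% |
  | 3 | 20.2 | 27.1 | +34.16% |
  | 4 | 20.3 | 27.3 | +34.48% |
  | 5 | 20.3 | 27.3 | +34.48% |
  | 6 | 20.3 | 27.3 | +34.48% |
  | 7 | 20.3 | 27.3 | +34.48% |
  | 8 | 20.3 | 27.3 | +34.48% |
  | 9 | 20.3 | 27.3 | +34.48% |

- pngpixels

  | comp lvl | default | optimized | gain |
  |----------|---------|-----------|------|
  | 1 | 67.0 | 98.6 | +47.16% |
  | 2 | 58.7 | 79.8 | +35.95% |
  | 3 | 38.8 | 46.7 | +20.36% |
  | 4 | 42.1 | 48.8 | +15.91% |
  | 5 | 26.6 | 29.2 | +9.77% |
  | 6 | 13.8 | 14.5 | +5.07% |
  | 7 | 8.9 | 9.2 | +3.37% |
  | 8 | 2.8 | 2.8 | +0.00% |
  | 9 | 1.3 | 1.3 | +0.00% |

- executable

  | comp lvl | default | optimized | gain |
  |----------|---------|-----------|------|
  | 1 | 41.3 | 57.6 | +39.47% |
  | 2 | 37.9 | 50.9 | +34.30% |
  | 3 | 29.0 | 36.1 | +24.48% |
  | 4 | 28.4 | 34.8 | +22.54% |
  | 5 | 20.2 | 23.2 | +14.85% |
  | 6 | 12.5 | 13.7 | +9.60% |
  | 7 | 9.5 | 10.1 | +6.32% |
  | 8 | 5.4 | 5.6 | +3.70% |
  | 9 | 4.1 | 4.2 | +2.44% |

- html

  | comp lvl | default | optimized | gain |
  |----------|---------|-----------|------|
  | 1 | 43.1 | 59.3 | +37.59% |
  | 2 | 38.6 | 50.7 | +31.35% |
  | 3 | 27.8 | 33.8 | +21.58% |
  | 4 | 28.3 | 33.1 | +16.96% |
  | 5 | 18.1 | 20.1 | +11.05% |
  | 6 | 12.2 | 13.0 | +6.56% |
  | 7 | 10.6 | 11.2 | +5.66% |
  | 8 | 8.0 | 8.4 | +5.00% |
  | 9 | 7.9 | 8.3 | +5.06% |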
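
For reference, the gain column is consistent with the relative throughput increase over the default build. A small sketch of that arithmetic, with a hypothetical helper name (`gain_percent`):

```c
/* Relative throughput gain in percent: how much faster the optimized
 * build is compared with the default build. Inputs are in MB/s. */
static double gain_percent(double default_mbps, double optimized_mbps) {
    return (optimized_mbps - default_mbps) / default_mbps * 100.0;
}
```

For the jpeg input at level 1, `gain_percent(20.4, 27.4)` evaluates to roughly 34.31, matching the reported +34.31%.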
Force push to add changes to feature detection on `configure`.