zlib
Add optimized longest_match for Power processors
Hello again,
This optimization uses VSX vector (SIMD) instructions to match multiple bytes at once during the search for the longest match. A 16-byte vector load and comparison adds only a small overhead compared with the regular (scalar) versions, so the optimized longest_match tries to match as many bytes as possible on every comparison.
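To illustrate the idea of comparing 16 bytes per step, here is a minimal portable sketch. It is not the code from this PR: `match_length_chunked` is a hypothetical helper, and `memcmp` on a 16-byte chunk stands in for the VSX vector load and compare, so the sketch builds on any platform.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of zlib): returns the length of the
 * common prefix of a and b, looking at no more than max_len bytes. */
static size_t match_length_chunked(const unsigned char *a,
                                   const unsigned char *b,
                                   size_t max_len)
{
    size_t len = 0;

    /* Advance 16 bytes at a time while whole chunks are equal.
     * memcmp is a stand-in here for a vector load + vector compare. */
    while (len + 16 <= max_len && memcmp(a + len, b + len, 16) == 0)
        len += 16;

    /* Finish byte by byte: this finds the exact mismatch position
     * inside the first differing chunk, or handles the short tail. */
    while (len < max_len && a[len] == b[len])
        len++;

    return len;
}
```

A vectorized version would typically replace the final byte loop as well, deriving the mismatch position from the comparison result instead of rescanning the chunk.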
This PR shares one commit with #457 and #458; that commit can be dropped from this PR if either of them gets merged first. The PR also uses GNU indirect functions to choose at run time, on the first call to longest_match, which version of the function (optimized or default) to run.
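The resolve-on-first-call pattern can be sketched in portable C with a function pointer. This is only an analogue of the mechanism: GNU indirect functions let the dynamic loader do the selection through GCC's `ifunc` function attribute, whereas the sketch below does it by hand. All names here (`cpu_has_vsx`, `match_len`, `match_len_default`, `match_len_optimized`, `resolve_calls`) are hypothetical, and the signatures are simplified and do not match zlib's real longest_match.

```c
#include <stddef.h>

typedef size_t (*match_fn)(const unsigned char *, const unsigned char *, size_t);

/* Hypothetical feature probe. A real one would query the hardware
 * capabilities; this placeholder always reports "no VSX". */
static int cpu_has_vsx(void) { return 0; }

/* Default (scalar) version: byte-by-byte common prefix length. */
static size_t match_len_default(const unsigned char *a,
                                const unsigned char *b, size_t n)
{
    size_t i = 0;
    while (i < n && a[i] == b[i])
        i++;
    return i;
}

/* Placeholder for the optimized version; in this sketch it simply
 * reuses the scalar code so that both paths return the same result. */
static size_t match_len_optimized(const unsigned char *a,
                                  const unsigned char *b, size_t n)
{
    return match_len_default(a, b, n);
}

static size_t match_len_resolve(const unsigned char *a,
                                const unsigned char *b, size_t n);

/* The pointer starts at the resolver, so the first call selects a version. */
static match_fn match_len_impl = match_len_resolve;
static int resolve_calls = 0; /* only here to make the behavior observable */

static size_t match_len_resolve(const unsigned char *a,
                                const unsigned char *b, size_t n)
{
    resolve_calls++;
    /* Not thread-safe as written; a real implementation needs more care. */
    match_len_impl = cpu_has_vsx() ? match_len_optimized : match_len_default;
    return match_len_impl(a, b, n);
}

/* Public entry point: after the first call this goes straight to the
 * selected version. */
size_t match_len(const unsigned char *a, const unsigned char *b, size_t n)
{
    return match_len_impl(a, b, n);
}
```

With real indirect functions the extra pointer indirection in `match_len` is handled by the loader's symbol resolution, so callers need no changes.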
To test the performance improvement, we used Chromium's zlib_bench.cc with input files from jsnell/zlib-bench.
The results below show compression throughput in MB/s using RAW deflate, for all compression levels:
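- pngpixels

  | comp lvl | default | optimized | gain    |
  |----------|---------|-----------|---------|
  | 1        | 67.5    | 73.0      | +8.15%  |
  | 2        | 59.0    | 65.3      | +10.68% |
  | 3        | 38.8    | 45.2      | +16.49% |
  | 4        | 42.0    | 46.0      | +9.52%  |
  | 5        | 26.7    | 31.6      | +18.35% |
  | 6        | 13.8    | 16.5      | +19.57% |
  | 7        | 8.9     | 10.6      | +19.10% |
  | 8        | 2.8     | 3.4       | +21.43% |
  | 9        | 1.3     | 1.5       | +15.38% |

- jpeg

  | comp lvl | default | optimized | gain   |
  |----------|---------|-----------|--------|
  | 1        | 20.0    | 20.5      | +2.50% |
  | 2        | 20.2    | 20.3      | +0.50% |
  | 3        | 20.2    | 20.3      | +0.50% |
  | 4        | 20.3    | 20.4      | +0.49% |
  | 5        | 20.3    | 20.4      | +0.49% |
  | 6        | 20.3    | 20.4      | +0.49% |
  | 7        | 20.3    | 20.4      | +0.49% |
  | 8        | 19.9    | 20.4      | +2.51% |
  | 9        | 20.3    | 20.4      | +0.49% |

- executable

  | comp lvl | default | optimized | gain   |
  |----------|---------|-----------|--------|
  | 1        | 41.2    | 43.1      | +4.61% |
  | 2        | 37.8    | 39.2      | +3.70% |
  | 3        | 28.9    | 29.9      | +3.46% |
  | 4        | 28.3    | 28.9      | +2.12% |
  | 5        | 20.2    | 21.4      | +5.94% |
  | 6        | 12.5    | 13.1      | +4.80% |
  | 7        | 9.5     | 9.9       | +4.21% |
  | 8        | 5.4     | 5.6       | +3.70% |
  | 9        | 4.1     | 4.2       | +2.44% |

- html

  | comp lvl | default | optimized | gain    |
  |----------|---------|-----------|---------|
  | 1        | 43.0    | 46.2      | +7.44%  |
  | 2        | 38.5    | 42.2      | +9.61%  |
  | 3        | 27.8    | 30.8      | +10.79% |
  | 4        | 28.3    | 30.8      | +8.83%  |
  | 5        | 18.1    | 20.1      | +11.05% |
  | 6        | 12.2    | 13.2      | +8.20%  |
  | 7        | 10.6    | 11.4      | +7.55%  |
  | 8        | 8.0     | 8.7       | +8.75%  |
  | 9        | 7.9     | 8.6       | +8.86%  |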
Force-pushed to add changes to feature detection in configure.