zstd
zstd copied to clipboard
compress:check more bytes to reduce ZSTD_count call
Signed-off-by: Jun He [email protected] Change-Id: I3449ea423d5c8e8344f75341f19a2d1643c703f6
This patch changes the comparison in the loop of longest match from 1B to 4B.
This extended comparison is safe as we compare the byte starting from match[3] and ip[3], and later move to match[ml] and ip[ml] when ml is bigger than 3 in the original implementation. So changing to 4B guarantees the reading of match and ip are always in a valid range.
Comparing 4B will skip those sequences that don't match {ip[ml+1-sizeof(U32)]..ip[ml-1]}, but happen to match at ip[ml], so the ZSTD_count call can be reduced.
The benchmark on Arm shows good uplift with this change (up to 8% on N1 and 11% on M1, at L8) with both gcc and clang.
@terrelln , @embg , anything I need to update on this PR? Thanks.
@terrelln , @embg , anything I need to update on this PR? Thanks.
Sorry for the delay, I just need to run benchmarks. Will do it tomorrow.
Edit: have been working through a huge backlog, sorry again for the delay. Will prioritize the benchmarking this week.
I see 2 levels of optimization bundled in this PR :
- Comparing a 4-bytes trailing
U32
, instead of a single byte, as explained in the PR - Saving the need to read
*(ip+ml-3)
by caching it into a variableref
The first level is very simple, and brings most of the benefits (>90 %).
It can be done by 2 simple line changes : replace if (match[ml] == ip[ml])
by if (MEM_read32(match + ml - 3) == MEM_read32(ip + ml - 3))
.
It's all well contained, trivial to read and maintain, and offers a decent speed benefit at higher compression levels (L8+).
The second level is more complex : it requires to maintain a state, ref
, which is updated in one place, and used in another. Moreover, the speed benefits it offers are questionable : it seems positive, over a large enough nb of tests, but it's small enough to be within noise margin. However, the maintenance load is higher, it makes updating this section of the code more difficult in the future.
If we were talking about the fast mode (zstd_fast.c
), the value of it could be debated, as speed matters a lot. But since we are talking about middle levels of performance, they are all trade-off, and maintenance complexity is part of the trade off.
So I would recommend to simply skip this part.
This would make this patch much leaner, essentially a 2-lines change, with no distance interaction, while delivering > 90% of the speed gains. A no brainer for merge.
This would make this patch much leaner, essentially a 2-lines change, with no distance interaction, while delivering > 90% of the speed gains. A no brainer for merge.
Good point @Cyan4973. I agree that the long-distance variables aren't worth the small additional perf win. @JunHe77 can you please update along the lines suggested by myself and @Cyan4973?
@embg , @Cyan4973 , thanks for the comments. Patch has been updated to reflect these.