zstd icon indicating copy to clipboard operation
zstd copied to clipboard

compress:check more bytes to reduce ZSTD_count call

Open JunHe77 opened this issue 2 years ago • 1 comments

Signed-off-by: Jun He [email protected] Change-Id: I3449ea423d5c8e8344f75341f19a2d1643c703f6

JunHe77 avatar Jul 18 '22 03:07 JunHe77

This patch changes the comparison in the loop of longest match from 1B to 4B.

This extended comparison is safe as we compare the byte starting from match[3] and ip[3], and later move to match[ml] and ip[ml] when ml is bigger than 3 in the original implementation. So changing to 4B guarantees the reading of match and ip are always in a valid range.

Comparing 4B will skip those sequences that don't match {ip[ml+1-sizeof(U32)]..ip[ml-1]}, but happen to match at ip[ml], so the ZSTD_count call can be reduced.

The benchmark on Arm shows good uplift with this change (up to 8% on N1 and 11% on M1, at L8) with both gcc and clang.

JunHe77 avatar Jul 18 '22 05:07 JunHe77

@terrelln , @embg , anything I need to update on this PR? Thanks.

JunHe77 avatar Aug 31 '22 02:08 JunHe77

@terrelln , @embg , anything I need to update on this PR? Thanks.

Sorry for the delay, I just need to run benchmarks. Will do it tomorrow.

Edit: have been working through a huge backlog, sorry again for the delay. Will prioritize the benchmarking this week.

embg avatar Aug 31 '22 02:08 embg

I see 2 levels of optimization bundled in this PR :

  • Comparing a 4-bytes trailing U32, instead of a single byte, as explained in the PR
  • Saving the need to read *(ip+ml-3) by caching it into a variable ref

The first level is very simple, and brings most of the benefits (>90 %). It can be done by 2 simple line changes : replace if (match[ml] == ip[ml]) by if (MEM_read32(match + ml - 3) == MEM_read32(ip + ml - 3)). It's all well contained, trivial to read and maintain, and offers a decent speed benefit at higher compression levels (L8+).

The second level is more complex : it requires to maintain a state, ref, which is updated in one place, and used in another. Moreover, the speed benefits it offers are questionable : it seems positive, over a large enough nb of tests, but it's small enough to be within noise margin. However, the maintenance load is higher, it makes updating this section of the code more difficult in the future. If we were talking about the fast mode (zstd_fast.c), the value of it could be debated, as speed matters a lot. But since we are talking about middle levels of performance, they are all trade-off, and maintenance complexity is part of the trade off. So I would recommend to simply skip this part.

This would make this patch much leaner, essentially a 2-lines change, with no distance interaction, while delivering > 90% of the speed gains. A no brainer for merge.

Cyan4973 avatar Sep 13 '22 17:09 Cyan4973

This would make this patch much leaner, essentially a 2-lines change, with no distance interaction, while delivering > 90% of the speed gains. A no brainer for merge.

Good point @Cyan4973. I agree that the long-distance variables aren't worth the small additional perf win. @JunHe77 can you please update along the lines suggested by myself and @Cyan4973?

embg avatar Sep 14 '22 18:09 embg

@embg , @Cyan4973 , thanks for the comments. Patch has been updated to reflect these.

JunHe77 avatar Sep 18 '22 07:09 JunHe77