BrotliSharpLib icon indicating copy to clipboard operation
BrotliSharpLib copied to clipboard

How do those improvements work?

Open rikimaru0345 opened this issue 6 years ago • 3 comments

Hello, I was looking through the commits and noticed two commits: https://github.com/master131/BrotliSharpLib/commit/1c452a9fea1551c6bf80991636a17443bb13994d and https://github.com/master131/BrotliSharpLib/commit/429a82aeb97ceaf105751e2fed7c4f9ad915df66

The first one changed some code, and I'd like to ask what changes were made here, and more importantly how exactly that improved performance? What was the big bottleneck here? I understand the code (at least somewhat), but the diffs are a bit mangled and hard to follow.

The second one surprises me as well, how were those aggressive inlining attributes identified as harmful to the performance? Did you manually comment them out and re-run the benchmarks? (I doubt it a bit, that'd be a somewhat "blind" approach, no??) Did you inspect the generated asm code and identified some issues there?

Great work on the library! I love it! :)

rikimaru0345 avatar Jul 31 '18 01:07 rikimaru0345

I performed some benchmarks and noticed that the stack allocations were impacting performance. When I use standard arrays which are allocated on the heap, it performed better.

As for removing the aggressive inlining, basically when those functions were inlined, the caller would run out of registers, so heaps of CPU cycles were wasted swapping variables in and out of registers from the stack, rather than doing meaningful calcuations.

So yes, it was a combination of benchmarks and identifying the JITed assembly code.

master131 avatar Jul 31 '18 02:07 master131

@master131 stackallocs can be boosted by up to 30% by disabling zeroing see https://github.com/dotnet/coreclr/issues/1279 .NET Core uses this technique for some libs including mscorlib via an additional linker (ILLINK) step "ClearLocalsInit" (see my tweet https://twitter.com/EgorBo/status/1066662708504903681).

Also the heap allocations can be cached via ArrayPool ;-)

Great work by the way! On my machine it's only 1.5x slower than .NET Core implementation (with 1.0.6 native lib) for level 11 and a 2mb PDF file

EgorBo avatar Dec 23 '18 23:12 EgorBo

Thanks for bringing that to my attention @EgorBo. As an experiment I removed .initlocals and re-instated the stackalloc to see if performance was any better than the v0.3.2.

On .NET Framework 4.5: v0.3.2 had a benchmark time of 1.788s. The initlocals/stackalloc experiment was worse, with a time of 2.172s.

On .NET Core 2.2: v0.3.2 had a benchmark time of 1.625s. The initlocals/stackalloc experiment was better, with a time of 1.594s.

I did also check the JITed assembly to verify that it was not being zero initialized. For simplicity reasons I may just leave the allocation as-is (as the performance increase in .NET Core is negligible) and not mess around with micro-optimisations.

master131 avatar Dec 24 '18 04:12 master131