aspnetcore
Removing unsafe code in BufferExtensions
Removing unsafe code, using Span to write numbers directly to the buffer
Description
- Removing the `unsafe` keyword required for pinning
- Removing the `fixed` keyword doing the pinning
- Using `Span<T>` to write directly to the buffer
- Adding a test case for values smaller than 1000, 100, and 10
- As a result, for values < 1000 the JIT generates smaller code
- I only seem to find usage of this in `KnownHeaders`'s generated code
Fixes #56534
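For reference, a minimal self-contained sketch of what the Span-based fast paths look like. The method name, the return-count convention, and the absence of the `BufferWriter` / `Advance` plumbing are my simplifications; only `AsciiDigitStart` and the digit arithmetic mirror the snippets in this thread:

```csharp
using System;

static class SpanWriter
{
    private const byte AsciiDigitStart = (byte)'0';

    // Writes values < 1000 using the 1/2/3-digit fast paths; returns the
    // number of ASCII bytes written, or 0 when the caller should take the
    // slow path (e.g. Utf8Formatter.TryFormat).
    public static int WriteSmall(Span<byte> span, uint number)
    {
        if (number < 10 && span.Length >= 1)
        {
            span[0] = (byte)(number + AsciiDigitStart);
            return 1;
        }
        if (number < 100 && span.Length >= 2)
        {
            uint tens = number / 10;
            span[0] = (byte)(tens + AsciiDigitStart);
            span[1] = (byte)(number - tens * 10 + AsciiDigitStart);
            return 2;
        }
        if (number < 1000 && span.Length >= 3)
        {
            uint digit0 = number / 100;   // hundreds digit
            uint digits01 = number / 10;  // first two digits as a number
            span[0] = (byte)(digit0 + AsciiDigitStart);
            span[1] = (byte)(digits01 - digit0 * 10 + AsciiDigitStart);
            span[2] = (byte)(number - digits01 * 10 + AsciiDigitStart);
            return 3;
        }
        return 0;
    }
}
```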
Algorithmically this looks fine. I want to quickly investigate the performance impact of not eliding the bounds check; will do that today. If that turns out to be relevant, there is a third option that removes the fixed and unsafe but still elides the check; checking...
Some speculative benchmarks, just intended to compare the optimized 1-3 character paths, ignoring the custom `BufferWriter<>` etc.:

- `Fixed` = the original code
- `Span` = the code proposed in this PR
- `RefAdd` = the third way I mentioned above
results (code here):
BenchmarkDotNet v0.14.0, Windows 11 (10.0.26120.1350)
AMD Ryzen 9 7900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK 8.0.203
[Host] : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
DefaultJob : .NET 8.0.7 (8.0.724.31311), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
| Method | Number | Mean | Error | StdDev | Median |
|------- |------- |----------:|----------:|----------:|----------:|
| Fixed | 8 | 0.6012 ns | 0.0042 ns | 0.0037 ns | 0.6014 ns |
| Span | 8 | 0.5640 ns | 0.0113 ns | 0.0206 ns | 0.5706 ns |
| RefAdd | 8 | 0.5567 ns | 0.0110 ns | 0.0263 ns | 0.5670 ns |
| Fixed | 16 | 1.0129 ns | 0.0087 ns | 0.0082 ns | 1.0096 ns |
| Span | 16 | 0.7861 ns | 0.0068 ns | 0.0063 ns | 0.7857 ns |
| RefAdd | 16 | 0.7861 ns | 0.0042 ns | 0.0039 ns | 0.7868 ns |
| Fixed | 143 | 1.4058 ns | 0.0130 ns | 0.0122 ns | 1.4102 ns |
| Span | 143 | 1.2048 ns | 0.0153 ns | 0.0143 ns | 1.2042 ns |
| RefAdd | 143 | 1.1970 ns | 0.0100 ns | 0.0089 ns | 1.1949 ns |
I welcome eyeballs / re-runs of these measurements (especially on net9; sorry, I only had net8 to hand), but based on this I'm tempted to say "let's go the third way" (RefAdd); this is justified by the existing checks against the span length. It seems marginally faster than the other two options, avoids explicit fixed / unsafe, and achieves bounds-check elision.
Specifically, this uses managed reference addition:
```csharp
ref byte start = ref MemoryMarshal.GetReference(span);

// ... single-digit branch:
start = (byte)(((uint)number) + AsciiDigitStart);
buffer.Advance(1);

// ... two-digit branch:
start = (byte)(tens + AsciiDigitStart);
Unsafe.AddByteOffset(ref start, 1) = (byte)(val - (tens * 10) + AsciiDigitStart);
buffer.Advance(2);

// ... three-digit branch:
start = (byte)(digit0 + AsciiDigitStart);
Unsafe.AddByteOffset(ref start, 1) = (byte)(digits01 - (digit0 * 10) + AsciiDigitStart);
Unsafe.AddByteOffset(ref start, 2) = (byte)(val - (digits01 * 10) + AsciiDigitStart);
buffer.Advance(3);
```
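To make the fragments above concrete, here is a hypothetical standalone version of the RefAdd shape with the surrounding value/length checks filled in (names and structure are illustrative, not the actual `BufferExtensions` code). The `span.Length` guards are what make the `Unsafe.AddByteOffset` writes provably in bounds:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

static class RefAddWriter
{
    private const byte AsciiDigitStart = (byte)'0';

    // Returns the number of ASCII digit bytes written, or 0 to signal the
    // caller to take the slow path.
    public static int WriteSmall(Span<byte> span, uint val)
    {
        ref byte start = ref MemoryMarshal.GetReference(span);
        if (val < 10 && span.Length >= 1)
        {
            start = (byte)(val + AsciiDigitStart);
            return 1;
        }
        if (val < 100 && span.Length >= 2)
        {
            uint tens = val / 10;
            start = (byte)(tens + AsciiDigitStart);
            Unsafe.AddByteOffset(ref start, 1) = (byte)(val - tens * 10 + AsciiDigitStart);
            return 2;
        }
        if (val < 1000 && span.Length >= 3)
        {
            uint digit0 = val / 100;
            uint digits01 = val / 10;
            start = (byte)(digit0 + AsciiDigitStart);
            Unsafe.AddByteOffset(ref start, 1) = (byte)(digits01 - digit0 * 10 + AsciiDigitStart);
            Unsafe.AddByteOffset(ref start, 2) = (byte)(val - digits01 * 10 + AsciiDigitStart);
            return 3;
        }
        return 0;
    }
}
```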
Thoughts?
I had similar benchmark results; I can share the exact numbers later today, but the Span one was slightly better jitted (the code size was smaller).
Happy to use Unsafe.AddByteOffset; however, its results are within the error margin of Span (correction: marginally better). I can take a look at the jitted code to compare later.
If it is a compromise between JIT size and performance, then my default reaction would be "choose performance"... maybe others disagree, though?
For completeness, I also investigated an Evil approach that used type punning to ushort to reduce two writes to one ushort write in the 2/3-digit cases; it didn't go well, bizarrely impacting even the single-digit bench:
| Method | Number | Mean | Error | StdDev |
|------- |------- |----------:|----------:|----------:|
| Fixed | 8 | 0.6104 ns | 0.0078 ns | 0.0073 ns |
| Span | 8 | 0.5672 ns | 0.0110 ns | 0.0143 ns |
| RefAdd | 8 | 0.5653 ns | 0.0112 ns | 0.0193 ns |
| Evil | 8 | 0.9200 ns | 0.0184 ns | 0.0461 ns |
| Fixed | 16 | 1.0245 ns | 0.0133 ns | 0.0125 ns |
| Span | 16 | 0.7898 ns | 0.0038 ns | 0.0036 ns |
| RefAdd | 16 | 0.7887 ns | 0.0067 ns | 0.0060 ns |
| Evil | 16 | 0.9596 ns | 0.0069 ns | 0.0065 ns |
| Fixed | 143 | 1.4168 ns | 0.0138 ns | 0.0123 ns |
| Span | 143 | 1.2000 ns | 0.0144 ns | 0.0134 ns |
| RefAdd | 143 | 1.1963 ns | 0.0133 ns | 0.0125 ns |
| Evil | 143 | 1.1853 ns | 0.0094 ns | 0.0083 ns |
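For reference, one plausible shape of that ushort punning for the two-digit case; this is my reconstruction, not the exact code that was benchmarked. On little-endian hardware the low byte lands first, so the tens digit goes in the low byte (a real version would mirror the layout on big-endian):

```csharp
using System;
using System.Buffers.Binary;

static class EvilWriter
{
    private const uint AsciiDigitStart = '0';

    // Writes a value in [10, 100) as two ASCII digits with a single
    // ushort store instead of two byte stores.
    public static void WriteTwoDigits(Span<byte> span, uint val)
    {
        uint tens = val / 10;
        uint ones = val - tens * 10;
        // Low byte = tens digit, high byte = ones digit (little-endian order).
        ushort pun = (ushort)((tens + AsciiDigitStart) | ((ones + AsciiDigitStart) << 8));
        BinaryPrimitives.WriteUInt16LittleEndian(span, pun);
    }
}
```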
@mgravell I agree, just curious on the reasons behind it.
These are the results and code on my machine (I dropped the non-fast path branch compared to your tests):
BenchmarkDotNet v0.14.0, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
12th Gen Intel Core i7-1255U, 1 CPU, 12 logical and 10 physical cores
.NET SDK 9.0.100-preview.7.24407.12
[Host] : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2
DefaultJob : .NET 9.0.0 (9.0.24.40507), X64 RyuJIT AVX2
| Method | Value | Mean | Error | StdDev | Code Size | Allocated |
|----------- |------ |----------:|----------:|----------:|----------:|----------:|
| TestUnsafe | 6 | 0.1193 ns | 0.0160 ns | 0.0133 ns | 215 B | - |
| TestSpan | 6 | 0.1114 ns | 0.0105 ns | 0.0093 ns | 186 B | - |
| TestRef | 6 | 0.1130 ns | 0.0133 ns | 0.0125 ns | 186 B | - |
| TestUnsafe | 17 | 0.3625 ns | 0.0168 ns | 0.0149 ns | 212 B | - |
| TestSpan | 17 | 0.1063 ns | 0.0090 ns | 0.0075 ns | 183 B | - |
| TestRef | 17 | 0.1078 ns | 0.0091 ns | 0.0081 ns | 183 B | - |
| TestUnsafe | 123 | 0.6305 ns | 0.0146 ns | 0.0137 ns | 218 B | - |
| TestSpan | 123 | 0.3738 ns | 0.0116 ns | 0.0103 ns | 181 B | - |
| TestRef | 123 | 0.3548 ns | 0.0087 ns | 0.0077 ns | 181 B | - |
EDIT: To me the jitted code for the Span and the AddByteOffset variants looks identical.
I have updated the PR as you suggested.
You both know more about this than I do, but my outside opinion is that the perf looks about the same (between span and addref), so I'd pick whichever results in more readable code.
> perf looks about the same (between span and addref), so I'd pick whichever results in more readable code.
Further: when the aim is to remove unsafe code, we shouldn't move to Unsafe, which isn't any safer than unsafe. In fact, the by-ref arithmetic can lead to hard-to-find-and-debug GC holes (this has happened in the runtime repo a few times), so the move should be to Span<T>.
When there's a perf regression, it should be investigated and fixed (in runtime). Then not only ASP.NET Core but all consumers of .NET can profit from "better span".
@gfoidl @amcasey yes, absolutely the simpler usage should be preferred by default, but in this case I'd argue that the .Length checks justify the limited use of Unsafe, and that this is a pretty common set of elisions for such scenarios. But if people prefer the code as proposed, that's fine too. My views may be biased by absurd library code ;p
Sorry @ladeak - not trying to mess you around with "do, undo" - maybe hold off on editing anything while we reach a consensus, but it sounds like the simpler code might be preferred.
@mgravell , no need to be sorry, I am happy with both ways.
@mgravell I'm in ❤️ with unsafe, hacks, whatever gives perf too, but from a .NET PoV the JIT should be taught to avoid / elide these checks, so that the whole .NET ecosystem can profit from these improvements.
So I'd go with Span and log an issue in runtime so that the perf discrepancy can be investigated and fixed over there. Whether this PR stays on hold until the JIT gets fixed is another story (I'd keep it on hold, so that there's no regression in the shipped product).
So I did a bench with bytesLeftInBlock removed and the span.Length checks moved inline with the if / else if tests (Span2):
| Method | Number | Mean | Error | StdDev |
|------- |------- |----------:|----------:|----------:|
| Fixed | 8 | 0.5952 ns | 0.0042 ns | 0.0040 ns |
| Span | 8 | 0.5462 ns | 0.0108 ns | 0.0272 ns |
| Span2 | 8 | 0.5513 ns | 0.0110 ns | 0.0190 ns |
| Fixed | 16 | 1.0047 ns | 0.0056 ns | 0.0050 ns |
| Span | 16 | 0.7688 ns | 0.0035 ns | 0.0032 ns |
| Span2 | 16 | 0.7709 ns | 0.0036 ns | 0.0032 ns |
| Fixed | 143 | 1.3807 ns | 0.0053 ns | 0.0050 ns |
| Span | 143 | 1.1812 ns | 0.0081 ns | 0.0071 ns |
| Span2 | 143 | 1.1831 ns | 0.0056 ns | 0.0052 ns |
So, not an improvement, but both are better than the original fixed version. Shall we just go with the original PR proposal, with the simple indexed usage?
Reverted to the original proposal.
Is there anything else I may follow up in this PR? cc. @mgravell
@BrennanConroy and @mgravell, I suppose these might be busy times. Is there a blocker for this PR, or are you waiting for the RC release, etc.?
@BrennanConroy could you please review?
/azp run
Azure Pipelines successfully started running 3 pipeline(s).
Thanks @ladeak, nice to see removal of unsafe and a bit of perf improvement at the same time.