MessagePack-CSharp icon indicating copy to clipboard operation
MessagePack-CSharp copied to clipboard

Perf: Use `Span.CopyTo<Byte>()` on .NET 7 and above.

Open jkrejcha opened this issue 10 months ago • 6 comments

This changes UnsafeMemory to use the safe equivalents in .NET 7 and above as they are more performant than the equivalent code and can take advantage of vectorization on hardware that supports it

This optimization exists on at least .NET 7.

jkrejcha avatar Mar 02 '25 08:03 jkrejcha

Thanks for this. But we don't need the #if I think. We only target .NET 8 at a minimum at this point, and .NET Standard 2.0 / .NET Framework, which also has these APIs through our reference to System.Memory.

Can you just replace the old code with the new code?

And then we should definitely get @neuecc's review (and maybe @pCYSl5EDgo) because they've done a lot of micro-optimization work in this library.

AArnott avatar Mar 02 '25 22:03 AArnott

Alright. It looks like .NET 7 and below doesn't have the optimization anyway potentially (it replaces with a call to Buffer.Memmove, indirection to GetSpan removed for sake of example).

Output on .NET 8

A.UnsafeMemory64:WriteRaw31_New(System.Span`1[ubyte],System.ReadOnlySpan`1[ubyte]) (FullOpts):
       push     rbp
       vzeroupper 
       mov      rbp, rsp
       cmp      ecx, 31
       jb       SHORT G_M17461_IG05
       cmp      esi, 31
       jb       SHORT G_M17461_IG06
       vmovdqu  xmm0, xmmword ptr [rdx]
       vmovdqu  xmm1, xmmword ptr [rdx+0x0F]
       vmovdqu  xmmword ptr [rdi], xmm0
       vmovdqu  xmmword ptr [rdi+0x0F], xmm1
       pop      rbp
       ret      
G_M17461_IG05:  ;; offset=0x0025
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
G_M17461_IG06:  ;; offset=0x002C
       call     [System.ThrowHelper:ThrowArgumentException_DestinationTooShort()]
       int3     

vs. .NET 7

A.UnsafeMemory64:WriteRaw31_New(System.Span`1[ubyte],System.ReadOnlySpan`1[ubyte]):
       push     rbp
       mov      rbp, rsp
       cmp      ecx, 31
       jb       SHORT G_M17461_IG04
       cmp      esi, 31
       jb       SHORT G_M17461_IG05
       mov      rsi, rdx
       mov      edx, 31
       call     [System.Buffer:Memmove(byref,byref,ulong)]
       nop      
       pop      rbp
       ret      
G_M17461_IG04:
       call     [System.ThrowHelper:ThrowArgumentOutOfRangeException()]
       int3     
G_M17461_IG05:
       call     [System.ThrowHelper:ThrowArgumentException_DestinationTooShort()]
       int3 
``` which still may be faster but hard to tell for sure without testing more thoroughly.

jkrejcha avatar Mar 03 '25 02:03 jkrejcha

Thanks. Two more thoughts.

So long as you're looking at native assembly emitted, can you look at .NET Framework? That's important too.

If this really is going to be the way to go, I guess we can get rid of all the many methods that you're changing too, as they were named and implemented specially for the length of the span, and that evidently isn't necessary any more since their implementations are now identical.

But feel free to wait for @neuecc's take before doing more work, since he may know something about why this isn't doable in the first place. Unity is another target runtime and it may be important there that the code remains as-is, for example.

AArnott avatar Mar 03 '25 12:03 AArnott

This seems like a remnant from an era before Span existed. I think the original original received byte[]. In other words, I think the changes are fine.

neuecc avatar Mar 04 '25 09:03 neuecc

In the end, it goes to SpanHelpers.memmove? https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.ByteMemOps.cs,36 The original goal is that if we know the length in advance, we can avoid branching. The code size also becomes smaller, which should allow for inlining. There's no room for vectorization when dealing with small data. Instead of vectorizing, we are unrolling with a fixed approach However, since memmove is intrinsic, would its behavior in .NET differ from the code's appearance? It's difficult to immediately determine whether it's better than a handwritten implementation (although not needing "fixed" is good).

neuecc avatar Mar 04 '25 10:03 neuecc

Minimal benchmarking shows that Span is better than the original. However, manual fixed value optimisation using Unsafe.WriteUnaligned was even better.

neuecc avatar Mar 04 '25 12:03 neuecc