StringUtilities with xplat-intrinsics (and a little perf-tuning)
Description
Updated some uses of platform-specific intrinsics to the xplat-intrinsics, so that more target platforms (like Arm64) can benefit from vectorization.
For the smaller "helper" methods the isolated benchmarks show an improvement, but I don't expect this to show up in more macro-like benchmarks. The main benefit of that change is being more xplat.
For the public entry-point methods (in the internal type :wink:) the switch to xplat-intrinsics and some further optimizations show quite big improvements.
I don't have a test setup for Arm64 here where I can run benchmarks. In theory, due to the xplat-intrinsics, there should be an improvement, but as said I can't validate this. Is there a way to run benchmarks for Arm64?
Results
machine info
BenchmarkDotNet=v0.13.2, OS=Windows 10 (10.0.19043.1889/21H1/May2021Update)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100-rc.2.22426.5
WidenFourAsciiBytesToUtf16AndWriteToBuffer
| Method | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|-------- |----------:|----------:|----------:|------:|--------:|----------:|
| Default | 0.7093 ns | 0.0142 ns | 0.0133 ns | 1.00 | 0.00 | 44 B |
| PR | 0.6203 ns | 0.0028 ns | 0.0025 ns | 0.87 | 0.02 | 41 B |
Benchmark code
Mostly a dump for myself, the actual code is copy & pasted from main-branch and changed.
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
BenchmarkRunner.Run<Bench>();
[DisassemblyDiagnoser]
public unsafe class Bench
{
private int _value = (byte)'a' | (byte)'b' << 8 | (byte)'c' << 16 | (byte)'d' << 24;
private char* _output;
[GlobalSetup]
public void Setup()
{
_output = (char*)NativeMemory.AllocZeroed(100, sizeof(char));
}
[GlobalCleanup]
public void Cleanup() => NativeMemory.Free(_output);
[Benchmark(Baseline = true, OperationsPerInvoke = 1000)]
public void A()
{
for (int i = 0; i < 1000; ++i)
{
WidenFourAsciiBytesToUtf16AndWriteToBuffer(_output, _value);
}
}
[Benchmark(OperationsPerInvoke = 1000)]
public void B()
{
for (int i = 0; i < 1000; ++i)
{
WidenFourAsciiBytesToUtf16AndWriteToBuffer1(_output, _value);
}
}
private static void WidenFourAsciiBytesToUtf16AndWriteToBuffer(char* output, int value)
{
Vector128<sbyte> vecNarrow = Sse2.ConvertScalarToVector128Int32(value).AsSByte();
Vector128<ulong> vecWide = Sse2.UnpackLow(vecNarrow, Vector128<sbyte>.Zero).AsUInt64();
Unsafe.WriteUnaligned(output, Sse2.X64.ConvertToUInt64(vecWide));
}
private static void WidenFourAsciiBytesToUtf16AndWriteToBuffer1(char* output, int value)
{
Vector128<sbyte> vecNarrow = Vector128.CreateScalarUnsafe(value).AsSByte();
Vector128<ulong> vecWide = Vector128.Widen(vecNarrow).Lower.AsUInt64();
Unsafe.WriteUnaligned(output, vecWide.ToScalar());
}
}
Perf-improvement comes from
vmovd xmm0,r9d
-vxorps xmm1,xmm1,xmm1
-vpunpcklbw xmm0,xmm0,xmm1
+vpmovsxbw xmm0,xmm0
vmovq r9,xmm0
so the gain comes from a more efficient machine instruction in this case. Furthermore, the code size is smaller (by 3 bytes).
WidenFourAsciiBytesToUtf16AndCompareToChars
| Method | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|-------- |----------:|----------:|----------:|------:|--------:|----------:|
| Default | 0.8743 ns | 0.0142 ns | 0.0132 ns | 1.00 | 0.00 | 60 B |
| PR | 0.7800 ns | 0.0114 ns | 0.0107 ns | 0.89 | 0.02 | 57 B |
WidenTwoAsciiBytesToUtf16AndCompareToChars
| Method | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|-------- |----------:|----------:|----------:|------:|--------:|----------:|
| Default | 0.8864 ns | 0.0174 ns | 0.0244 ns | 1.00 | 0.00 | 61 B |
| PR | 0.7780 ns | 0.0096 ns | 0.0089 ns | 0.89 | 0.03 | 58 B |
CheckBytesInAsciiRange
Exactly the same machine code is generated for Avx2 -> Vector256 and Sse2 -> Vector128.
For Vector<T> the PR produces better machine code:
vxorps ymm1,ymm1,ymm1
vpcmpgtb ymm0,ymm0,ymm1
-vpcmpeqd ymm1,ymm1,ymm1
-vpcmpeqq ymm0,ymm0,ymm1
vpmovmskb edx,ymm0
cmp edx,0FFFFFFFF
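The diff above boils down to a single signed comparison: a byte is in the valid range exactly when, interpreted as signed, it is greater than zero. A minimal sketch of how this check can look with the xplat intrinsics (illustrative only, not the actual PR code):

```csharp
using System.Runtime.Intrinsics;

static class AsciiRangeCheckSketch
{
    // All bytes are valid ASCII (range (0x00, 0x80)) exactly when each byte,
    // interpreted as signed, is > 0: one signed compare (vpcmpgtb) rejects
    // both the null byte and every byte with the high bit set.
    public static bool CheckBytesInAsciiRange(Vector128<byte> check)
        => Vector128.GreaterThanAll(check.AsSByte(), Vector128<sbyte>.Zero);
}
```

The JIT lowers `GreaterThanAll` to the compare + movemask + full-mask check shown in the diff, so no per-platform code is needed.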
TryGetAsciiString
- moved to xplat-intrinsics
- dropped the Vector<T> code path (there's no platform Kestrel runs on where neither Vector128 nor Vector256 is accelerated but Vector<T> is)
- processed the remainder in the vectorized path too, which is possible due to idempotence
- switched the layout to have the scalar path (= short inputs) first and the vectorized path below -- in case of longer inputs the jump down to the vectorized path will be amortized by processing more elements at once
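Vectorizing the remainder relies on the operation being idempotent: re-processing bytes that were already handled is harmless, so the last (possibly partial) chunk can be covered by one full vector that ends exactly at the end of the input. A minimal sketch of the pattern (hypothetical helper names, no validation, not the actual PR code):

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class RemainderSketch
{
    // Widens ASCII bytes to chars; assumes source.Length >= Vector128<byte>.Count.
    public static void WidenAsciiToUtf16(ReadOnlySpan<byte> source, Span<char> destination)
    {
        ref byte src = ref MemoryMarshal.GetReference(source);
        ref ushort dst = ref Unsafe.As<char, ushort>(ref MemoryMarshal.GetReference(destination));

        nuint i = 0;
        nuint lastVectorStart = (nuint)(source.Length - Vector128<byte>.Count);

        while (i < lastVectorStart)
        {
            WidenOneVector(ref src, ref dst, i);
            i += (nuint)Vector128<byte>.Count;
        }

        // Remainder: one final, overlapping full vector instead of a scalar
        // tail loop -- some bytes may be widened twice, which is harmless.
        WidenOneVector(ref src, ref dst, lastVectorStart);
    }

    private static void WidenOneVector(ref byte src, ref ushort dst, nuint offset)
    {
        Vector128<byte> narrow = Vector128.LoadUnsafe(ref src, offset);
        (Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(narrow);
        lower.StoreUnsafe(ref dst, offset);
        upper.StoreUnsafe(ref dst, offset + (nuint)Vector128<ushort>.Count);
    }
}
```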
In the bench results the Length is chosen with respect to vectorization. 15 is just below where vectorization kicks in, 31 exercises the Vector128 path just before Vector256 kicks in, and 100 is just an arbitrary length.
| Method | Length | Mean | Error | StdDev | Median | Ratio | RatioSD | Code Size |
|-------- |------- |---------:|----------:|----------:|---------:|------:|--------:|----------:|
| Default | 15 | 4.346 ns | 0.1270 ns | 0.2050 ns | 4.357 ns | 1.00 | 0.00 | 446 B |
| PR | 15 | 4.313 ns | 0.1121 ns | 0.1049 ns | 4.286 ns | 0.97 | 0.05 | 526 B |
| | | | | | | | | |
| Default | 31 | 5.590 ns | 0.1270 ns | 0.1126 ns | 5.581 ns | 1.00 | 0.00 | 446 B |
| PR | 31 | 3.452 ns | 0.1069 ns | 0.1844 ns | 3.364 ns | 0.64 | 0.04 | 526 B |
| | | | | | | | | |
| Default | 100 | 7.579 ns | 0.1854 ns | 0.3390 ns | 7.582 ns | 1.00 | 0.00 | 446 B |
| PR | 100 | 6.740 ns | 0.1421 ns | 0.1330 ns | 6.672 ns | 0.92 | 0.05 | 526 B |
benchmark code
Mostly a dump for myself, the actual code is copy & pasted from main-branch and changed.
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;
Bench bench = new();
bench.Setup();
Console.WriteLine(bench.Default());
string s0 = bench.GetString();
bench.Cleanup();
bench.Setup();
Console.WriteLine(bench.PR());
string s1 = bench.GetString();
bench.Cleanup();
Console.WriteLine(s0);
Console.WriteLine(s1);
Console.WriteLine(s0 == s1);
#if !DEBUG
BenchmarkDotNet.Running.BenchmarkRunner.Run<Bench>();
#endif
[DisassemblyDiagnoser]
[ShortRunJob]
public unsafe class Bench
{
private byte* _input;
private char* _output;
private int _count;
internal string GetString() => new string(_output);
[Params(15, 31, 100)]
public int Length { get; set; } = 32 + 16 + 15;
[GlobalSetup]
public void Setup()
{
_count = this.Length;
_input = (byte*)NativeMemory.Alloc((uint)this.Length);
_output = (char*)NativeMemory.AllocZeroed((uint)this.Length, sizeof(char));
if (_count < 64)
{
int idx = 0;
for (int i = 'A'; i < 'A' + _count; ++i)
{
_input[idx++] = (byte)i;
}
_input[0] = (byte)'£';
_input[_count - 1] = (byte)'£';
}
else
{
new Span<byte>(_input, _count).Fill((byte)'a');
}
}
[GlobalCleanup]
public void Cleanup()
{
NativeMemory.Free(_input);
NativeMemory.Free(_output);
}
[Benchmark(Baseline = true)]
public bool Default() => StringUtilities.TryGetLatin1String(_input, _output, _count);
[Benchmark]
public bool PR() => StringUtilities1.TryGetLatin1String(_input, _output, _count);
}
The improvement for length 31 is expected (due to processing the remainder vectorized). The improvement for length 100 I didn't expect, but maybe it comes from the smaller loop bodies (which are almost always better).
Note: I checked the numbers after just switching to xplat-intrinsics (before the further optimizations) and the results were within noise, so on par.
TryGetLatin1String
- unified the code with TryGetAsciiString by using new C# goodness in the form of static abstract interface members
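A minimal sketch of how such a unification can look with static abstract interface members (all names below are illustrative, not the actual PR code): the per-encoding validation is a type parameter, so the JIT specializes the one shared algorithm per encoding with no virtual dispatch.

```csharp
using System.Runtime.Intrinsics;

// Per-encoding "policy" supplied as a type parameter.
interface ICharValidator
{
    static abstract bool IsValid(Vector128<byte> vec);
}

struct AsciiValidator : ICharValidator
{
    // Valid: (0x00, 0x80) -- one signed compare rejects null and high-bit bytes.
    public static bool IsValid(Vector128<byte> vec)
        => Vector128.GreaterThanAll(vec.AsSByte(), Vector128<sbyte>.Zero);
}

struct Latin1Validator : ICharValidator
{
    // Valid: (0x00, 0x100) -- only the null byte is invalid.
    public static bool IsValid(Vector128<byte> vec)
        => !Vector128.EqualsAny(vec, Vector128<byte>.Zero);
}

static class StringUtilitiesSketch
{
    // One shared core; TValidator selects the encoding-specific check.
    public static bool IsValidChunk<TValidator>(Vector128<byte> chunk)
        where TValidator : ICharValidator
        => TValidator.IsValid(chunk);
}
```

TryGetAsciiString and TryGetLatin1String can then call the same core with AsciiValidator or Latin1Validator respectively.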
| Method | Length | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|-------- |------- |----------:|---------:|----------:|------:|--------:|----------:|
| Default | 15 | 7.715 ns | 7.250 ns | 0.3974 ns | 1.00 | 0.00 | 473 B |
| PR | 15 | 4.315 ns | 1.120 ns | 0.0614 ns | 0.56 | 0.03 | 511 B |
| | | | | | | | |
| Default | 31 | 12.317 ns | 9.691 ns | 0.5312 ns | 1.00 | 0.00 | 473 B |
| PR | 31 | 4.098 ns | 7.102 ns | 0.3893 ns | 0.33 | 0.02 | 511 B |
| | | | | | | | |
| Default | 100 | 8.343 ns | 9.522 ns | 0.5219 ns | 1.00 | 0.00 | 473 B |
| PR | 100 | 8.226 ns | 3.028 ns | 0.1660 ns | 0.99 | 0.08 | 511 B |
BytesOrdinalEqualsStringAndAscii
More or less a re-write with inspiration from existing code :wink:.
| Method | Length | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
|-------- |------- |---------:|----------:|----------:|------:|--------:|----------:|
| Default | 15 | 5.534 ns | 0.1498 ns | 0.1665 ns | 1.00 | 0.00 | 474 B |
| PR | 15 | 5.234 ns | 0.1425 ns | 0.1399 ns | 0.94 | 0.04 | 653 B |
| | | | | | | | |
| Default | 31 | 8.812 ns | 0.1948 ns | 0.2793 ns | 1.00 | 0.00 | 474 B |
| PR | 31 | 6.478 ns | 0.1696 ns | 0.3227 ns | 0.75 | 0.05 | 653 B |
| | | | | | | | |
| Default | 100 | 9.238 ns | 0.2090 ns | 0.3064 ns | 1.00 | 0.00 | 474 B |
| PR | 100 | 8.890 ns | 0.1388 ns | 0.1159 ns | 0.99 | 0.03 | 653 B |
Thanks for your PR, @gfoidl. Someone from the team will get assigned to your PR shortly and we'll get it reviewed.
@tannergooding I'd be happy if you could have a look at the use of xplat-intrinsics here. Thanks in advance!
@tannergooding Do you think you'll have a chance to take a look at this PR?
Possibly this weekend, but if I don't get to it by Monday then likely not until the week of the 24th.
I'm in the middle of buying a house so have been pretty busy the past few weeks. Closing on the 17th, so spending the next couple packing/moving.
@stephentoub and @adamsitnik have both been heavily involved in porting SIMD code in the BCL to use the xplat APIs so they might be good people to look as well.
My hope actually is that this code completely goes away, with the new APIs being added in https://github.com/dotnet/runtime/pull/75012. Maybe we could focus on making those as efficient as possible and ensuring they're sufficient to then delete this code? (Regardless, thanks for your tireless efforts on this @gfoidl.)
Maybe we could focus on making those as efficient as possible and ensuring they're sufficient to then delete this code?
Next week I'll scan these codes and collect what APIs are missing so that this goal can be achieved. Without having had a look yet, just off the top of my head, the use cases here have some special cases, so the ASCII APIs alone aren't sufficient -- but as said, I need to have a proper look.
Thanks. Whether it's these APIs or other APIs, I'd like to get to a point where we have enough general-purpose vectorized helpers in the core libraries in .NET 8 that dotnet/aspnetcore doesn't need to do any additional vectorization beyond using those APIs. That's my north star, at least, and if there's something that will prevent us from getting there, I'm very interested in understanding what it is.
Thanks @stephentoub, that sounds like a good plan.
@gfoidl thank you as always for the great work you do, especially in the performance space!
@gfoidl Have you had a chance to review and come up with a plan on what to do next for this?
For "I'll scan these codes and collect what APIs are missing so that this goal can be achieved." unfortunately I haven't had time so far, but it's scheduled as the next item on my agenda.
Sure, no rush. Just wanted to check if you'd already done it.
Is there a (reasonably easy) way to build aspnetcore on top of a custom runtime build? I'd like to build runtime based on https://github.com/dotnet/runtime/pull/75012, then use that build within aspnetcore for easier evaluation of what's possible, what's missing, etc.
Edit: copy & pasted the files manually in order to work with them.
Let the 🌟 shine 😃 (/cc: @stephentoub)
How it could look when StringUtilities is based on the runtime building-blocks w/o custom code here: https://github.com/dotnet/aspnetcore/compare/d2a1c23d90d4e69df47616b59b7215b3da8cc9f6...25c56209d2e3e8b9f8922b260aff5d2271f5d021
I didn't measure perf, just made it work and verified that the tests are passing.
What's missing to achieve this?
ASCII
In StringUtilities, ASCII values in the range (0x00, 0x80) are considered valid.
Ascii.ToUtf16 treats the whole ASCII range [0x00, 0x80) as valid.
That's why in the PoC (diff above) I've put https://github.com/dotnet/aspnetcore/blob/25c56209d2e3e8b9f8922b260aff5d2271f5d021/src/Shared/ServerInfrastructure/StringUtilities.cs#L94-L98 in place to check for this and make the tests pass. This isn't ideal, as we now have an $O(2n)$ operation where it could be $O(n)$.
Thus something like
namespace System.Buffers.Text
{
public static class Ascii
{
// existing methods
- public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten);
+ public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool treatNullAsInvalid = false);
}
}
is needed.
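For illustration, without such a flag the null check has to be a separate pass over the input. A hedged sketch using the ToUtf16 shape shown above (TryGetAsciiString here is a hypothetical wrapper, not the PoC code):

```csharp
using System;
using System.Buffers;
using System.Buffers.Text;

static class NullCheckSketch
{
    public static bool TryGetAsciiString(ReadOnlySpan<byte> source, Span<char> destination)
    {
        // First pass (O(n)): StringUtilities treats 0x00 as invalid,
        // Ascii.ToUtf16 does not, so it must be scanned for separately.
        if (source.IndexOf((byte)0) >= 0)
        {
            return false;
        }

        // Second pass (O(n)): the actual (vectorized) widening.
        OperationStatus status = Ascii.ToUtf16(source, destination, out _, out _);
        return status == OperationStatus.Done;
    }
}
```

With the proposed treatNullAsInvalid flag both checks would collapse into the single vectorized pass.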
Latin1
I don't know how hot Latin1 is here, but as it's special-cased in https://github.com/dotnet/aspnetcore/blob/d3259f92851e4772d3230177be5b71be20d3ff6d/src/Servers/Kestrel/Core/src/Internal/Infrastructure/HttpUtilities.cs#L140-L143 I think it's hot enough to be optimized. Besides that, the standard Encoding.Latin1 can't be used on its own, as 0x00 is considered invalid.
Thus basically the same as for ASCII above applies, i.e.
namespace System.Buffers.Text
{
+ public static class Latin1
+ {
+ // other methods similar as Ascii?
+ public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int bytesConsumed, out int charsWritten, bool treatNullAsInvalid = false);
+ }
}
If the type Latin1 seems too heavy, too niche, whatever, as an alternative one could use something like https://github.com/dotnet/aspnetcore/compare/25c56209d2e3e8b9f8922b260aff5d2271f5d021...e3afae2dcf436dfe357b5a961f0184e51681325e where Latin1 bytes are expanded to UTF-16 via Ascii.ToUtf16 and, if a non-ASCII byte is met, the remainder is done in a scalar way. Though this is a naive approach which should be perf-tested -- I don't have numbers on how likely Latin1 inputs in the range [0x80, 0xFF] are, but if they are rare then that (simple) approach should be good enough.
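That naive approach could be sketched like this (hypothetical names, again using the proposed ToUtf16 shape from above; null-byte handling omitted for brevity):

```csharp
using System;
using System.Buffers.Text;

static class Latin1FallbackSketch
{
    // ASCII fast path first; when a non-ASCII byte stops it, the rest of the
    // Latin1 input is widened in a scalar loop.
    public static void WidenLatin1ToUtf16(ReadOnlySpan<byte> source, Span<char> destination)
    {
        // Vectorized path covers the (hopefully dominant) ASCII prefix;
        // bytesConsumed reports where the first non-ASCII byte stopped it.
        Ascii.ToUtf16(source, destination, out int bytesConsumed, out _);

        // Scalar remainder: Latin1 maps 1:1 onto the first 256 UTF-16 code points.
        for (int i = bytesConsumed; i < source.Length; i++)
        {
            destination[i] = (char)source[i];
        }
    }
}
```

This is only fast when Latin1-only bytes are rare, which is exactly the open perf question above.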
Strategy
- Finish https://github.com/dotnet/runtime/pull/75012
- Adjust Ascii.ToUtf16 to optionally treat \0 as invalid
- Use Ascii.ToUtf16 here also for Latin1 (replacing this PR)
- Benchmark
- (optional) Feedback to runtime
- (optional) Propose special Latin1.ToUtf16
Thanks, @gfoidl.
@adamsitnik, @GrabYourPitchforks, see above.
@gfoidl Can this PR be closed now?
Why close it? https://github.com/dotnet/runtime/issues/28230 is done, but no new APIs from there are used in this repo. In order to use them here the points from https://github.com/dotnet/aspnetcore/pull/44040#issuecomment-1289517635 need to be addressed.
How do we move forward?
Should we open API requests in the runtime repo for this (excluding \0)? And what about Latin1?
@gfoidl I was quickly glancing through older PRs and interpreted the strategy above to mean that the next step is a new PR.
We could keep this open while the work is being done, but perhaps it would be better to track the remaining work in an issue(s) and continue the discussion there. That's probably the best way to move forward, especially since the next step sounds like an API proposal for excluding \0.
Thanks!
Created https://github.com/dotnet/aspnetcore/issues/45962 to move forward. This PR can be closed, as it would only move from hw-intrinsics to xplat-intrinsics, which in itself is an improvement, but for the .NET 8 timeframe we can base StringUtilities on APIs provided by the core libraries.
Hi @gfoidl. It looks like you just commented on a closed PR. The team will most probably miss it. If you'd like to bring something important up to their attention, consider filing a new issue and add enough details to build context.