
Some optimization ideas on .NET Standard/Core w/ System.Memory

Open Sergio0694 opened this issue 4 years ago • 16 comments

Hi, I discovered this library the other day during a Twitch stream, and I want to say amazing work with this project, it looks great! I spent a bit of time poking around the codebase and I really appreciate how well organized it is, and all the small tricks you're using all over the place (eg. BitVector128 comes to mind), so props to all the maintainers for the great work! 👏

I'm opening this issue because I was wondering whether you'd consider taking a dependency on System.Memory on .NET Standard 2.0 (and .NET Core 2.1 too, why not) to enable additional optimizations. Specifically, I noticed that a lot of helper methods in the CharHelper class contain a fair bit of branching, which could be removed almost entirely by leveraging the C# 7.3 optimization where ReadOnlySpan<byte> properties wrapping constant arrays are compiled down to constant data embedded directly in the assembly.
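For reference, the pattern that triggers that compiler optimization is just a get-only ReadOnlySpan<byte> property over a constant array; here is a minimal sketch with made-up names, just to show the shape:

// Since C# 7.3, Roslyn emits the constant initializer below as a static data
// blob baked into the assembly, and the property getter just wraps a pointer
// to that data: no byte[] allocation, no static constructor.
private static ReadOnlySpan<byte> Lookup => new byte[] { 0, 1, 0, 1 };

// For comparison, this version allocates an actual byte[] on the managed heap
// when the declaring type is initialized:
private static readonly byte[] LookupArray = { 0, 1, 0, 1 };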

I'm sure other optimizations would become possible as well once Span<T> and the Unsafe APIs are available; this is just the first one that came to mind. As an example, consider this method:

https://github.com/lunet-io/markdig/blob/abc8aa25b72f1546ae625048bc35e6392fbba143/src/Markdig/Helpers/CharHelper.cs#L322-L362
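For readers who don't want to click through: that method is essentially one big switch over the ASCII punctuation characters, roughly like the following (paraphrased from the CommonMark punctuation set, not the exact Markdig source):

internal static bool IsAsciiPunctuation_Slow(this char c)
{
    // Returns true for CommonMark's ASCII punctuation characters
    switch (c)
    {
        case '!': case '"': case '#': case '$': case '%': case '&': case '\'':
        case '(': case ')': case '*': case '+': case ',': case '-': case '.':
        case '/': case ':': case ';': case '<': case '=': case '>': case '?':
        case '@': case '[': case '\\': case ']': case '^': case '_': case '`':
        case '{': case '|': case '}': case '~':
            return true;
        default:
            return false;
    }
}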

On .NET Core 3.1 x64, it's JITted to this assembly:

C.IsAsciiPunctuation_Slow(Char)
    L0000: movzx eax, cx
    L0003: lea edx, [rax-0x21]
    L0006: cmp edx, 0x1f
    L0009: ja L0023
    L000b: mov eax, edx
    L000d: lea rdx, [rip+0x2c]
    L0014: mov edx, [rdx+rax*4]
    L0017: lea rcx, [rip-0x1e]
    L001e: add rdx, rcx
    L0021: jmp rdx
    L0023: lea edx, [rax-0x5b]
    L0026: cmp edx, 0x5
    L0029: jbe L0033
    L002b: add eax, 0xffffff85
    L002e: cmp eax, 0x3
    L0031: ja L0039
    L0033: mov eax, 0x1
    L0038: ret
    L0039: xor eax, eax
    L003b: ret

The switch in C# is converted into just 3 jump tables in IL, so the final assembly only contains 3 jumps, but still. As a proof of concept, here's the same method using a lookup table:

private static ReadOnlySpan<byte> AsciiPunctuationMap => new byte[]
{
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x00-0x0F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x10-0x1F
    0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x20-0x2F: '!' to '/'
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, // 0x30-0x3F: ':' to '?'
    1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x40-0x4F: '@'
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, // 0x50-0x5F: '[' to '_'
    1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x60-0x6F: '`'
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0  // 0x70-0x7F: '{' to '~'
};

internal static bool IsAsciiPunctuation(this char c)
{
    // Turn the range check into a 0/1 flag instead of a branch
    bool isInRange = c <= 127;
    byte rangeFlag = Unsafe.As<bool, byte>(ref isInRange);
    int
        negativeFlag = rangeFlag - 1, // 0 when in range, -1 (all bits set) otherwise
        mask = ~negativeFlag,         // all bits set when in range, 0 otherwise
        offset = c & mask;            // c when in range, 0 otherwise
    ref byte r0 = ref MemoryMarshal.GetReference(AsciiPunctuationMap);
    byte r1 = Unsafe.Add(ref r0, offset);
    bool found = Unsafe.As<byte, bool>(ref r1);

    return found;
}

Which again on .NET Core 3.1 x64 is JITted to this assembly:

C.IsAsciiPunctuation(Char)
    L0000: movzx eax, cx
    L0003: cmp eax, 0x7f
    L0006: setle dl
    L0009: movzx edx, dl
    L000c: dec edx
    L000e: not edx
    L0010: and eax, edx
    L0012: movsxd rax, eax
    L0015: mov rdx, 0x1ff23aa0b20
    L001f: movzx eax, byte [rax+rdx]
    L0023: ret

Shorter code overall, and no conditional branches at all 😄

The same exact approach could be replicated for many other similar helpers, just with a separate lookup table. Here is a sharplab.io repro with both versions of that method, for comparison.

If it helps, I'm also adding that clamp-and-index logic for lookup tables as an extension in the HighPerformance package of the Microsoft.Toolkit library. One option could be to take a dependency on that package once it ships, so you wouldn't even need to add that code here: just the lookup table and two lines to access it, which would help maintainability.
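To give a rough idea, the call site could then look something like this, reusing the AsciiPunctuationMap property from the snippet above (the extension name here is just a placeholder for that clamping helper, the final API in the Toolkit might end up different):

// Hypothetical sketch: DangerousGetLookupReferenceAt is a placeholder name for
// the clamping extension described above, not a confirmed API. The idea is that
// it returns the element at the given index when it's in range, and the first
// element of the table (which is 0 here) otherwise, all without branching.
internal static bool IsAsciiPunctuation(this char c)
{
    byte flag = AsciiPunctuationMap.DangerousGetLookupReferenceAt(c);

    return Unsafe.As<byte, bool>(ref flag);
}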

Just thought I'd share this idea, congrats again for the great work here! Cheers! 🍻

Sergio0694 · Mar 27 '20

Hey, thanks for looking at this, more perf is always welcome! Thanks for sharing the stream and looking at this project as well, I'll try to tune in next time.

For the specific scenario, I would recommend reading a blog post by Egor https://egorbo.com/llvm-range-checks.html.

I have applied that logic (pretty great codegen + no lookup table) to Markdig's CharHelper methods, but found there was effectively no measurable overall throughput improvement when parsing, so I decided not to make the change, in favour of readability. Ideally, such an optimization would be applied by the JIT (you likely already get it when using Mono with LLVM).

Regarding Span/Memory, it's more complicated. I have converted some places to use Span (with conditional compilation for netcore) and would love to convert more, as that could simplify a lot of parsing code, save some substring allocations, and get better codegen from the JIT.

The problem is that we cross-target .NET Framework and .NET Standard 2.0, and I would like to avoid too much code divergence behind preprocessor directives: for example, duplicated helper methods for each input type (StringSlice, char*, Span...).
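To illustrate the kind of duplication I mean, here is a made-up helper (not an actual Markdig method, and the #if symbol would depend on how the targets end up being set up):

#if NETCOREAPP
// Span-based overload, only available on the newer targets
public static bool IsBlankLine(ReadOnlySpan<char> line)
{
    return line.Trim().IsEmpty;
}
#endif

// String-based overload that the older targets still need
public static bool IsBlankLine(string line)
{
    return string.IsNullOrWhiteSpace(line);
}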

Given that the current version of Markdig is very stable, I would be okay with leaving Framework/Desktop Mono on the current version and dropping many build targets for newer versions. I don't know what @xoofx's opinion on that is.

MihaZupan · Mar 27 '20

@MihaZupan Alright, I was familiar with some of Egor's work but I had completely missed that blog post, that's really next level! Thanks for sharing, I definitely learnt something new today! 😄

I wonder whether it'd be possible to combine that with the trick I'm using in my extension to remove even the last conditional branch left in Egor's solution, which should improve performance a bit when the input data is fairly randomized. At least, that was the reason I came up with it in my own extension instead of just checking the range and jumping: I noticed that removing branches entirely produced a nice speedup during profiling 🤔

Or actually, on second thought, considering this code:

bool IsReservedCharacter(char c)
{
    c = (char)(c - 36);
    if (c > 28)
        return false;
    return ((314575237 >> c) & 1) == 1;
}

And this map:

00010010110000000000100110000101 = 314575237
   |  | ||          |  ||    | |
   @  = ;:          /  ,+    & $

Couldn't that if statement be skipped entirely with no difference in semantics, since higher values would just zero-out that constant during the right shift anyway? I guess it's there for clarity and/or because depending on the input data that single branch could be faster than the other ops?

Anyway, thank you a lot for your feedback and your insights! I'm looking forward to seeing how this project moves forward, and to possibly see it starting to be integrated into the Microsoft.Toolkit library as well in the coming months (see https://github.com/windows-toolkit/WindowsCommunityToolkit/issues/3200 for the roadmap on that). That idea of dropping some legacy frameworks to be able to rely more on the new APIs in .NET Standard sounds great, so I'm curious to see what you guys decide on that front too 😊

Sergio0694 · Mar 28 '20

since higher values would just zero-out that constant during the right shift anyway?

Sadly not: for 32-bit operands, C# masks the shift count to its lowest 5 bits (and to 6 bits for 64-bit operands), so 2 >> 32 is the same as 2 >> 0. I don't know if there's some other trick that could be done to improve this at the assembly level.

2 >> 0        // 2
2 >> 1        // 1
2 >> 2        // 0
2 >> 3        // 0
2 >> 31       // 0
2 >> (0 + 32) // 2
2 >> (1 + 32) // 1
2 >> (2 + 32) // 0
2 >> (3 + 32) // 0

I am not familiar with that project. Is there something specific the parser & renderer are doing that would prevent users from switching to Markdig directly?

MihaZupan · Mar 28 '20

I noticed removing branches entirely produced a nice speedup during profiling 🤔

Profiling a microbenchmark or profiling a Markdig parse operation?

MihaZupan · Mar 28 '20

Ok, I have to say I'm extremely surprised to discover that the right shift doesn't work the way I was expecting there; I had completely missed that the shift count gets masked like that, so I'll have to do some reading on it 😅

As for the non-branching version then, I guess it should be something like this?

static bool IsReservedCharacter(char c)
{        
    int i = c - 36;
    bool isInRange = i <= 28;
    byte rangeFlag = Unsafe.As<bool, byte>(ref isInRange);
    int
        negativeFlag = rangeFlag - 1,
        mask = ~negativeFlag,
        shift = (314575237 >> i) & 1,
        and = shift & mask;
    byte result = unchecked((byte)and);
    bool valid = Unsafe.As<byte, bool>(ref result);

    return valid;
}

Which yields (here is on sharplab.io):

C.IsReservedCharacter(Char)
    L0000: movzx ecx, cx
    L0003: add ecx, 0xffffffdc
    L0006: cmp ecx, 0x1c
    L0009: setle al
    L000c: movzx eax, al
    L000f: dec eax
    L0011: not eax
    L0013: mov edx, 0x12c00985
    L0018: sar edx, cl
    L001a: and edx, 0x1
    L001d: and eax, edx
    L001f: ret

I've basically just combined my trick to clamp the index without branching, with the bit table from Egor's blog post. I'd be curious to see how this would fare in a benchmark 🤔

As for the Microsoft.Toolkit, users are technically already able to integrate something like Markdig manually, but they'd need to spend quite a bit of time creating the wrapper classes for the infrastructure that the MarkdownTextBlock control then uses to render the parsed markdown (here is the current markdown parser, and here is the control). The idea would be to provide a built-in parser powered by Markdig, exposing classes that the existing MarkdownTextBlock can consume directly, and to make it the default parser for the control as well, as a dependency.

As for your last question, I just meant in microbenchmarks 😄

Sergio0694 · Mar 28 '20

Ok I have to say I'm extremely surprised to discover that the right shift isn't working as I was expecting

It surprised me too when looking into this bitfield trick.

I'd be curious to see how this would fare in a benchmark

Would be worth retesting, especially if the JIT can inline some of this. I think you just filled my weekend schedule 😄

MihaZupan · Mar 28 '20

Yeah the resulting code seems small enough to be nicely inlined, plus the codegen is surprisingly clean despite the (arguably) ugly looking C# code with all those various hacks going on there 😆

Ahahahah sorry to hear about your weekend schedule, let me know how it goes though!

Sergio0694 · Mar 28 '20

Given that the current version of Markdig is very stable, I would be okay with leaving Framework/Desktop Mono with the current version and drop many build targets for newer versions. I don't know what @xoofx's opinion on that is.

Yeah, I think it's time for Markdig to switch to these new Span APIs. It will make it possible to optimize a few more cases (especially lookup tables through ReadOnlySpan). PR welcome.

xoofx · Mar 28 '20

@xoofx That's great to hear! Markdig is already quite efficient judging by those benchmarks you have in the readme, and it'll only get better from there! Off topic: your starklang project is incredible! 😄

@MihaZupan So, I've run a few benchmarks on the various methods we discussed as I found the whole thing quite interesting, and I have to say I'm now even more puzzled than I was before 🤔

I set up a few different methods to benchmark (omitting the inlining attributes here for brevity):

IsReservedCharacter_Jumps (classic, individual checks)
public static bool IsReservedCharacter_Jumps(char character)
{
    return character == '>'
            || character == '<'
            || character == '+'
            || character == '-'
            || character == '['
            || character == ']'
            || character == '('
            || character == ')'
            || character == ':'
            || character == '.'
            || character == ',';
}
IsReservedCharacter_EgorBo (EgorBo's implementation)
public static bool IsReservedCharacter_EgorBo(char c)
{
    c = (char)(c - 40);

    if (c > 53)
        return false;

    return ((11258999073931387ul >> c) & 1) == 1;
}
IsReservedCharacter_EgorBo_2 (same, but skipping the final test/setnz/movzx)
public static bool IsReservedCharacter_EgorBo_2(char c)
{
    c = (char)(c - 40);

    if (c > 53)
        return false;

    byte flag = unchecked((byte)((11258999073931387ul >> c) & 1));
    return Unsafe.As<byte, bool>(ref flag);
}
IsReservedCharacter_Trick (like EgorBo's, but with my trick to skip the jump)
public static bool IsReservedCharacter_Trick(char c)
{
    int i = c - 40;

    bool isInRange = (uint)i <= 63;
    byte byteFlag = Unsafe.As<bool, byte>(ref isInRange);
    int
        negativeFlag = byteFlag - 1,
        mask = ~negativeFlag,
        shift = unchecked((int)((11258999073931387ul >> i) & 1)),
        and = shift & mask;
    byte result = unchecked((byte)and);
    bool valid = Unsafe.As<byte, bool>(ref result);

    return valid;
}
IsReservedCharacter_Table (static lookup table, no branch)
private static ReadOnlySpan<byte> CharactersTable => new byte[]
{
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x00-0x0F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x10-0x1F
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, // 0x20-0x2F: '(' ')' '+' ',' '-' '.'
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, // 0x30-0x3F: ':' '<' '>'
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x40-0x4F
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1        // 0x50-0x5D: '[' ']'
};

[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsReservedCharacter_Table(char c)
{
    bool isInRange = c < CharactersTable.Length;
    byte rangeFlag = Unsafe.As<bool, byte>(ref isInRange);
    int
        negativeFlag = rangeFlag - 1,
        mask = ~negativeFlag,
        offset = c & mask;
    ref byte r0 = ref MemoryMarshal.GetReference(CharactersTable);
    byte r1 = Unsafe.Add(ref r0, offset);
    bool found = Unsafe.As<byte, bool>(ref r1);

    return found;
}
IsReservedCharacter_Table_Jump (as above, but with range jump)
internal static bool IsReservedCharacter_Table_Jump(char c)
{
    if (c >= CharactersTable.Length) return false;

    ref byte r0 = ref MemoryMarshal.GetReference(CharactersTable);
    byte r1 = Unsafe.Add(ref r0, c);
    bool found = Unsafe.As<byte, bool>(ref r1);

    return found;
}

You can find the full benchmark (using BenchmarkDotNet) ready to go in this gist.
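In short, each Count_* benchmark just walks a char buffer of the given Length and counts the matches, along these lines (a simplified sketch, not the exact code from the gist; the buffer contents and setup there may differ):

using System;
using BenchmarkDotNet.Attributes;

public class CharacterBenchmark
{
    private char[] text;

    [Params(128, 512, 1024, 4096)]
    public int Length { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        // Fill the buffer with printable ASCII characters
        var random = new Random(42);
        text = new char[Length];

        for (int i = 0; i < Length; i++)
        {
            text[i] = (char)random.Next(32, 127);
        }
    }

    [Benchmark(Baseline = true)]
    public int Count_Jumps()
    {
        int count = 0;

        // IsReservedCharacter_Jumps is the helper listed above; the other
        // Count_* benchmarks follow the same pattern with the other helpers
        foreach (char c in text)
        {
            if (IsReservedCharacter_Jumps(c)) count++;
        }

        return count;
    }
}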

I ran this both on my notebook with an Intel i7-8750H and on my desktop with a Ryzen 2700X.

Intel i7-8750H
| Method | Length | Mean | Error | StdDev | Ratio | RatioSD |
|------------------|-------:|------------:|----------:|----------:|------:|--------:|
| Count_Jumps | 128 | 213.6 ns | 0.74 ns | 0.65 ns | 1.00 | 0.00 |
| Count_EgorBo | 128 | 120.0 ns | 0.25 ns | 0.23 ns | 0.56 | 0.00 |
| Count_EgorBo_2 | 128 | 120.9 ns | 0.86 ns | 0.76 ns | 0.57 | 0.00 |
| Count_Trick | 128 | 186.9 ns | 0.68 ns | 0.63 ns | 0.88 | 0.00 |
| Count_Table | 128 | 158.5 ns | 0.34 ns | 0.29 ns | 0.74 | 0.00 |
| Count_Table_Jump | 128 | 112.5 ns | 0.25 ns | 0.22 ns | 0.53 | 0.00 |
| Count_Jumps | 512 | 890.7 ns | 5.29 ns | 4.95 ns | 1.00 | 0.00 |
| Count_EgorBo | 512 | 550.2 ns | 10.99 ns | 20.63 ns | 0.63 | 0.03 |
| Count_EgorBo_2 | 512 | 464.5 ns | 2.20 ns | 1.84 ns | 0.52 | 0.00 |
| Count_Trick | 512 | 778.3 ns | 6.85 ns | 6.08 ns | 0.87 | 0.01 |
| Count_Table | 512 | 648.8 ns | 3.03 ns | 2.83 ns | 0.73 | 0.00 |
| Count_Table_Jump | 512 | 423.3 ns | 2.23 ns | 1.98 ns | 0.48 | 0.00 |
| Count_Jumps | 1024 | 1,736.3 ns | 5.06 ns | 4.74 ns | 1.00 | 0.00 |
| Count_EgorBo | 1024 | 1,177.6 ns | 23.20 ns | 30.97 ns | 0.68 | 0.02 |
| Count_EgorBo_2 | 1024 | 1,087.9 ns | 13.81 ns | 12.92 ns | 0.63 | 0.01 |
| Count_Trick | 1024 | 1,662.8 ns | 13.82 ns | 12.92 ns | 0.96 | 0.01 |
| Count_Table | 1024 | 1,341.6 ns | 6.67 ns | 6.24 ns | 0.77 | 0.00 |
| Count_Table_Jump | 1024 | 874.4 ns | 5.80 ns | 5.43 ns | 0.50 | 0.00 |
| Count_Jumps | 4096 | 7,632.2 ns | 35.06 ns | 31.08 ns | 1.00 | 0.00 |
| Count_EgorBo | 4096 | 10,284.7 ns | 68.62 ns | 64.18 ns | 1.35 | 0.01 |
| Count_EgorBo_2 | 4096 | 13,445.2 ns | 34.55 ns | 30.63 ns | 1.76 | 0.01 |
| Count_Trick | 4096 | 6,658.4 ns | 20.24 ns | 18.94 ns | 0.87 | 0.00 |
| Count_Table | 4096 | 6,541.2 ns | 31.64 ns | 29.60 ns | 0.86 | 0.01 |
| Count_Table_Jump | 4096 | 8,088.0 ns | 154.44 ns | 158.60 ns | 1.06 | 0.02 |
Ryzen 2700X
| Method | Length | Mean | Error | StdDev | Ratio |
|------------------|-------:|------------:|---------:|---------:|------:|
| Count_Jumps | 128 | 309.7 ns | 2.91 ns | 2.72 ns | 1.00 |
| Count_EgorBo | 128 | 153.3 ns | 1.84 ns | 1.72 ns | 0.50 |
| Count_EgorBo_2 | 128 | 153.8 ns | 1.10 ns | 1.03 ns | 0.50 |
| Count_Trick | 128 | 186.6 ns | 1.35 ns | 1.19 ns | 0.60 |
| Count_Table | 128 | 165.8 ns | 1.73 ns | 1.53 ns | 0.54 |
| Count_Table_Jump | 128 | 121.6 ns | 1.69 ns | 1.58 ns | 0.39 |
| Count_Jumps | 512 | 1,577.2 ns | 10.42 ns | 9.74 ns | 1.00 |
| Count_EgorBo | 512 | 723.2 ns | 1.00 ns | 0.94 ns | 0.46 |
| Count_EgorBo_2 | 512 | 696.4 ns | 1.72 ns | 1.52 ns | 0.44 |
| Count_Trick | 512 | 872.8 ns | 2.86 ns | 2.54 ns | 0.55 |
| Count_Table | 512 | 856.4 ns | 9.18 ns | 8.59 ns | 0.54 |
| Count_Table_Jump | 512 | 634.4 ns | 4.31 ns | 3.37 ns | 0.40 |
| Count_Jumps | 1024 | 3,206.0 ns | 28.35 ns | 25.14 ns | 1.00 |
| Count_EgorBo | 1024 | 1,383.3 ns | 6.73 ns | 6.29 ns | 0.43 |
| Count_EgorBo_2 | 1024 | 1,670.3 ns | 4.63 ns | 4.33 ns | 0.52 |
| Count_Trick | 1024 | 1,814.8 ns | 5.71 ns | 5.34 ns | 0.57 |
| Count_Table | 1024 | 1,726.2 ns | 15.26 ns | 14.27 ns | 0.54 |
| Count_Table_Jump | 1024 | 1,349.2 ns | 9.42 ns | 8.81 ns | 0.42 |
| Count_Jumps | 4096 | 12,268.3 ns | 65.73 ns | 61.48 ns | 1.00 |
| Count_EgorBo | 4096 | 9,958.6 ns | 11.45 ns | 9.56 ns | 0.81 |
| Count_EgorBo_2 | 4096 | 9,458.6 ns | 20.59 ns | 16.07 ns | 0.77 |
| Count_Trick | 4096 | 7,502.7 ns | 13.46 ns | 11.93 ns | 0.61 |
| Count_Table | 4096 | 7,303.2 ns | 46.32 ns | 43.32 ns | 0.60 |
| Count_Table_Jump | 4096 | 8,560.2 ns | 24.29 ns | 22.72 ns | 0.70 |

My takeaways and some questions:

  1. Both on my i7-8750H and my Ryzen 2700X, with the shorter tests (less than 4096 in length) the fastest version was often the last one (Count_Table_Jump), which uses both a conditional jump and a ReadOnlySpan<byte> lookup table. Even assuming the CPU keeps the span's data in cache, how can this be faster than EgorBo's method, which only works on registers? I've run this multiple times, same result. I'm completely lost here 😶
  2. On my i7-8750H, in the test with size 4096, EgorBo's version seems to completely fall apart and ends up slower than even the naive version with the individual checks. I have no idea how that's possible either, to be honest, considering the naive version literally has 10 branches.
  3. In this last test, both on my i7 and my Ryzen, the two versions using my trick to skip the conditional branch entirely take the lead, being quite a bit faster than all the other versions: almost twice as fast on my i7, and about ~30% faster on my Ryzen. This makes me think that with longer sequences of text, having a conditional branch in the code can really affect performance. Again though, I just don't understand how the Count_Trick and Count_Table methods can be virtually identical in this test, considering one only works on local registers while the other accesses a lookup table, so it has at least one memory access that should definitely be slower (even if cached). I'm a bit confused on this point 🤔

cc. @EgorBo as I'm sure you'd be able to provide some insights on this, given that you're an expert on the matter. Besides, hopefully this is somewhat interesting for you as well 😊

NOTE: I'm not an expert and I might very well have made some mistakes in my benchmark, or be missing something very obvious here, so if that's the case please bear with me ahahah

Sergio0694 · Mar 28 '20

Can you share the whole benchmark project?

I've spent quite a bit of time hammering away at Markdig code yesterday so I'll try to get some numbers now.

MihaZupan · Mar 28 '20

@MihaZupan Oh it's literally just that gist, copy-pasted into a blank .NET Core 3.1 project referencing BenchmarkDotNet. But sure, if the whole project helps, here it is: MarkdigCharacterPatternBenchmark.zip.

Looking forward to hearing about your findings! 😄

Sergio0694 · Mar 28 '20

Oh sorry, missed that you already shared it

MihaZupan · Mar 28 '20

public static bool Test1(char c)
{
    c = (char)(c - 40);

    if (c > 53)
        return false;

    byte flag = unchecked((byte)((11258999073931387ul >> c) & 1));
    return Unsafe.As<byte, bool>(ref flag);
}

Can be written as

public static bool Test2(char c)
{
    int test = c - 40;

    if ((uint)test > 53)
        return false;

    byte flag = (byte)((11258999073931387ul >> test) & 1);
    return Unsafe.As<byte, bool>(ref flag);
}

to get a bit better codegen

C+Helpers.Test1(Char)
    L0000: movzx ecx, cx
    L0003: add ecx, 0xffffffd8
    L0006: movzx ecx, cx
    L0009: cmp ecx, 0x35
    L000c: jle L0011
    L000e: xor eax, eax
    L0010: ret
    L0011: movzx ecx, cx
    L0014: mov rax, 0x2800000054007b
    L001e: shr rax, cl
    L0021: and eax, 0x1
    L0024: ret

C+Helpers.Test2(Char)
    L0000: movzx ecx, cx
    L0003: add ecx, 0xffffffd8
    L0006: cmp ecx, 0x35
    L0009: jbe L000e
    L000b: xor eax, eax
    L000d: ret
    L000e: mov rax, 0x2800000054007b
    L0018: shr rax, cl
    L001b: and eax, 0x1
    L001e: ret

I'm seeing some pretty nice numbers with this. Didn't think an unsafe cast to bool could improve things until you pointed it out.

MihaZupan · Mar 28 '20

Ah, yeah that's nice, as it removes that extra movzx; I'm doing the same in my versions too 👍 I'm curious whether such optimizations could make it into LLVM (and possibly RyuJIT) as well, as I see that LLVM currently uses an extra branch for the == 1 part:

  bt rdx, rcx
  jae .LBB0_2
  ret
.LBB0_2:
  xor eax, eax
  ret

Looks like this could be skipped entirely if it could just compile the equivalent of that byte cast and reinterpret cast we're using, which should improve the performance another fair bit there 😊

Would you happen to have any ideas about the questions in my previous message, though? I'd really love to understand why I'm seeing those numbers, as some of the results just seem very weird to me (eg. the method with the lookup table being the fastest for some reason). I feel like there's some key insight I'm missing that would make it all click, but I can't see it 🤔

Sergio0694 · Mar 29 '20

Since this was really bugging me, I ran some more tests, and it looks like, at least on my Ryzen 2700X and using one of the sample scripts from my app, the following things are true:

  1. Removing the conditional branch in EgorBo's code roughly doubles performance.
  2. Using a lookup table instead of a local ulong bit mask is... faster?! 😶

Here's the benchmark run:

| Method | Mean | Error | StdDev | Ratio |
|-----------------------|---------:|----------:|----------:|------:|
| Count_EgorBo_Branch | 3.786 us | 0.0047 us | 0.0044 us | 1.00 |
| Count_EgorBo_NoBranch | 2.093 us | 0.0251 us | 0.0222 us | 0.55 |
| Count_Lookup_Branch | 2.789 us | 0.0047 us | 0.0044 us | 0.74 |
| Count_Lookup_NoBranch | 1.764 us | 0.0192 us | 0.0161 us | 0.47 |

Intel i7-8750H results (for comparison)

| Method | Mean | Error | StdDev | Ratio |
|-----------------------|---------:|----------:|----------:|------:|
| Count_EgorBo_Branch | 4.384 us | 0.0307 us | 0.0272 us | 1.00 |
| Count_EgorBo_NoBranch | 2.456 us | 0.0031 us | 0.0026 us | 0.56 |
| Count_Lookup_Branch | 2.833 us | 0.0165 us | 0.0147 us | 0.65 |
| Count_Lookup_NoBranch | 1.907 us | 0.0020 us | 0.0019 us | 0.43 |

Full benchmark gist here. Also, here are the four tested methods:

Count_EgorBo_Branch (optimized EgorBo's solution)
public static bool Count_EgorBo_Branch(char c)
{
    int i = c - 40;

    if ((uint)i > 63) return false;

    const ulong bits = 11258999073931387ul;

    byte result = unchecked((byte)((bits >> i) & 1));
    bool flag = Unsafe.As<byte, bool>(ref result);

    return flag;
}
Count_EgorBo_NoBranch (branchless EgorBo's solution)
public static bool Count_EgorBo_NoBranch(char c)
{
    int i = c - 40;

    bool isInRange = (uint)i <= 63;
    byte byteFlag = Unsafe.As<bool, byte>(ref isInRange);
    int
        negativeFlag = byteFlag - 1,
        mask = ~negativeFlag,
        shift = unchecked((int)((11258999073931387ul >> i) & 1)),
        and = shift & mask;
    byte result = unchecked((byte)and);
    bool valid = Unsafe.As<byte, bool>(ref result);

    return valid;
}
IsOperator_Lookup_Branch (lookup table)
public static bool IsOperator_Lookup_Branch(char c)
{
    int i = c;

    if ((uint)i >= (uint)OperatorsLookupTable.Length) return false;

    ref byte r0 = ref MemoryMarshal.GetReference(OperatorsLookupTable);
    ref byte r1 = ref Unsafe.Add(ref r0, i);
    bool result = Unsafe.As<byte, bool>(ref r1);

    return result;
}
IsOperator_Lookup_NoBranch (optimized lookup table)
public static bool IsOperator_Lookup_NoBranch(char c)
{
    int i = c;

    bool isInRange = (uint)i < (uint)OperatorsLookupTable.Length;
    byte rangeFlag = Unsafe.As<bool, byte>(ref isInRange);
    int
        negativeFlag = rangeFlag - 1,
        mask = ~negativeFlag,
        offset = i & mask;
    ref byte r0 = ref MemoryMarshal.GetReference(OperatorsLookupTable);
    ref byte r1 = ref Unsafe.Add(ref r0, offset);
    bool result = Unsafe.As<byte, bool>(ref r1);

    return result;
}

It makes sense that Count_EgorBo_NoBranch is faster, due to the removal of the conditional check. What really boggles my mind though is how Count_Lookup_NoBranch can be faster, despite having essentially the same code as the previous method plus a memory access, which I figured would introduce at least some overhead.

Furthermore, both versions using the lookup table (with and without the conditional branch) are faster than EgorBo's original method, which makes it seem like using a lookup table is in general just faster than a local bit shift on a constant value in a register. I find that very odd, since the latter doesn't touch memory at all, as mentioned above.

I'm really confused about this and I'd really love to understand what's going on 😄

cc. @EgorBo as the original code is his, and @MihaZupan who is already following this thread anyway.

EDIT: apparently there are a couple of reasons for the numbers I'm seeing here, all credit to @saucecontrol for the explanation. I'll just quote his messages directly for future reference:

L1 cache is very fast, so if you eliminate a bunch of instructions by hitting the cache once, you'll win. but benchmarks will keep the value in cache when it might not be there during normal use. always take microbenchmarks with a grain of salt

And in response to why the lookup table is faster even though EgorBo's version only uses pretty fast opcodes (like shr and and), which I wouldn't have expected to have much overhead:

yep, shr and and are fast, but there's a dependency chain between them so they can't execute in parallel. but yeah, that would have more predictable performance where the lookup will vary based on what's in the cache. when they're close, you almost always want to favor predictable performance

That last sentence was in reply to my question about whether it'd be a good rule of thumb in this case to favor the version using the bitwise lookup table as a ulong value, instead of the one using the static ReadOnlySpan<byte> table, given that it's just about 15% slower but has the advantage of offering consistent performance that is not influenced by the state of the cache.

Sergio0694 · Apr 04 '20

I want to point out that, regardless of the performance gains, the code is also more secure, since eliminating branches removes potential side-channel vulnerabilities. Given that this kind of library would be very useful for processing large amounts of sensitive data, this is a big improvement.

sharpninja · Apr 05 '20