
Check out source generation performance

Open martinothamar opened this issue 3 years ago • 17 comments

Investigate how fast source generation is, and whether it should be improved, etc.

martinothamar avatar May 28 '21 06:05 martinothamar

Not directly related to source generator performance, but to the performance of the generated source. Currently the source generator emits large switch statements for e.g. the Send method, something like:

        public global::System.Threading.Tasks.ValueTask<TResponse> Send<TResponse>(
            global::Mediator.IRequest<TResponse> request,
            global::System.Threading.CancellationToken cancellationToken = default
        )
        {
            switch (request)
            {
                case global::MyQuery r:
                {
                    var task = Send(r, cancellationToken);
                    return global::System.Runtime.CompilerServices.Unsafe.As<global::System.Threading.Tasks.ValueTask<global::MyResponse>, global::System.Threading.Tasks.ValueTask<TResponse>>(ref task);
                }

                // ... one case per request type ...
            }
        }

This scales rather badly when there are a lot of Requests (I tried it with a solution containing 2500 Requests).

There is a concept for statically typed dictionaries, similar to https://github.com/asynkron/protoactor-dotnet/blob/dev/src/Proto.Actor/Utils/TypedDictionary.cs. The idea is to use generic types in a static class to generate consecutive numbers that index into an array of values. From a hashing point of view this is a perfect minimal hash (all input values are known). I put a simplified version into a demo repo: https://github.com/Tornhoof/StaticTypeDictionaryBenchmark
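
For reference, a rough sketch of what such a static type dictionary can look like (simplified, with illustrative names rather than the demo repo's exact API):

    using System;
    using System.Threading;

    // Each closed generic TypeKey<TKey> gets the next consecutive integer the
    // first time it is touched, and that integer indexes a plain array -
    // effectively a minimal perfect hash over the set of known types.
    public static class TypeDictionary<TValue>
    {
        private static int s_counter;
        private static TValue[] s_values = new TValue[16];

        private static class TypeKey<TKey>
        {
            public static readonly int Id = Interlocked.Increment(ref s_counter) - 1;
        }

        public static void Set<TKey>(TValue value)
        {
            int id = TypeKey<TKey>.Id;
            if (id >= s_values.Length)
                Array.Resize(ref s_values, Math.Max(s_values.Length * 2, id + 1));
            s_values[id] = value;
        }

        public static TValue Get<TKey>() => s_values[TypeKey<TKey>.Id];
    }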

A basic benchmark looks like this, where SwitchRequestN is the switch/case concept and StaticSwitchRequestN is the static type dictionary concept.


BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
Intel Core i7-10700 CPU 2.90GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=6.0.300
  [Host]   : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
  ShortRun : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  

| Method | Mean | Error | StdDev |
|--- |---: |---: |---: |
| SwitchRequest1 | 177.936 ns | 2.7435 ns | 0.1504 ns |
| SwitchRequest100 | 537.292 ns | 63.2024 ns | 3.4643 ns |
| SwitchRequest200 | 868.002 ns | 6.9180 ns | 0.3792 ns |
| SwitchRequest300 | 1,213.257 ns | 12.0224 ns | 0.6590 ns |
| SwitchRequest400 | 1,570.105 ns | 505.7319 ns | 27.7209 ns |
| SwitchRequest500 | 1,954.081 ns | 333.3473 ns | 18.2719 ns |
| SwitchRequest600 | 2,296.974 ns | 380.4903 ns | 20.8560 ns |
| SwitchRequest700 | 2,614.459 ns | 375.7176 ns | 20.5944 ns |
| SwitchRequest800 | 3,001.314 ns | 55.9098 ns | 3.0646 ns |
| SwitchRequest900 | 3,542.332 ns | 81.5554 ns | 4.4703 ns |
| SwitchRequest1000 | 3,964.454 ns | 55.4060 ns | 3.0370 ns |
| StaticSwitchRequest1 | 4.188 ns | 0.2662 ns | 0.0146 ns |
| StaticSwitchRequest100 | 4.837 ns | 1.7584 ns | 0.0964 ns |
| StaticSwitchRequest200 | 4.529 ns | 0.0561 ns | 0.0031 ns |
| StaticSwitchRequest300 | 4.963 ns | 0.1827 ns | 0.0100 ns |
| StaticSwitchRequest400 | 4.530 ns | 0.0132 ns | 0.0007 ns |
| StaticSwitchRequest500 | 4.658 ns | 0.3519 ns | 0.0193 ns |
| StaticSwitchRequest600 | 4.533 ns | 0.6353 ns | 0.0348 ns |
| StaticSwitchRequest700 | 4.593 ns | 0.3906 ns | 0.0214 ns |
| StaticSwitchRequest800 | 4.511 ns | 0.0095 ns | 0.0005 ns |
| StaticSwitchRequest900 | 4.511 ns | 0.0425 ns | 0.0023 ns |
| StaticSwitchRequest1000 | 4.511 ns | 0.0292 ns | 0.0016 ns |

As you can see here, the later/deeper the Request sits in the switch, the slower the call gets, while the static version runs at constant speed with a single array lookup.

Bottom line: consider getting rid of the large switches; since we know all valid cases, we can use that knowledge to create perfect hashes.

Tornhoof avatar May 31 '22 12:05 Tornhoof

Cool approach! I didn't know switch statements scaled so poorly, but I guess it makes sense since all we have is IRequest<>... it has to be a runtime check.

So since we don't have the concrete T for the request, will we be able to use that approach? Since in this piece

	public static int Switch<T>(T t) where T : class
	{
		var r = TypeIndex<IRequest>.Get<T>();
		return r?.Value ?? 0;
	}

T will be IRequest<TResponse> - so

        public static TValue Get<TKey>() => s_Values[TypeKey<TKey>.Id];

So TKey here will collide where requests have the same response type? Am I right in assuming we have to use the runtime type (i.e. request.GetType()) for the IRequest<TResponse> case?

martinothamar avatar May 31 '22 18:05 martinothamar

We need to know the concrete T for the static type dict trick to work, that is true. We only know the concrete TResponse at that point. So what would be possible is to split the large switch into a lot of smaller switches in separate generic methods, one for each known TResponse. The largest one would be Unit, but it is often the case that there are not many different concrete IRequest<TResponse> implementations per concrete TResponse. I looked through my code bases, and in ~50% of my requests the TResponse is used exactly once, i.e. a 1:1 mapping; the other 50% vary between 2 and 10.
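
For illustration, the generated code for one such per-TResponse method might look roughly like this (a hypothetical sketch; the request type names are invented, not taken from the real generator output):

    public global::System.Threading.Tasks.ValueTask<global::MyResponse> SendForMyResponse(
        global::Mediator.IRequest<global::MyResponse> request,
        global::System.Threading.CancellationToken cancellationToken = default
    )
    {
        // Only the handful of request types that produce MyResponse end up here,
        // so the switch stays small, and no Unsafe.As cast is needed because the
        // response type is already concrete.
        switch (request)
        {
            case global::GetUserQuery r: return Send(r, cancellationToken);
            case global::FindUserQuery r: return Send(r, cancellationToken);
            default: throw new global::System.ArgumentException("Unknown request type", nameof(request));
        }
    }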

The switch statement also uses the IL isinst instruction, which is expensive enough to be slow. See https://sharplab.io/#v2:EYLgHgbALANALiAhgZwLYB8ACAmAjAWAChMAGAAk1wDoAlAVwDs4BLVAUyoGEB7VAB2YAbNgCcAyqIBuzAMZtkAbiJFMAZgrYynMgG8iZA2QDalWoxbsuvAcPFTZ8qgFk2cABbcAJgEl+ggBSm9EysHDz8QqISItJyyM6uHj5+APJ8LNwM8QCCAOa5IvLIzJJs3gyCzAxVuQCUALr6hmpkVXBkTv7eNGwAjnTy7SK1uk2GhsgA7sxwMm7+w2Pjo4TLa2QyKGxkACJkYCBL68t6q8fnBpgA7PtUAGqIggNKZxfjAL5H65vI2wCi+0OrzeBlOIOO11uDyebBe4MMn2BCK+FBuJDhyOBSxM1GCFjC1kidhiDniLncXl8fACQXMoSsEVs0VijnJSSpgjSGSyVDyBSKJTKFSqNQaSxabQ62C6PX6gzIwxWKNaADMFq1kLsyJ5asqwW9IZ57o9nsrEcc2IJfqr1cxNQC2LqkYZ9RdIRxoabnQZzWtIeilu9xktsbSQpZwjYovY4gkKclqYFcXSI4SmTHWYlKal0sxMjl8oVkMVSuVKtUGHVGoQAJASpgdVQyvoDZBDJ01041mvMFVkBZUADirgAKgBPPhsfwjAC8M7IcAnbG4ap2tQ7naI3ZrhT7kkQIm1ZHnu4oyfDBMZ0ZJsYAqllECqONlkAAebotwYwXYAPgWbD7YYXm3Q1jRhYCa0Rbte37EQh1HJdp2PedF0nFd/D+dct03Wtu1PfdD22E8ALPMwLwZKNiRZeJ72QR9nzfD85Tbb8/j/U8gOwusbg9E1YWwqDuLIANayDAwiERFR1DaUQVUQOQyCY1s4CIV1WgbT1th0XJXAURFJOIdQcC1EBFNlZTVPFaSNL448fzIKAXgMlpjIBUylMGSzgXrdpNLssgAFYnKAA=== for different implementations.

The fastest is something like this, similar to how you already use Unsafe.As:

    [System.Runtime.CompilerServices.MethodImpl(System.Runtime.CompilerServices.MethodImplOptions.AggressiveInlining)]
    public int M3(IRequest r)
    {
        if (r.GetType() == typeof(D))
        {
            ref var d = ref System.Runtime.CompilerServices.Unsafe.As<IRequest, D>(ref r);
            return d.Value;
        }
        if (r.GetType() == typeof(E))
        {
            ref var e = ref System.Runtime.CompilerServices.Unsafe.As<IRequest, E>(ref r);
            return e.Value;
        }
        return 0;
    }

It is important that GetType() is repeated each time and not stored in a variable, because otherwise a lot of optimizations don't kick in. AggressiveInlining is also required; I don't really know why. A preliminary benchmark for that version puts it roughly 3-4 times faster than the switch/case:


BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-preview.3.22167.1
  [Host]   : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT
  ShortRun : .NET 6.0.5 (6.0.522.21309), X64 RyuJIT

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  

| Method | Mean | Error | StdDev | Code Size |
|--- |---: |---: |---: |---: |
| SwitchRequest1 | 101.6320 ns | 2.2133 ns | 0.1213 ns | 65,974 B |
| SwitchRequest1000 | 2,095.2218 ns | 498.7225 ns | 27.3367 ns | 65,974 B |
| UnsafeSwitchRequest1 | 0.3906 ns | 0.0038 ns | 0.0002 ns | 41,382 B |
| UnsafeSwitchRequest1000 | 609.4602 ns | 98.1695 ns | 5.3810 ns | 41,388 B |

Note the large difference in code size. Combining this approach with splitting the methods by TResponse should keep the individual methods small and increase performance a lot for large systems. Large methods always have the problem that many JIT optimizations stop working after e.g. 512 locals (which in our case would be roughly 500 ifs).

In general it would be possible to write a minimal perfect hash algorithm that maps all request types onto a dense [0..n] array index in all cases, but unfortunately there is no static type information available that could be used for that, so it would need to be done at runtime on e.g. the type GUID, and that would be a rather slow init process; finding a minimal perfect hash is essentially brute force.

Tornhoof avatar May 31 '22 20:05 Tornhoof

Yup, sounds like a really good solution 👍 I'm pretty busy for the next couple of days but I'll try to look into this as soon as possible

martinothamar avatar Jun 01 '22 07:06 martinothamar

I took your benchmark branch (with the 1000 extra handlers) and tried to implement several different concepts to optimize the switch/case for the requests, but never achieved anything worthwhile, maybe around 10% after some nasty unsafe code, and nothing even remotely like the gains from the sharplab example above.

Which leads me back to the TypeDictionary; you've used that for notifications in your branch. We still have the problem that we don't really have a proper compile-time identifier for each request.

So why not add one via source generation? For each request type deriving from IRequest, add a source-generated partial type implementing some id interface with an Id, and then use that Id to access the handler array directly. Since we control which id is put into the generated partial type, we can simply use it as a linear index, something like:

	public sealed partial class Request0 : IRequestIdentifier
	{
		int IRequestIdentifier.RequestId => 0;
	}

This is not the ideal solution, but probably the most performant one. There are some interesting type-system changes ahead in C# 11 (maybe 12), where it will be possible to scope certain types to a file; maybe this can help in the future to hide the generated code from the user (better than just explicit interfaces).
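
For illustration, dispatch could then look roughly like this (names and the handler array layout are invented here, not the actual generated code):

    // Handlers live in a plain array populated by generated registration code,
    // indexed by the source-generated RequestId, so dispatch is a single
    // interface cast plus one indexed load instead of a switch over all types.
    private readonly object[] _handlers;

    public object GetHandler(object request) =>
        _handlers[((IRequestIdentifier)request).RequestId];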

Tornhoof avatar Jul 01 '22 19:07 Tornhoof

I asked Jared Parsons for advice on Twitter (https://twitter.com/torn_hoof/status/1543631166686863366). He basically confirmed that there are no constant identifiers we could use out of the box. He suggested checking the type's full name first and then doing the isinst check, but that doesn't make a big difference. I have some minimal perfect hashing code around somewhere, time to try that out.

Tornhoof avatar Jul 05 '22 20:07 Tornhoof

I tried out minimal perfect hashes. Conceptually there are two major ways of creating the data for a minimal perfect hash over the full type names.

First way

You take a known hash function with a seed and a displacement/intermediate map that maps all inputs via the displacement table to the output; this might mean running the hash twice. See http://stevehanov.ca/blog/index.php?id=119 for a detailed explanation. The hash method needs to have a decent distribution, otherwise calculating the displacement/intermediate table (by modifying the seed for a specific input value) takes too long. I used XXHash32 (integrated into .NET by now) as the hash method. This works, but each hash calculation takes ~32ns on my system, which makes it slower than the C# dictionary with the Type as the key: the Dictionary Type lookup is 35ns on my system, while the hashed version took between 45 and 90ns. The full source is here: https://github.com/Tornhoof/StaticTypeDictionaryBenchmark/blob/main/PerfectMinimalHashGenerator.cs
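
The lookup side of that scheme is roughly this (a minimal sketch assuming precomputed tables; the seeded hash below is a simple FNV-style stand-in for XXHash32, just to keep the example self-contained):

    // Two-level lookup as in the Steve Hanov article: the first hash picks an
    // entry from the displacement table; negative entries point directly at a
    // value slot, positive entries are the seed for a second hash.
    static class MphLookup
    {
        // Both tables would be precomputed (e.g. by the source generator).
        private static readonly int[] s_displacement = new int[0];
        private static readonly int[] s_values = new int[0];

        private static uint Hash(uint seed, string key)
        {
            uint h = seed == 0 ? 2166136261u : seed; // seeded FNV-1a stand-in
            foreach (char c in key)
                h = (h ^ c) * 16777619u;
            return h;
        }

        public static int Lookup(string typeFullName)
        {
            uint n = (uint)s_displacement.Length;
            int d = s_displacement[Hash(0, typeFullName) % n];
            return d < 0
                ? s_values[-d - 1]
                : s_values[Hash((uint)d, typeFullName) % n];
        }
    }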

Second way

Instead of using a fixed hash function, the hash function is built based on the input data. For example, the test requests are basically Namespace.Request1-100, so the method takes the last few characters to calculate the hash, and the hash function ends up very simple, something like typeName[^2] + typeName[^1] + 5, which then maps to the linear index. This is very fast because the hash function is trivial. It is basically what gperf does (I ran gperf on the input data for this). See http://www.dre.vanderbilt.edu/~schmidt/PDF/gperf.pdf for details. The lookup takes ~4ns on my system, or ~6ns if gperf outputs a switch table. The sources for the gperf output converted to C# are https://github.com/Tornhoof/StaticTypeDictionaryBenchmark/blob/main/GPerf.cs and https://github.com/Tornhoof/StaticTypeDictionaryBenchmark/blob/main/GPerf2.cs

Conclusion

The first way is too slow, but the gen code is trivial (~100 lines of code). The second way is fast, but the gen code is very complex (gperf is several thousand lines of code).

The idea works, but writing the code required to implement the second way is well out of scope. gperf is also not designed for many keys; we're talking hundreds, not thousands.

Tornhoof avatar Jul 06 '22 12:07 Tornhoof

Hey, sorry for the slow answer... Thanks for looking into all of this. I've been busy with different stuff for a while now, but I hope to get back to this during the summer. This was surprisingly hard 😄 my initial testing didn't get me anywhere. As you also found, optimizing the branching approach we have now will never be ideal perf-wise. Gperf sounded really interesting, but also complex as you note.

I've come to the conclusion that if we want the IRequest<TResponse> path to be optimized (close to 0 overhead), we have to generate some identifier in a source generator (there is no way for us to attach data to System.Type). That is what you suggest above with IRequestIdentifier, right? Then we can read it both at source-gen time (to assign the correct array indices) and when we receive the IRequest<TResponse>. This is a breaking change though, and all requests etc. would have to be made partial... As an alternative, maybe we could add the identifier as an attribute using IL weaving instead?

I need to read up on the papers/articles you linked above, and I want to do some more testing/experimentation before deciding anything. Again thanks for all the help on this

martinothamar avatar Jul 06 '22 22:07 martinothamar

Yeah, that's my suggestion with IRequestIdentifier. There are ways to use attributes, e.g. GuidAttribute (primarily used for COM), which then shows up as Type.GUID, but even switching on something that integrated into the type system is a lot slower than having our own identifier. Switching on our own custom attributes is very slow, since attribute lookup at runtime via the type system is not fast, most likely slower than a Dictionary<Type, Handler>. There are ways to improve comparing complex identifiers, like if-ladders where only parts of the id are compared until you hit the correct one. I wrote something like that for SpanJson (https://github.com/Tornhoof/SpanJson/wiki/Technology-and-Internals#2-nested-if-approach), but it does not scale well above ~100 values.

So the preferred way is still some indexable id and I don't see any way around srcgen for that at the moment.

Tornhoof avatar Jul 07 '22 06:07 Tornhoof

.NET 8 will have "Frozen" collections that optimize access based on the input data (I'm not sure whether it's an implementation of perfect hashing): https://github.com/dotnet/runtime/pull/77799

Would that simplify the generated code?

irunjic avatar Dec 05 '22 11:12 irunjic

Very cool, it seems like there are a lot of different optimizations, especially for string keys. I don't think it can compete with source-generated indices, though there is a complication with the source-generated indices: the running process (where Mediator is generated) can refer to multiple projects defining their own sets of indices, which can overlap. So when building the running project, these indices would have to be computed in some way based on the containing assembly/project so that collisions are avoided. So basically I'm not sure; I haven't done any groundwork on this yet unfortunately, but we should definitely explore both and benchmark. Thanks for bringing it up!
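
For reference, a minimal sketch of what a FrozenDictionary-based lookup could look like (assuming .NET 8's System.Collections.Frozen; the key/value types here are just stand-ins, not Mediator's generated types):

    using System;
    using System.Collections.Frozen;
    using System.Collections.Generic;

    static class FrozenLookupSketch
    {
        // Built once at startup; FrozenDictionary optimizes its internal layout
        // for the known, fixed set of keys.
        private static readonly FrozenDictionary<Type, string> s_handlers =
            new Dictionary<Type, string>
            {
                [typeof(int)] = "handler for int",       // stand-ins for real request types
                [typeof(string)] = "handler for string",
            }.ToFrozenDictionary();

        public static string Resolve(object request) => s_handlers[request.GetType()];
    }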

martinothamar avatar Dec 05 '22 14:12 martinothamar