Add a proposal which suggests updating the xarch baseline target
As per the doc, I propose that the minimum required hardware baseline for x86/x64 on .NET be changed from x86-64-v1 to x86-64-v2.
CC @richlander, @jkotas, @MichalStrehovsky, @davidwrighton
Looks reasonable in general, though I'd like to touch on one specific point:
> However, with the introduction of Native AOT we now have a higher consideration for scenarios where recompilation is not possible and therefore whatever the precompiled code targets is what all it has access to. There are some ways around this such as dynamic ISA checks or compiling a given method twice with dynamic dispatch selecting the appropriate implementation, but this comes with various downsides and often requires restructuring code in a way that can make it less maintainable.
Why not go for the best-of-both-worlds approach? Build and ship IL, and have AOT compilation occur on the destination machine at installation time, rather than JITting at runtime or AOTing at build time? This is essentially what Android does and it works quite well there.
> This is essentially what Android does and it works quite well there.
This works when the runtime is part of the OS (or part of a large app with a complex installer) and the OS can manage the app lifecycle.
It does not work well for runtimes that ship independently of the OS, like the .NET runtime does today.
> Build and ship IL, and have AOT compilation occur on the destination machine at installation time, rather than JITting at runtime or AOTing at build time? This is essentially what Android does and it works quite well there.
One consideration is that Android owns the OS and so is able to guarantee the tools required to do that are available. They also don't support concepts like "xcopy" deployment of apps and centralize acquisition via their app store.
I think doing the same for .NET would be pretty awesome, but it also comes with considerations like a larger deployment mechanism and other potential negative side effects.
Crossgen2 for a higher baseline is pretty much like this already, but without many of the drawbacks.
A few angles to consider:
- Microsoft official build vs. source build: We can bump the baseline for the Microsoft official build, keep the code for x86 v1 around for the time being, and tell anybody who really needs it to build their own bits from sources. (In other words, drop the x86 v1 support level to community supported.)
- 32-bit vs. 64-bit: We can consider keeping the baseline for 32-bit and raising it only on 64-bit.
Sure, NativeAOT isn't built into the OS. How much work would it be to integrate it with a standard installer-generator system like MSI, though? It would never be The Standard, but it would at least be available for developers in the know.
> Sure, NativeAOT isn't built into the OS. How much work would it be to integrate it with a standard installer-generator system like MSI, though?
It depends on what your requirements are. You can certainly do some variant of it on your own.
I do not expect we (Microsoft .NET team) will provide or recommend a solution like this. It would not pass our security signoff.
Huh. That's not the objection I'd have expected to see. What are the security concerns here?
For example, the binaries cannot be signed.
Wasn't signing eliminated from Core a few versions ago anyway? I remember that one pretty clearly because there were breaking changes in .NET 6 that broke my compiler, and when I complained about it the team refused to make even the most inconsequential of changes to alleviate the compatibility break.
I am not talking about strong-name signing. I am talking about Microsoft Authenticode, Apple app code signing, and similar types of signatures.
All right. So how does Android handle it?
I do not know the details on how Android handles this. I can tell you what it involved to make this scheme work with .NET Framework: NGen service process was recognized as a special process by the Windows OS that was allowed to vouch for authenticity of its output. It involved hardening like disallowing debugger attach to the NGen service process (again, another special service provided by the Windows OS) so that you cannot tamper with its execution.
Yeah, that makes sense. The AOT compiler has to be in a position of high trust for a scheme like that to work. Joe Duffy said something very similar about the Midori architecture.
Is this related to https://github.com/dotnet/designs/pull/173? Or is this one more about AOT?
Would this also cover how we compile the native parts of the (non-AOT) runtimes (GC, CoreCLR VM, etc.)?
My main concern would be the user experience for the minority of users that don't meet this requirement - I'd like to avoid that experience being STATUS_ILLEGAL_INSTRUCTION with a crash dump. NativeAOT does a failfast with a message to stderr. It's not great. (It's obviously not visible for OutputType=WinExe, for example.)
Do we have any motivating scenarios that we expect to meaningfully improve? I tried the TechEmpower Json benchmark, but I'm seeing some very confusing results (`crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario json --profile aspnet-citrine-lin --application.framework net7.0 --application.environmentVariables COMPlus_EnableSSE3=0` shows improved RPS with SSE3 disabled compared to the baseline, which is the opposite of what I wanted/expected to measure).
> shows improved RPS with SSE3 disabled compared to the baseline which is the opposite of what I wanted/expected to measure).
That's weird because SSE3 specifically doesn't bring any value (except, maybe, HADD for floats but it's unlikely to be touched in TE benchmarks). For shuffle it's SSSE3 that is interesting because it provides overloads we need.
> shows improved RPS with SSE3 disabled compared to the baseline which is the opposite of what I wanted/expected to measure).
> That's weird because SSE3 specifically doesn't bring any value (except, maybe, HADD for floats but it's unlikely to be touched in TE benchmarks). For shuffle it's SSSE3 that is interesting because it provides overloads we need.
Wait does COMPlus_EnableSSE3=0 only disable SSE3? I thought it works similar to how we do detection in codeman.cpp - not detecting SSE3 means we also consider SSSE3/4/4.2/AVX etc. unavailable. Or do I need to COMPlus_EnableXXX everything one by one to get the measurement I wanted to measure?
> Is related to https://github.com/dotnet/designs/pull/173? Or is this one here more about AOT?
It's similar, but it's about upgrading the baseline for AOT and therefore directly impacts all consumers of .NET.
#173 impacts the default for crossgen, which only results in worse startup performance on older hardware.
> Would this also cover how we compile the native parts of the (non-AOT) runtimes (GC, CoreCLR VM, etc.)?
That would likely be up for debate. MSVC only provides /arch:SSE2 and /arch:AVX/AVX2; there is no equivalent for SSE3-SSE4.2. Clang/GCC do support these intermediate levels, but I don't expect them to be as big a win for native code given the typical dynamic linking and limited explicit vectorization.
> My main concern would be about the user experience for the minority of users that don't meet this requirement - I'd like to avoid the user experience to be STATUS_ILLEGAL_INSTRUCTION with a crashdump. NativeAOT does a failfast with a message to stderr. It's not great. (It's obviously not visible for OutputType=WinExe, for example.)
We have a similar message raised by the VM today as well; it's just unlikely to ever be encountered since it only checks for SSE2.
> Do we have any motivating scenarios that we expect to meaningfully improve? I tried TechEmpower Json benchmark, but I'm seeing some very confusing results (crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario json --profile aspnet-citrine-lin --application.framework net7.0 --application.environmentVariables COMPlus_EnableSSE3=0 shows improved RPS with SSE3 disabled compared to the baseline which is the opposite of what I wanted/expected to measure).
Most codepaths that use Vector128<T>. As @EgorBo called out, there is some pretty "core" functionality only available in these later ISAs.
Notably:
- SSE3 - Floating-Point Alternating Add/Subtract, Horizontal Add, Horizontal Subtract
- SSSE3 - Integer Conditional Negate, Absolute Value, Bytewise Shuffle, Horizontal Add, Horizontal Subtract
- SSE4.1 - Dot Product (FP only), Blend, Round (FP only), Insert, Extract, Test, Integer Min/Max
- SSE4.2 - 64-bit Compare Greater Than
- POPCNT
For SSSE3, the most important is Bytewise Shuffle. Doing arbitrary vector reordering is very expensive otherwise, so this is key for things like reversing endianness or handling edge cases when emulating other functionality.
For SSE4.1, the most important are Blend, Insert, Extract, and Test. Dot Product and Round are both important for many scenarios, including WPF and other image-manipulation scenarios, due to heavy use of floating-point. In the case of Blend, it allows simplifying (a & mask) | (b & ~mask) to a single instruction, effectively giving you a "vectorized ternary select". This is a key part of handling leading/trailing elements or operating only on matched data. Insert/Extract are key for getting data into and out of the vector registers efficiently, and Test is key for efficiently determining (a & b) == 0 or (a & b) != 0, which many paths use to determine whether any match exists and therefore whether more expensive computation has to be done.
Not having these means the codegen for many core algorithms, especially in string/span handling, can be significantly pessimized relative to the newer hardware that a majority of customers are expected to have.
> Wait does COMPlus_EnableSSE3=0 only disable SSE3? I thought it works similar to how we do detection in codeman.cpp - not detecting SSE3 means we also consider SSSE3/4/4.2/AVX etc. unavailable. Or do I need to COMPlus_EnableXXX everything one by one to get the measurement I wanted to measure?
They are hierarchical and so EnableSSE3=0 will also disable SSSE3, SSE4.1, SSE4.2, AVX, AVX2, etc.
Could you provide more concrete numbers and possibly codegen? This sounds unexpected and doesn't match what I've seen in past benchmarking comparisons.
We could certainly get more concrete numbers by running all of dotnet/performance with COMPlus_EnableSSE3=0 and see if anything exceptional pops out as having improved performance.
I'd rather this decision wasn't made solely on microbenchmarks. I have no doubts it helps microbenchmarks. They're good supporting evidence, but something that impacts the users is better as the main evidence. That's why I'm trying TechEmpower (it's an E2E number we care about).
> Could you provide more concrete numbers and possibly codegen? This sounds unexpected and doesn't match what I've seen in past benchmarking comparisons.
I can give you what I did, but not much more than that. Hopefully it's enough to find what I'm doing wrong:
```
dotnet tool install -g Microsoft.Crank.Controller --version "0.2.0-*"
```
And then just run the above crank command with/without `--application.environmentVariables COMPlus_EnableSSE3=0`.
Without the EnableSSE3=0 argument:
| load | |
| ---------------------- | ---------- |
| CPU Usage (%) | 79 |
| Cores usage (%) | 2,219 |
| Working Set (MB) | 38 |
| Private Memory (MB) | 363 |
| Start Time (ms) | 0 |
| First Request (ms) | 74 |
| Requests/sec | 1,214,438 |
| Requests | 18,336,914 |
| Mean latency (ms) | 0.71 |
| Max latency (ms) | 54.51 |
| Bad responses | 0 |
| Socket errors | 0 |
| Read throughput (MB/s) | 169.09 |
| Latency 50th (ms) | 0.38 |
| Latency 75th (ms) | 0.44 |
| Latency 90th (ms) | 0.56 |
| Latency 99th (ms) | 10.11 |
With the EnableSSE3=0 argument:
| load | |
| ---------------------- | ---------- |
| CPU Usage (%) | 80 |
| Cores usage (%) | 2,229 |
| Working Set (MB) | 38 |
| Private Memory (MB) | 363 |
| Start Time (ms) | 0 |
| First Request (ms) | 75 |
| Requests/sec | 1,229,283 |
| Requests | 18,560,989 |
| Mean latency (ms) | 0.61 |
| Max latency (ms) | 32.51 |
| Bad responses | 0 |
| Socket errors | 0 |
| Read throughput (MB/s) | 171.16 |
| Latency 50th (ms) | 0.38 |
| Latency 75th (ms) | 0.44 |
| Latency 90th (ms) | 0.52 |
| Latency 99th (ms) | 8.05 |
> I'd rather this decision wasn't made solely on microbenchmarks. I have no doubts it helps microbenchmarks. They're good supporting evidence, but something that impacts the users is better as the main evidence.
I agree that it shouldn't be made solely based on microbenchmarks. However, we also know well how frequently span/string APIs are used and the cost of branches in hot loops, even predicted branches. We likewise know the importance of these operations in scenarios like ML, Image Processing, Games, etc. I imagine coming up with real world benchmarks showing improvements won't be difficult.
With that being said, we really should not restrict ourselves to a 20-year-old baseline regardless. Such hardware is all officially out of support and discontinued by the respective hardware manufacturers. Holding out for such a small minority of hardware that likely isn't even running a supported OS is ultimately pretty silly (and I expect such users aren't likely to be using new versions of .NET anyway). At some point, we need the freedom/flexibility to tell users that newer versions of .NET won't support hardware that old (at the very least "officially"; Jan's suggestion of leaving the support in but making it community supported is reasonable, as would be simply making it not the default).
> I can give you what I did, but not much more than that. Hopefully it's enough to find what I'm doing wrong:
Does this work on Windows, or is it Linux only? Do you also need crank-agent like the docs I found suggest?
On Windows, I see
```
The specified endpoint url 'http://asp-citrine-lin:5001' for 'application' is invalid or not responsive: "No such host is known. (asp-citrine-lin:5001)"
```
Likewise, how much variance is there run-to-run (that is, across separate attempts to profile using the same command line)?
How much of this is R2R compiled (doing EnableSSE3=0 will throw out the corelib CG2 images/etc)?
Is this accounting for rejit/tiered compilation costs?
> We likewise know the importance of these operations in scenarios like ML, Image Processing, Games, etc. I imagine coming up with real world benchmarks showing improvements won't be difficult.
I'd like us to have such an E2E number - we're discussing making .NET-produced executables FailFast on 1 out of 100 machines in the wild by default. "string.IndexOf is a lot faster" is a less convincing argument that this is the right choice than "X RPS improvement in web scenario Y" or "X fps improvement in game Y". That's the argument we'll give whenever someone complains about this choice. (For me the important bit is that this is a requirement for where the code runs, not a requirement for the .NET developer's machine - the .NET developers likely won't even know about this hardware floor until they hear from their users.)
> Does this work on Windows, or is it Linux only? Do you also need crank-agent like the docs I found suggest?
I run it on Windows. You need VPN on because asp-citrine-lin is a corpnet machine. AFAIK crank-agent is needed on the machine where you run the test (which is asp-citrine-lin in this case, so no need to worry about it).
> Likewise, how much variance is there here run-to-run (that is against separate attempts to profile using the same command line)?
I made 2 runs each and there was some noise, but the difference in the two runs looked conclusive. I think crank does a warmup, but it's really an ASP.NET team tool that I don't have much experience with (only to the extent that we track it and it's part of our release criteria, and therefore looks relevant).
> I'd like us to have such E2E number - we're discussing making .NET-produced executables FailFast on 1 out of 100 machines in the wild by default - "string.IndexOf is a lot faster" is less convincing argument that it's the right choice than "X RPS improvement in web scenario Y", "X fps improvement in game Y", etc. That's the argument we'll give whenever someone complains about this choice (for me the important bit is that this is a requirement for where the code runs, not a requirement for the .NET developers machine - the .NET developers likely won't even know about this hardware floor until they hear from their user).
I expect it's much less than this in practice, especially when taking into account Enterprise/cloud hardware, the users likely to be running a supported OS (of the officially supported OSes, only Linux supports hardware this old), and the users likely to be running the latest versions of .NET.
It's worth noting I opened https://github.com/dotnet/sdk/issues/28055 so that we can, longer term, get more definitive information on this and other important hardware characteristics.
> I run it on Windows. You need VPN on because asp-citrine-lin is a corpnet machine. AFAIK crank-agent is needed on the machine where you run the test (which is asp-citrine-lin in this case, so no need to worry about it).
👍. The below is the median of 5 results for each. I didn't notice any obvious outliers.
I notably ran both JSON and Plaintext to get two different comparisons. There is a clear difference when SIMD is disabled entirely, and a small but measurable difference between -v1 and -v2, with -v2 winning.
The default (which should be -v3 assuming these machines have AVX2 support) tends to be a bit slower than -v2 and this is likely because the payloads aren't large enough for Vector256<T> to benefit. Instead, the larger size coupled with the additional checks causes a small regression.
Json
.NET 7 - Default
| load | |
|---|---|
| CPU Usage (%) | 66 |
| Cores usage (%) | 1,847 |
| Working Set (MB) | 38 |
| Private Memory (MB) | 358 |
| Start Time (ms) | 0 |
| First Request (ms) | 115 |
| Requests/sec | 985,938 |
| Requests | 14,886,941 |
| Mean latency (ms) | 0.42 |
| Max latency (ms) | 51.21 |
| Bad responses | 0 |
| Socket errors | 0 |
| Read throughput (MB/s) | 142.92 |
| Latency 50th (ms) | 0.24 |
| Latency 75th (ms) | 0.27 |
| Latency 90th (ms) | 0.32 |
| Latency 99th (ms) | 7.16 |
.NET 7 - EnableAVX=0 (effectively target x86-64-v2)
| load | |
|---|---|
| CPU Usage (%) | 66 |
| Cores usage (%) | 1,847 |
| Working Set (MB) | 38 |
| Private Memory (MB) | 358 |
| Start Time (ms) | 0 |
| First Request (ms) | 113 |
| Requests/sec | 990,885 |
| Requests | 14,962,330 |
| Mean latency (ms) | 0.46 |
| Max latency (ms) | 39.10 |
| Bad responses | 0 |
| Socket errors | 0 |
| Read throughput (MB/s) | 143.64 |
| Latency 50th (ms) | 0.23 |
| Latency 75th (ms) | 0.27 |
| Latency 90th (ms) | 0.32 |
| Latency 99th (ms) | 8.18 |
.NET 7 - EnableSSE3=0 (effectively target x86-64-v1)
| load | |
|---|---|
| CPU Usage (%) | 66 |
| Cores usage (%) | 1,835 |
| Working Set (MB) | 38 |
| Private Memory (MB) | 358 |
| Start Time (ms) | 0 |
| First Request (ms) | 116 |
| Requests/sec | 980,136 |
| Requests | 14,799,763 |
| Mean latency (ms) | 0.49 |
| Max latency (ms) | 41.72 |
| Bad responses | 0 |
| Socket errors | 0 |
| Read throughput (MB/s) | 142.08 |
| Latency 50th (ms) | 0.24 |
| Latency 75th (ms) | 0.27 |
| Latency 90th (ms) | 0.32 |
| Latency 99th (ms) | 8.72 |
.NET 7 - EnableHWIntrinsic=0 (effectively disable SIMD)
| load | |
|---|---|
| CPU Usage (%) | 64 |
| Cores usage (%) | 1,783 |
| Working Set (MB) | 38 |
| Private Memory (MB) | 358 |
| Start Time (ms) | 0 |
| First Request (ms) | 211 |
| Requests/sec | 944,005 |
| Requests | 14,253,837 |
| Mean latency (ms) | 0.50 |
| Max latency (ms) | 48.26 |
| Bad responses | 0 |
| Socket errors | 0 |
| Read throughput (MB/s) | 136.84 |
| Latency 50th (ms) | 0.25 |
| Latency 75th (ms) | 0.29 |
| Latency 90th (ms) | 0.34 |
| Latency 99th (ms) | 9.27 |
Plaintext
.NET 7 - Default
| load | |
|---|---|
| CPU Usage (%) | 44 |
| Cores usage (%) | 1,221 |
| Working Set (MB) | 38 |
| Private Memory (MB) | 358 |
| Start Time (ms) | 0 |
| First Request (ms) | 90 |
| Requests/sec | 4,625,118 |
| Requests | 69,838,073 |
| Mean latency (ms) | 0.60 |
| Max latency (ms) | 29.17 |
| Bad responses | 0 |
| Socket errors | 0 |
| Read throughput (MB/s) | 582.23 |
| Latency 50th (ms) | 0.52 |
| Latency 75th (ms) | 0.76 |
| Latency 90th (ms) | 1.05 |
| Latency 99th (ms) | 0.00 |
.NET 7 - EnableAVX=0 (effectively target x86-64-v2)
| load | |
|---|---|
| CPU Usage (%) | 44 |
| Cores usage (%) | 1,232 |
| Working Set (MB) | 38 |
| Private Memory (MB) | 358 |
| Start Time (ms) | 0 |
| First Request (ms) | 93 |
| Requests/sec | 4,679,347 |
| Requests | 70,655,667 |
| Mean latency (ms) | 0.58 |
| Max latency (ms) | 35.71 |
| Bad responses | 0 |
| Socket errors | 0 |
| Read throughput (MB/s) | 589.06 |
| Latency 50th (ms) | 0.51 |
| Latency 75th (ms) | 0.75 |
| Latency 90th (ms) | 1.03 |
| Latency 99th (ms) | 0.00 |
.NET 7 - EnableSSE3=0 (effectively target x86-64-v1)
| load | |
|---|---|
| CPU Usage (%) | 44 |
| Cores usage (%) | 1,225 |
| Working Set (MB) | 38 |
| Private Memory (MB) | 358 |
| Start Time (ms) | 0 |
| First Request (ms) | 91 |
| Requests/sec | 4,635,911 |
| Requests | 69,999,632 |
| Mean latency (ms) | 0.59 |
| Max latency (ms) | 32.99 |
| Bad responses | 0 |
| Socket errors | 0 |
| Read throughput (MB/s) | 583.59 |
| Latency 50th (ms) | 0.53 |
| Latency 75th (ms) | 0.76 |
| Latency 90th (ms) | 1.09 |
| Latency 99th (ms) | 0.00 |
.NET 7 - EnableHWIntrinsic=0 (effectively disable SIMD)
| load | |
|---|---|
| CPU Usage (%) | 42 |
| Cores usage (%) | 1,178 |
| Working Set (MB) | 38 |
| Private Memory (MB) | 358 |
| Start Time (ms) | 0 |
| First Request (ms) | 158 |
| Requests/sec | 4,370,389 |
| Requests | 65,991,281 |
| Mean latency (ms) | 0.63 |
| Max latency (ms) | 32.96 |
| Bad responses | 0 |
| Socket errors | 0 |
| Read throughput (MB/s) | 550.17 |
| Latency 50th (ms) | 0.55 |
| Latency 75th (ms) | 0.80 |
| Latency 90th (ms) | 1.10 |
| Latency 99th (ms) | 0.00 |
I'm actually trying to do exactly this kind of real-world codebase benchmarking, to see if the .NET 7 performance benefits touted in the blog posts make a measurable difference in some performance-sensitive code. Unfortunately, I've been stymied by the inability to actually get anything to run in .NET 7. Any help would be welcome, and I promise to report back with relevant numbers once I have some to share.
> It's worth noting I opened https://github.com/dotnet/sdk/issues/28055 so that we can, longer term, get more definitive information on this and other important hardware characteristics.
I'm not sure if that one would help - it's the hardware the .NET developers use, not hardware where .NET code runs. Developers are more likely to skew towards the latest and greatest. Users are the "secretary's machine" and "school computer". Windows org is more likely to have the kind of telemetry.
> The below is the median of 5 results for each. I didn't notice any obvious outliers.
What command line arguments did you use for crank? The numbers for JSON are all a bit lower than I would expect (compare with mine above).
> What command line arguments did you use for crank? The numbers for JSON are all a bit lower than I would expect (compare with mine above).
Ah, you know what, I ran json.benchmarks.yml and plaintext.benchmarks.yml rather than platform.benchmarks.yml. That might've had something to do with it.
> I'm not sure if that one would help - it's the hardware the .NET developers use, not hardware where .NET code runs. Developers are more likely to skew towards the latest and greatest. Users are the "secretary's machine" and "school computer". Windows org is more likely to have the kind of telemetry.
Which, again, doesn't really matter when you consider that most operating systems don't support hardware that old.
In the case of macOS, it looks to be impossible for any OS we currently support to be running on pre-AVX2 hardware.
In the case of Windows, 8.1 is the oldest client SKU we still support. For 8.1, Windows itself updated the baseline CPU required for x64 (it must have CMPXCHG16B and LAHF/SAHF). Various articles quote a comment stating "the number of affected processors are extremely small since this instruction has been supported for greater than 10 years." For 7, it's only supported with an ESU subscription, in which case other factors like the Windows Processor Requirements list come into play, and those are all post-v3 processors. -- Even stricter requirements/expectations exist for Server.
Linux is really the only interesting case, where the kernel still officially supports running on an 80386 (older than we support) and where many distros intentionally keep their specs "low". This is also a case where many recommend using alternative GUIs or specialized distro builds to help such low-spec computers. Ubuntu's docs go so far as to describe 10- and 15-year-old systems and the scenarios that will likely prevent their usage in a default configuration. The biggest is typically that they don't support, and have no way of supporting, an SSD.
In short, such hardware is simply too old to be meaningful and given our official OS support matrix, is already unlikely to have a good experience with the latest versions of .NET.