Improve Kestrel connection metrics with error reasons
Fixes https://github.com/dotnet/aspnetcore/issues/53358
This PR improves the kestrel.connection.duration metric by adding information about why the connection ended.
Tags added or changed:
- `http.connection.protocol_code` - Standards-based, publicly communicated code for the connection ending. Often sent to the client. The value is either the HTTP/2 error code or HTTP/3 error code that the connection ended with. Doesn't apply to HTTP/1.1.
- `kestrel.connection.end_reason` - A more detailed internal reason for why a connection ended. Always present.
- `error.type` - Present only if the end reason is an error, e.g. connection killed because of a timeout, excessive stream resets, transport ended while there are active HTTP requests, invalid frame, etc. `error.type` could be the same as the end reason. The value could also be an exception type name if an unhandled exception is thrown from the connection middleware pipeline. The first error recorded wins.
We need to decide what the new tag represents. Is it the end reason or the error reason? For example, a connection can be ended by the client sending a GOAWAY frame. Do we want to track that? It's not an error to close a connection like that, so in that case we're tracking all end reasons. On the other hand, we might want to track only unexpected and unhealthy reasons that end a connection. The benefit of only tracking errors is that it's very easy to get just the unhealthy connections by filtering metrics to ones that have this tag. However, I think we can achieve the same thing by also putting error reasons into the `error.type` tag.
tldr: `kestrel.connection.end_reason` is always present and provides the reason; `error.type` is present if the end reason is considered an error and has the same value as `kestrel.connection.end_reason`.
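To make the tag semantics concrete, here's a minimal sketch (not part of this change) of how a consumer could watch the new tags with a plain `MeterListener`. It assumes the existing `Microsoft.AspNetCore.Server.Kestrel` meter and `kestrel.connection.duration` instrument, and uses the tag names proposed above, which may still change:

```csharp
using System;
using System.Diagnostics.Metrics;

var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // kestrel.connection.duration is published by the Kestrel meter.
    if (instrument.Meter.Name == "Microsoft.AspNetCore.Server.Kestrel" &&
        instrument.Name == "kestrel.connection.duration")
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<double>((instrument, durationSeconds, tags, state) =>
{
    string? endReason = null;
    string? errorType = null;
    foreach (var tag in tags)
    {
        // Tag names here are the ones proposed in this PR.
        if (tag.Key == "kestrel.connection.end_reason") endReason = tag.Value?.ToString();
        else if (tag.Key == "error.type") errorType = tag.Value?.ToString();
    }

    Console.WriteLine($"duration={durationSeconds:F3}s end_reason={endReason ?? "(none)"} error.type={errorType ?? "(none)"}");
});
listener.Start();
```

Filtering on the presence of `error.type` then gives just the unhealthy connections, which is the same query that an error-only tag would enable.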
Questions for folks:
- Is having information about all connection end reasons useful? If it isn't, then we could remove `kestrel.connection.end_reason` and just have `error.type`.
- What end reasons are an error? For example, are the following errors or not...
- Transport closing when there are no active HTTP requests
- Transport closing when there are active HTTP requests
- Connection closing because the application is shutting down
- Client sends GOAWAY with error code of NO_ERROR
- Client sends GOAWAY with error code of INTERNAL_ERROR
Tasks:
- [x] Add to HTTP/1.1
- [x] Add to HTTP/2
- [x] Add to HTTP/3
- [x] Add to HTTPS middleware
- [x] Tests
- [ ] Bike shed tag and enum names
- [ ] Review reasons are correct
- [ ] Validate metrics output useful data in some known situations
cc @noahfalk @tarekgh @lmolkova @samsp-msft
Is having information about all connection end reasons useful?
What was the motivation to add the data initially? I can hypothesize a reason someone might find it useful (diagnosing why connections didn't live as long as they were expected to live) but I don't know how often someone needs to diagnose that issue in practice and what alternatives would they have available if this metric attribute didn't exist?
What end reasons are an error? For example, are the following errors or not..
Any attempt at classification gets very subjective but a few suggestions at least:
- Ideally errors represent situations that do not occur frequently and are undesirable. For example "Connection closing because the application is shutting down" might be infrequent but it doesn't sound undesirable.
- Ideally metric error classifications would be consistent with other observable behavior that implies an error classification. For example, if in-proc code throws an Exception or some error code is transmitted in the wire protocol, then it would be consistent for the metrics to also classify that condition as an error. Likewise, if "Client sends GOAWAY with error code of NO_ERROR" isn't signaled as an error anywhere else, then it would be consistent for metrics also not to treat that as an error.
I think it's valuable to know the end reason even if it's not an error and even if we don't have a specific scenario in mind. I also don't see any drawbacks from including it - it would not increase cardinality much.
The change looks great from a semconv perspective! LMK if you need any help adding new attributes to otel.
This seems like data that should be going into logs, not metrics. Metric dimensions should be kept small, as adding additional properties and values adds incremental storage. We could reduce this data down to one property - was the connection closed cleanly or due to an error - and provide logs to give more details as to why. Metrics are supposed to be about aggregated statistics, not detailed analysis as to what happened.
Anyone who is going to be doing analysis can then use the metric to determine if the trend is changing, and then go to a log query to understand why.
Once exemplars are better supported in .NET/OTel they will provide the correlation between metrics and traces/logs.
@samsp-msft
There is prior art for showing more error info in `error.type`. That tag is already on HTTP requests for server and client in spans and metrics. Reducing error info down to a boolean is optimizing too much and goes against conventions.
`http.connection.protocol_code` (exact name TBD) is the HTTP connection version of a request's HTTP status code. I don't think it is controversial.
`kestrel.connection.end_reason` is more debatable. My feeling from talking to people using Kestrel over the years is that connections are a black box and people want more insight here. If there are performance/technical reasons why an extra tag is problematic, then I can reach out to ops folks to get feedback on whether end reasons for "non-error" situations would be useful.
Note that while the `ConnectionEndReason` enum has a lot of potential values (about 35), many of them are for protocol violations, which will almost never happen, or are specific to one particular HTTP protocol (e.g. closing a critical stream is specific to HTTP/3). There are about half a dozen that would occur in typical usage:
- ConnectionReset
- UnexpectedError
- AbortedByApplication
- ApplicationShutdown
- ServerTimeout
- IOError
- ClientGoAway
- TransportCompleted
That is similar to how there are many HTTP status codes, but most of them you'll never see.
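For reference, a rough sketch of the enum shape implied by that list (member names are the ones called out above; the actual `ConnectionEndReason` enum in this PR has around 35 members and may differ in naming and ordering):

```csharp
// Illustrative only: the common connection end reasons named in this thread.
// The real enum also contains many protocol-violation reasons (invalid frames,
// HTTP/3 critical stream closed, etc.) that should almost never occur.
public enum ConnectionEndReason
{
    ConnectionReset,
    UnexpectedError,
    AbortedByApplication,
    ApplicationShutdown,
    ServerTimeout,
    IOError,
    ClientGoAway,
    TransportCompleted,
    // ... remaining protocol-specific and protocol-violation reasons
}
```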
Is having information about all connection end reasons useful? If it isn't then we could remove kestrel.connection.end_reason and just have error.type.
With the notable disclaimer that I haven't actually monitored the health of a real service...
I think I'd want to have it for all connections. Sure, the majority will be graceful shutdowns (hopefully), but it's (I assume) easy enough to filter out a few uninteresting values and which values are uninteresting seems (as mentioned above) very subjective.
Another reason to include things that don't seem like errors is for making attack signatures. Something that's "normal" for a single connection may be problematic when it happens on many connections simultaneously. Or it may not be problematic, but still indicate that a change that signals a problem will soon occur.
Personally, I think I'd have just `end_reason` and drop `error.type` altogether (with the caveat that we might want a special way to represent unhandled exceptions).
If we go the other way, we have all sorts of strange cases to consider like whether cutting off a slow-reading client is an error. Arguably, that's the system functioning as expected and desired, but it's still interesting.
This seems like data that should be going into logs not metrics
@samsp-msft I'm not sure I'm following this line of reasoning. Given that logs are structured and queryable, why would anything not be logs-only? When would you use metrics instead?
/benchmark plaintext aspnet-citrine-win kestrel
Benchmark started for plaintext on aspnet-citrine-win with kestrel. Logs: link
An error occurred, please check the logs
/benchmark plaintext aspnet-citrine-win kestrel
Benchmark started for plaintext on aspnet-citrine-win with kestrel. Logs: link
plaintext - aspnet-citrine-win
application | plaintext.base | plaintext.pr | |
---|---|---|---|
Max CPU Usage (%) | 78 | 73 | -6.41% |
Max Cores usage (%) | 2,170 | 2,038 | -6.08% |
Max Working Set (MB) | 73 | 72 | -1.37% |
Max Private Memory (MB) | 55 | 54 | -1.82% |
Build Time (ms) | 3,715 | 3,842 | +3.42% |
Start Time (ms) | 335 | 321 | -4.18% |
Published Size (KB) | 107,273 | 107,273 | 0.00% |
Symbols Size (KB) | 54 | 54 | 0.00% |
.NET Core SDK Version | 9.0.100-preview.7.24369.5 | 9.0.100-preview.7.24369.5 | |
Max CPU Usage (%) | 79 | 71 | -10.61% |
Max Working Set (MB) | 76 | 75 | -0.49% |
Max GC Heap Size (MB) | 24 | 23 | -1.67% |
Size of committed memory by the GC (MB) | 18 | 15 | -19.73% |
Max Number of Gen 0 GCs / sec | 2.00 | 2.00 | 0.00% |
Max Number of Gen 1 GCs / sec | 1.00 | 1.00 | 0.00% |
Max Number of Gen 2 GCs / sec | 1.00 | 1.00 | 0.00% |
Max Gen 0 GC Budget (MB) | 19 | 20 | +5.26% |
Max Time in GC (%) | 31.00 | 7.00 | -77.42% |
Max Gen 0 Size (B) | 3,756,032 | 2,936,944 | -21.81% |
Max Gen 1 Size (B) | 420,464 | 230,912 | -45.08% |
Max Gen 2 Size (B) | 1,724,936 | 1,756,264 | +1.82% |
Max LOH Size (B) | 121,000 | 121,000 | 0.00% |
Max POH Size (B) | 9,632,368 | 6,954,368 | -27.80% |
Max Allocation Rate (B/sec) | 8,558,736 | 8,120,736 | -5.12% |
Max GC Heap Fragmentation (%) | 742% | 727% | -2.05% |
# of Assemblies Loaded | 66 | 66 | 0.00% |
Max Exceptions (#/s) | 1,718 | 1,758 | +2.33% |
Max Lock Contention (#/s) | 137 | 150 | +9.49% |
Max ThreadPool Threads Count | 43 | 46 | +6.98% |
Max ThreadPool Queue Length | 446 | 469 | +5.16% |
Max ThreadPool Items (#/s) | 1,884,981 | 1,897,962 | +0.69% |
Max Active Timers | 1 | 1 | 0.00% |
IL Jitted (B) | 223,730 | 223,573 | -0.07% |
Methods Jitted | 2,965 | 2,975 | +0.34% |
load | plaintext.base | plaintext.pr | |
---|---|---|---|
Max CPU Usage (%) | 81 | 80 | -1.23% |
Max Cores usage (%) | 2,277 | 2,254 | -1.01% |
Max Working Set (MB) | 46 | 46 | 0.00% |
Max Private Memory (MB) | 370 | 370 | 0.00% |
Build Time (ms) | 3,465 | 3,559 | +2.71% |
Start Time (ms) | 0 | 0 | |
Published Size (KB) | 72,283 | 72,283 | 0.00% |
Symbols Size (KB) | 0 | 0 | |
.NET Core SDK Version | 8.0.303 | 8.0.303 | |
First Request (ms) | 91 | 92 | +1.10% |
Requests/sec | 9,955,598 | 9,999,865 | +0.44% |
Requests | 150,316,336 | 150,997,558 | +0.45% |
Mean latency (ms) | 34.77 | 49.03 | +41.01% |
Max latency (ms) | 720.32 | 1,250.00 | +73.53% |
Bad responses | 0 | 0 | |
Socket errors | 0 | 0 | |
Read throughput (MB/s) | 1,198.08 | 1,198.08 | 0.00% |
Latency 50th (ms) | 0.89 | 0.89 | 0.00% |
Latency 75th (ms) | 29.18 | 41.63 | +42.67% |
Latency 90th (ms) | 119.44 | 171.25 | +43.38% |
Latency 99th (ms) | 387.50 | 537.67 | +38.75% |
/benchmark
Crank Pull Request Bot
/benchmark <benchmark[,...]> <profile[,...]> <component,[...]> <arguments>
Benchmarks:
- `plaintext`: TechEmpower Plaintext Scenario - ASP.NET Platform implementation
- `plaintext_endpoint`: TechEmpower Plaintext Scenario - ASP.NET Endpoint Routing
- `json`: TechEmpower JSON Scenario - ASP.NET Platform implementation
- `fortunes`: TechEmpower Fortunes Scenario - ASP.NET Platform implementation
- `yarp`: YARP - http-http with 10 bytes
- `mvcjsoninput2k`: Sends 2Kb Json Body to an MVC controller
Profiles:
- `aspnet-perf-lin`: INTEL/Linux 12 Cores
- `aspnet-perf-win`: INTEL/Windows 12 Cores
- `aspnet-citrine-lin`: INTEL/Linux 28 Cores
- `aspnet-citrine-win`: INTEL/Windows 28 Cores
- `aspnet-citrine-ampere`: Ampere/Linux 80 Cores
- `aspnet-citrine-amd`: AMD/Linux 48 Cores
Components:
- `kestrel`
- `mvc`
- `routing`
Arguments: any additional arguments to pass through to crank, e.g. --variable name=value
/benchmark plaintext aspnet-citrine-win kestrel
Benchmark started for plaintext on aspnet-citrine-win with kestrel. Logs: link
plaintext - aspnet-citrine-win
application | plaintext.base | plaintext.pr | |
---|---|---|---|
Max CPU Usage (%) | 68 | 81 | +19.12% |
Max Cores usage (%) | 1,903 | 2,266 | +19.08% |
Max Working Set (MB) | 72 | 72 | 0.00% |
Max Private Memory (MB) | 54 | 54 | 0.00% |
Build Time (ms) | 3,279 | 3,293 | +0.43% |
Start Time (ms) | 319 | 320 | +0.31% |
Published Size (KB) | 107,273 | 107,273 | 0.00% |
Symbols Size (KB) | 54 | 54 | 0.00% |
.NET Core SDK Version | 9.0.100-preview.7.24371.4 | 9.0.100-preview.7.24371.4 | |
Max CPU Usage (%) | 68 | 96 | +41.29% |
Max Working Set (MB) | 75 | 75 | -0.02% |
Max GC Heap Size (MB) | 23 | 23 | -0.51% |
Size of committed memory by the GC (MB) | 13 | 13 | -5.83% |
Max Number of Gen 0 GCs / sec | 3.00 | 3.00 | 0.00% |
Max Number of Gen 1 GCs / sec | 2.00 | 2.00 | 0.00% |
Max Number of Gen 2 GCs / sec | 2.00 | 2.00 | 0.00% |
Max Gen 0 GC Budget (MB) | 17 | 18 | +5.88% |
Max Time in GC (%) | 13.00 | 10.00 | -23.08% |
Max Gen 0 Size (B) | 2,109,408 | 1,555,000 | -26.28% |
Max Gen 1 Size (B) | 592,024 | 652,128 | +10.15% |
Max Gen 2 Size (B) | 1,543,856 | 1,583,944 | +2.60% |
Max LOH Size (B) | 121,000 | 121,000 | 0.00% |
Max POH Size (B) | 6,480,568 | 6,299,288 | -2.80% |
Max Allocation Rate (B/sec) | 12,024,752 | 9,136,040 | -24.02% |
Max GC Heap Fragmentation (%) | 641% | 675% | +5.40% |
# of Assemblies Loaded | 66 | 66 | 0.00% |
Max Exceptions (#/s) | 1,636 | 1,622 | -0.86% |
Max Lock Contention (#/s) | 97 | 237 | +144.33% |
Max ThreadPool Threads Count | 48 | 48 | 0.00% |
Max ThreadPool Queue Length | 390 | 303 | -22.31% |
Max ThreadPool Items (#/s) | 1,868,889 | 1,917,623 | +2.61% |
Max Active Timers | 1 | 1 | 0.00% |
IL Jitted (B) | 230,992 | 230,352 | -0.28% |
Methods Jitted | 3,056 | 3,060 | +0.13% |
load | plaintext.base | plaintext.pr | |
---|---|---|---|
Max CPU Usage (%) | 82 | 81 | -1.22% |
Max Cores usage (%) | 2,288 | 2,265 | -1.01% |
Max Working Set (MB) | 46 | 46 | 0.00% |
Max Private Memory (MB) | 370 | 370 | 0.00% |
Build Time (ms) | 3,477 | 3,353 | -3.57% |
Start Time (ms) | 0 | 0 | |
Published Size (KB) | 72,283 | 72,283 | 0.00% |
Symbols Size (KB) | 0 | 0 | |
.NET Core SDK Version | 8.0.303 | 8.0.303 | |
First Request (ms) | 92 | 92 | 0.00% |
Requests/sec | 9,798,015 | 9,894,495 | +0.98% |
Requests | 147,952,480 | 149,398,880 | +0.98% |
Mean latency (ms) | 22.60 | 65.35 | +189.16% |
Max latency (ms) | 914.55 | 2,000.00 | +118.69% |
Bad responses | 0 | 0 | |
Socket errors | 0 | 94 | +∞% |
Read throughput (MB/s) | 1,177.60 | 1,187.84 | +0.87% |
Latency 50th (ms) | 0.92 | 0.88 | -4.35% |
Latency 75th (ms) | 12.34 | 33.75 | +173.50% |
Latency 90th (ms) | 73.03 | 171.88 | +135.36% |
Latency 99th (ms) | 288.06 | 1,320.00 | +358.24% |