Reactor Netty in a Spring Boot WebFlux application generates metrics with CancelledServerWebExchangeException exception and UNKNOWN outcome when there are no issues
We have a bunch of Spring Boot WebFlux services in our project and almost all of them have this same issue. We use Prometheus for metrics and track the success of requests. However, in those services from 1% to 20% of the http_server_requests metrics consist of outcome=UNKNOWN with exception=CancelledServerWebExchangeException, while there are no other indications of any issues in server responses, nor any indication that clients cancel that many requests. Examples:
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",platform="UNKNOWN",status="401",uri="UNKNOWN",} 87.0
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",platform="UNKNOWN",status="200",uri="UNKNOWN",} 110.0
I successfully reproduced this locally with a basic WebFlux application template and a single controller, bombarding it with https://httpd.apache.org/docs/2.4/programs/ab.html : ab -n 15000 -c 50 http://localhost:8080/v1/hello
I tried substituting Tomcat for Netty, and these metrics no longer appeared.
While it doesn't seem to cause direct issues for services running in production, it still interferes with the correctness of our metrics and alerts. We could ignore all the UNKNOWN outcomes, but then we can't know whether those UNKNOWN results come from actual server/client cancellations or just from this Netty issue.
Someone already had this issue in the past but it was never resolved: https://stackoverflow.com/questions/69913027/webflux-cancelledserverwebexchangeexception-appears-in-metrics-for-seemingly-no
Versions used: Spring Boot 2.7.2 and 2.6.2, Kotlin 1.7.10, JVM 17
I created example repo to reproduce this issue: https://github.com/MBalciunas/spring-webflux-netty-metrics-issue
This looks like a duplicate of #29599. What would you like the outcome of this issue to be?
It's not a duplicate of #29599, as that issue is concerned with the UNKNOWN status tag. In this ticket I'm raising the issue that outcome=UNKNOWN with exception=CancelledServerWebExchangeException seems to be generated in the metrics for no apparent reason. As shown in the examples in the description, this only happens with Netty and not Tomcat. So the outcome I'd like is either no more UNKNOWN metrics when nothing is actually wrong, or an answer as to whether this is correct behaviour, and if it is, why it happens when it seems it shouldn't.
CancelledServerWebExchangeException doesn't happen with Tomcat, because Servlet containers are not notified when a client closes the connection. With Servlet containers, you only get an error if the server fails to write to the client. This explains why you're seeing this metric with Reactor Netty and not Tomcat.
This metric is produced when a client closes the connection prematurely. From a Framework perspective, we have no way to know whether the client went away because the response took too long and it cancelled the call, or whether an intermediary network device is at fault. This is why we're using the outcome=UNKNOWN tag for this. From this perspective, we're emitting data for an event you might be interested in. The alternative would be to leave those requests out of the metrics entirely, but some would consider that a bug if they're tracking down an actual problem and there is no data for those requests.
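To illustrate how that signal reaches the metrics layer, here is a simplified sketch of a WebFlux filter observing a client disconnect as a Reactor cancellation; this is not Spring Boot's actual MetricsWebFilter, and the class name and metric name are purely illustrative:

import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;

// Simplified sketch, not Spring Boot's real implementation: when the client goes
// away, Reactor Netty cancels the reactive pipeline, and the only thing a filter
// can observe is the cancellation itself -- hence the generic outcome/exception tags.
public class CancellationCountingFilter implements WebFilter {

    private final MeterRegistry registry;

    public CancellationCountingFilter(MeterRegistry registry) {
        this.registry = registry;
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        return chain.filter(exchange)
                // Reached when the connection is closed before the exchange
                // completes; at this point we cannot tell why the client left.
                .doOnCancel(() -> registry.counter("http.server.cancellations",
                        "uri", exchange.getRequest().getPath().value()).increment());
    }
}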
I'm happy to improve the situation. Any suggestions?
Note: if I remember correctly, Apache Bench opens and closes a connection for each request (arguably, a strange approach for benchmarks). Benchmarking the TCP layer like this locally might explain why some connections are closed before the exchange is complete, from the server's perspective.
I just tried the example that @MBalciunas provided, and for me it's the same. ab reports that all 15k requests were successful, but the Spring Actuator endpoint shows 454 failed requests with exception CancelledServerWebExchangeException.
ab -n 15000 -c 50 -H "Authorization:Basic cHJvbWV0aGV1czpwcm9tZXRoZXVz" http://localhost:8080/v1/hello
This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 1500 requests
Completed 3000 requests
Completed 4500 requests
Completed 6000 requests
Completed 7500 requests
Completed 9000 requests
Completed 10500 requests
Completed 12000 requests
Completed 13500 requests
Completed 15000 requests
Finished 15000 requests
Server Software:
Server Hostname: localhost
Server Port: 8080
Document Path: /v1/hello
Document Length: 5 bytes
Concurrency Level: 50
Time taken for tests: 127.981 seconds
Complete requests: 15000
Failed requests: 0
Total transferred: 4440000 bytes
HTML transferred: 75000 bytes
Requests per second: 117.20 [#/sec] (mean)
Time per request: 426.604 [ms] (mean)
Time per request: 8.532 [ms] (mean, across all concurrent requests)
Transfer rate: 33.88 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 1.1 0 27
Processing: 73 425 112.1 430 880
Waiting: 73 425 112.1 430 880
Total: 73 426 112.1 430 880
Percentage of the requests served within a certain time (ms)
50% 430
66% 468
75% 494
80% 511
90% 558
95% 603
98% 658
99% 694
100% 880 (longest request)
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",status="200",uri="/v1/hello",} 454.0
http_server_requests_seconds_sum{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",status="200",uri="/v1/hello",} 197.189522582
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 14651.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 6204.099123688
It also seems weird that the sum of the metrics is bigger than the request count.
requests: 15000.
SUCCESS + UNKNOWN outcomes sum: 454.0 + 14651.0 = 15105.
I'd like to focus on the issue being discussed here: in the case of connections closed prematurely by clients, how is the current behavior incorrect? What would be the expected behavior?
Note that apache bench is not highly regarded as a benchmarking tool in general, and in this case it's not even using the keep-alive feature.
I tried with another benchmarking tool, wrk.
wrk -t12 -c100 -d15s -H "Authorization: Basic cHJvbWV0aGV1czpwcm9tZXRoZXVz" http://localhost:8080/v1/hello
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",status="200",uri="UNKNOWN",} 95.0
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 1455.0
12 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 934.85ms 182.49ms 1.96s 75.72%
Req/Sec 10.41 7.29 50.00 69.66%
1491 requests in 15.03s, 430.99KB read
Requests/sec: 99.23
Transfer/sec: 28.68KB
Still getting a considerable number of UNKNOWN responses even though there are no errors or indications of clients closing connections.
If connections really were being closed prematurely, behaviour like this would be fine. But we are not seeing indications of anywhere near the number of connection cancellations the metric suggests, which is why those metrics don't seem to reflect reality.
So you don't think there's a possibility that some bug/issue arises on the WebFlux/Netty metrics side?
With your latest tests, is it still the case that this behavior cannot be reproduced without Spring Security?
Without Spring Security:
Running 30s test @ http://localhost:8080/v1/hello
12 threads and 12 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.06ms 6.74ms 142.67ms 97.99%
Req/Sec 3.65k 0.98k 9.92k 82.22%
1303702 requests in 30.02s, 103.19MB read
Requests/sec: 43432.94
Transfer/sec: 3.44MB
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 1303702.0
And no UNKNOWN outcomes.
@MBalciunas Thanks - I'm reaching out to the Security team internally. To me it's somewhat expected to have additional latency there, as credential checks are expected to take time/CPU resources. But I'm not familiar with the cancellation behavior in this case. I'll report back here as soon as I know more.
After reaching out to the Security team and testing things further, here's an update.
A colleague from the security team couldn't reproduce the issue on their (more powerful?) machine. I don't think this is really related to Spring Security anyway. I think the behavior is easier to reproduce when security is involved, because request processing takes longer as we're creating new sessions all over the place.
I've carefully reproduced this behavior without Spring Security being involved, by adding an artificial delay to a test controller method as follows:
@GetMapping("/")
public Mono<String> test() {
return Mono.just("Spring Framework").delayElement(Duration.ofSeconds(5));
}
The cancellation signals reliably happen at the end of the benchmark run. I think that what's happening is that the benchmarking tool is sending lots of concurrent requests and stops when the chosen count/duration is reached and then cancels all the remaining in-flight requests at that time. This explains why the request count is not the same in the benchmarking report and in the metrics count. This is also reported as an expected behavior by the wrk maintainers.
When running the benchmark for longer periods with the application in debug mode, my breakpoints in the metrics filter were only reached at the end of the benchmark. I think this explains the behavior we're seeing here. Unless there is another data point that could lead us to a bug in Spring or Reactor, I think it would be interesting to dig a bit more into your production use case and understand where those client disconnects could come from.
Could it be that a client (even WebClient) is using a scatter-gather pattern where an issue with one of the remote calls causes the cancellation of the others? Maybe a client is using a primary/fallback mechanism where, if the main service doesn't respond within a certain time limit, the client cancels the connection and uses a fallback response instead (think Resilience4J and others)? If some endpoints in particular are more susceptible to seeing those cancellations, that would support this theory.
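As a hypothetical illustration of that fallback pattern (none of this is taken from the actual services; the service URL, path and timeout below are made up), a client-side timeout like the following cancels the in-flight HTTP call, which the server then records as a cancelled exchange:

import java.time.Duration;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

// Hypothetical client: if the primary service doesn't answer within the timeout,
// Reactor cancels the upstream subscription (closing the connection) and the
// fallback value is used instead; the server only sees a cancelled exchange.
public class FallbackClientExample {

    private final WebClient client = WebClient.create("http://primary-service");

    public Mono<String> helloWithFallback() {
        return client.get().uri("/v1/hello")
                .retrieve()
                .bodyToMono(String.class)
                .timeout(Duration.ofMillis(200))
                .onErrorResume(ex -> Mono.just("fallback"));
    }
}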
Back to the actual metric recorded: let me know if you've got ideas on how to improve this part. From my perspective, the "UNKNOWN" outcome and the exception name are consistent with what's happening: the client went away and we don't know why. I don't think suppressing those metrics would do much good, as we could be hiding the symptom of a real issue. In the future, Servlet containers might provide this information and the experience would be more consistent across servers.
After some more testing locally, with another Kotlin service sending java.net.http.HttpClient requests to the example endpoint, these are the results:
1ms delay in endpoint
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",status="200",uri="/v1/hello",} 199.0
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 99827.0
responses:
status:{200=100000}
message:{Hello=100000}
no delay, but loop with print in endpoint
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",status="200",uri="/v1/hello",} 91.0
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 99923.0
responses:
status:{200=100000}
message:{Hello=100000}
no delay in endpoint
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 100000.0
responses:
status:{200=100000}
message:{Hello=100000}
There is no indication of client cancellations, and those UNKNOWN metrics are logged consistently throughout the run, not only at the end. So based on this it really doesn't look like the issue is related to our production services or to the delay time, because a 1ms delay shouldn't have an influence like this.
I don't think we're making progress here; we're changing the "load test" infrastructure with each comment and we're spending time on something that will not be useful to your problem. In my opinion, there are only three ways out of this:
- you consider that those CancelledServerWebExchangeException entries are just noise to your application and that they're a distraction; in this case, you can contribute a MeterFilter in your application to filter those out (see the sketch after this list).
- you manage, under certain conditions, to find that something in this chain doesn't behave as it should: Netty (sending a channel event about the client disconnecting) -> Reactor Netty (cancelling the reactive pipeline) -> Spring Boot (turning this signal into a metric). If you're using a local load test setup you might find a bug in one of those or even in your local TCP stack; chances are quite low here, and even if you do, it might not be the problem you're seeing in production.
- you manage to collect data from production about clients disconnecting (configuring debug logs for Netty or Reactor Netty, network traffic capture) and really understand why you're seeing this in production.
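Here is a minimal sketch of the first option, assuming Micrometer's MeterFilter API; the configuration class and bean names are just illustrative:

import io.micrometer.core.instrument.config.MeterFilter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Minimal sketch: drop http.server.requests meters tagged with the cancellation
// exception so they no longer feed dashboards and alerts. Whether hiding them is
// desirable is exactly the trade-off discussed above.
@Configuration
public class CancelledExchangeMetricsConfiguration {

    @Bean
    public MeterFilter denyCancelledExchanges() {
        return MeterFilter.deny(id ->
                "http.server.requests".equals(id.getName())
                        && "CancelledServerWebExchangeException".equals(id.getTag("exception")));
    }
}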
In the meantime, I'm closing this issue as I can't justify spending more time on this right now.