Reactor Netty in a Spring Boot WebFlux application generates metrics with CancelledServerWebExchangeException exception and UNKNOWN outcome when there are no issues
We have a bunch of Spring Boot WebFlux services in our project and almost all of them have this same issue. We use Prometheus for metrics and track the success of requests. However, in those services from 1% to 20% of the http_server_requests metrics consist of outcome=UNKNOWN with exception=CancelledServerWebExchangeException, while there are no other indications of any issues in server responses, nor any indication that clients cancel that many requests. Examples:
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",platform="UNKNOWN",status="401",uri="UNKNOWN",} 87.0
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",platform="UNKNOWN",status="200",uri="UNKNOWN",} 110.0
I successfully reproduced this locally with a basic WebFlux application template and a single controller, bombarding it with https://httpd.apache.org/docs/2.4/programs/ab.html : ab -n 15000 -c 50 http://localhost:8080/v1/hello
I tried substituting Tomcat for Netty, and these metrics no longer appeared.
While it doesn't seem to cause direct issues for services running in production, it still interferes with the correctness of our metrics and alerts. We could ignore all the UNKNOWN outcomes, but then we can't know whether those UNKNOWN results come from actual server/client cancellations or just from this Netty issue.
Someone already had this issue in the past but it was never resolved: https://stackoverflow.com/questions/69913027/webflux-cancelledserverwebexchangeexception-appears-in-metrics-for-seemingly-no
Versions used: Spring Boot 2.7.2 and 2.6.2, Kotlin 1.7.10, JVM 17
I created example repo to reproduce this issue: https://github.com/MBalciunas/spring-webflux-netty-metrics-issue
This looks like a duplicate of #29599. What would you like the outcome of this issue to be?
It's not a duplicate of #29599, as that issue is concerned with the UNKNOWN status tag. In this ticket I'm raising the issue that outcome=UNKNOWN with exception=CancelledServerWebExchangeException seems to be generated in the metrics for no apparent reason. As shown in the examples in the description, this only happens with Netty and not Tomcat. So the outcome I'd like is either no more UNKNOWN metrics when nothing is actually wrong, or an answer as to whether this is correct behaviour, and if it is, why it happens when it seems it shouldn't.
CancelledServerWebExchangeException doesn't happen with Tomcat, because Servlet containers are not notified when a client closes the connection. With Servlet containers, you only get an error if the server fails to write to the client. This explains why you're seeing this metric with Reactor Netty and not Tomcat.
This metric is produced when a client closes the connection prematurely. From a Framework perspective, we have no way to know whether the client went away because the response took too long and it cancelled the call, or whether an intermediary network device is at fault. This is why we're using the outcome=UNKNOWN tag for this. From this perspective, we're emitting data for an event you might be interested in. The alternative would be to leave those requests out of the metrics entirely, but some would consider that a bug if they're tracking down an actual problem and there is no data for those requests.
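To illustrate how that signal reaches the metrics layer, here is a simplified sketch of a WebFlux filter observing a client disconnect as a Reactor cancellation; this is not Spring Boot's actual MetricsWebFilter, and the class name and metric name are purely illustrative:

import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;

// Simplified sketch, not Spring Boot's real implementation: when the client goes
// away, Reactor Netty cancels the reactive pipeline, and the only thing a filter
// can observe is the cancellation itself -- hence the generic outcome/exception tags.
public class CancellationCountingFilter implements WebFilter {

    private final MeterRegistry registry;

    public CancellationCountingFilter(MeterRegistry registry) {
        this.registry = registry;
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        return chain.filter(exchange)
                // Reached when the connection is closed before the exchange
                // completes; at this point we cannot tell why the client left.
                .doOnCancel(() -> registry.counter("http.server.cancellations",
                        "uri", exchange.getRequest().getPath().value()).increment());
    }
}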
I'm happy to improve the situation. Any suggestions?
Note: if I remember correctly, Apache Bench opens and closes a connection for each request (arguably, a strange approach for benchmarks). Benchmarking the TCP layer like this locally might explain why some connections are closed before the exchange is complete, from the server's perspective.
I just tried the example that @MBalciunas provided, and for me it's the same. ab reports that all 15k requests were successful, but the Spring Actuator endpoint shows 454 failed requests with exception CancelledServerWebExchangeException.
ab -n 15000 -c 50 -H "Authorization:Basic cHJvbWV0aGV1czpwcm9tZXRoZXVz" http://localhost:8080/v1/hello
This is ApacheBench, Version 2.3 <$Revision: 1901567 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 1500 requests
Completed 3000 requests
Completed 4500 requests
Completed 6000 requests
Completed 7500 requests
Completed 9000 requests
Completed 10500 requests
Completed 12000 requests
Completed 13500 requests
Completed 15000 requests
Finished 15000 requests
Server Software:
Server Hostname: localhost
Server Port: 8080
Document Path: /v1/hello
Document Length: 5 bytes
Concurrency Level: 50
Time taken for tests: 127.981 seconds
Complete requests: 15000
Failed requests: 0
Total transferred: 4440000 bytes
HTML transferred: 75000 bytes
Requests per second: 117.20 [#/sec] (mean)
Time per request: 426.604 [ms] (mean)
Time per request: 8.532 [ms] (mean, across all concurrent requests)
Transfer rate: 33.88 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 1.1 0 27
Processing: 73 425 112.1 430 880
Waiting: 73 425 112.1 430 880
Total: 73 426 112.1 430 880
Percentage of the requests served within a certain time (ms)
50% 430
66% 468
75% 494
80% 511
90% 558
95% 603
98% 658
99% 694
100% 880 (longest request)
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",status="200",uri="/v1/hello",} 454.0
http_server_requests_seconds_sum{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",status="200",uri="/v1/hello",} 197.189522582
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 14651.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 6204.099123688
It also seems weird that the sum of the metrics is bigger than the request count.
requests: 15000.
SUCCESS + UNKNOWN outcomes sum: 454.0 + 14651.0 = 15105.
I'd like to focus on the issue being discussed here: in the case of connections closed prematurely by clients, how is the current behavior incorrect? What would be the expected behavior?
Note that apache bench is not highly regarded as a benchmarking tool in general, and in this case it's not even using the keep-alive feature.
I tried with another benchmarking tool, wrk.
wrk -t12 -c100 -d15s -H "Authorization: Basic cHJvbWV0aGV1czpwcm9tZXRoZXVz" http://localhost:8080/v1/hello
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",status="200",uri="UNKNOWN",} 95.0
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 1455.0
12 threads and 100 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 934.85ms 182.49ms 1.96s 75.72%
Req/Sec 10.41 7.29 50.00 69.66%
1491 requests in 15.03s, 430.99KB read
Requests/sec: 99.23
Transfer/sec: 28.68KB
Still getting a considerable number of UNKNOWN responses even though there are no errors or indications of clients closing connections.
If connections really were being closed prematurely, behaviour like this would be fine. But we are not seeing indications of anywhere near the number of connection cancellations the metric suggests, which is why those metrics don't seem to reflect reality.
So you don't think there's a possibility that some bug/issue arises on the WebFlux/Netty metrics side?
With your latest tests, is it still the case that this behavior cannot be reproduced without Spring Security?
Without Spring Security:
Running 30s test @ http://localhost:8080/v1/hello
12 threads and 12 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 1.06ms 6.74ms 142.67ms 97.99%
Req/Sec 3.65k 0.98k 9.92k 82.22%
1303702 requests in 30.02s, 103.19MB read
Requests/sec: 43432.94
Transfer/sec: 3.44MB
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 1303702.0
And no UNKNOWN outcomes.
@MBalciunas Thanks - I'm reaching out to the Security team internally. To me it's somewhat expected to have additional latency there, as credential checks are expected to take time/CPU resources. But I'm not familiar with the cancellation behavior in this case. I'll report back here as soon as I know more.
After reaching out to the Security team and testing things further, here's an update.
A colleague from the security team couldn't reproduce the issue on their (more powerful?) machine. I don't think this is really related to Spring Security anyway. I think the behavior is easier to reproduce when security is involved, because request processing takes longer as we're creating new sessions all over the place.
I've carefully reproduced this behavior without Spring Security being involved, by adding an artificial delay to a test controller method as follows:
@GetMapping("/")
public Mono<String> test() {
return Mono.just("Spring Framework").delayElement(Duration.ofSeconds(5));
}
The cancellation signals reliably happen at the end of the benchmark run. I think that what's happening is that the benchmarking tool is sending lots of concurrent requests and stops when the chosen count/duration is reached and then cancels all the remaining in-flight requests at that time. This explains why the request count is not the same in the benchmarking report and in the metrics count. This is also reported as an expected behavior by the wrk maintainers.
When running the benchmark for longer periods with the application in debug mode, my breakpoints in the metrics filter were only reached at the end of the benchmark. I think this explains the behavior we're seeing here. Unless there is another data point that could lead us to a bug in Spring or Reactor, I think it would be interesting to dig a bit more into your production use case and understand where those client disconnects could come from.
Could it be that a client (even WebClient) is using a scatter-gather pattern where an issue with one of the remote calls causes the cancellation of the others? Maybe a client is using a primary/fallback mechanism where, if the main service doesn't respond within a certain time limit, the client cancels the connection and uses a fallback response instead (think Resilience4J and others)? If some endpoints in particular are more susceptible to seeing those cancellations, that would support this theory.
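As a hypothetical illustration of that fallback pattern (none of this is taken from the actual services; the service URL, path and timeout below are made up), a client-side timeout like the following cancels the in-flight HTTP call, which the server then records as a cancelled exchange:

import java.time.Duration;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Mono;

// Hypothetical client: if the primary service doesn't answer within the timeout,
// Reactor cancels the upstream subscription (closing the connection) and the
// fallback value is used instead; the server only sees a cancelled exchange.
public class FallbackClientExample {

    private final WebClient client = WebClient.create("http://primary-service");

    public Mono<String> helloWithFallback() {
        return client.get().uri("/v1/hello")
                .retrieve()
                .bodyToMono(String.class)
                .timeout(Duration.ofMillis(200))
                .onErrorResume(ex -> Mono.just("fallback"));
    }
}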
Back to the actual metric recorded: let me know if you've got ideas on how to improve this part. From my perspective, the "UNKNOWN" outcome and the exception name are consistent with what's happening: the client went away and we don't know why. I don't think suppressing those metrics would do much good, as we could be hiding the symptom of a real issue. In the future, Servlet containers might provide this information and the experience would be more consistent across servers.
After some more testing locally, with another Kotlin service sending java.net.http.HttpClient requests to the example endpoint, these are the results:
1ms delay in endpoint
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",status="200",uri="/v1/hello",} 199.0
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 99827.0
responses:
status:{200=100000}
message:{Hello=100000}
no delay, but loop with print in endpoint
http_server_requests_seconds_count{exception="CancelledServerWebExchangeException",method="GET",outcome="UNKNOWN",status="200",uri="/v1/hello",} 91.0
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 99923.0
responses:
status:{200=100000}
message:{Hello=100000}
no delay in endpoint
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/v1/hello",} 100000.0
responses:
status:{200=100000}
message:{Hello=100000}
There is no indication of client cancellations, and those UNKNOWN metrics are logged consistently throughout the run, not only at the end. So based on this it really doesn't look like the issue is related to our production services or to the delay time, because a 1ms delay shouldn't have an influence like this.
I don't think we're making progress here; we're changing the "load test" infrastructure with each comment and we're spending time on something that will not be useful to your problem. In my opinion, there are only three ways out of this:
- you consider that those CancelledServerWebExchangeException entries are just noise to your application and that they're a distraction; in this case, you can contribute a MeterFilter in your application to filter those out (see the sketch after this list).
- you manage, under certain conditions, to find that something in this chain doesn't behave as it should: Netty (sending a channel event about the client disconnecting) -> Reactor Netty (cancelling the reactive pipeline) -> Spring Boot (turning this signal into a metric). If you're using a local load test setup you might find a bug in one of those or even in your local TCP stack; chances are quite low here, and even if you do, it might not be the problem you're seeing in production.
- you manage to collect data from production about clients disconnecting (configuring debug logs for Netty or Reactor Netty, network traffic capture) and really understand why you're seeing this in production.
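Here is a minimal sketch of the first option, assuming Micrometer's MeterFilter API; the configuration class and bean names are just illustrative:

import io.micrometer.core.instrument.config.MeterFilter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Minimal sketch: drop http.server.requests meters tagged with the cancellation
// exception so they no longer feed dashboards and alerts. Whether hiding them is
// desirable is exactly the trade-off discussed above.
@Configuration
public class CancelledExchangeMetricsConfiguration {

    @Bean
    public MeterFilter denyCancelledExchanges() {
        return MeterFilter.deny(id ->
                "http.server.requests".equals(id.getName())
                        && "CancelledServerWebExchangeException".equals(id.getTag("exception")));
    }
}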
In the meantime, I'm closing this issue as I can't justify spending more time on this right now.