Sudden spike in TCP Close Wait
Describe the bug
We started seeing a spike in TCP Close Wait / Http Queue Length (+4k) on one of our Azure App Services for Windows running YARP. Scaling out by one instance (2 -> 3) fixed the issue temporarily; after scaling back down we saw another spike. This is the first occurrence of this issue in the ~4 years we've been running YARP. Neither the underlying infrastructure nor the code has changed in the last 2 months.
Further technical details
- Azure App Service on Windows (P2v3), VNET Integration / PaaS access over Private Endpoints. Some calls are made to an AFD load-balanced service (i.e. a public endpoint, which is subject to SNAT restrictions)
- YARP 2.1, default HttpClient settings
- The AFD load-balanced service is reached with a typed HttpClient using a handler lifetime of 90 seconds (down from the default 2 minutes); see the sketch after this list
- .NET 8
- YARP forwards requests to ~20 clusters (unique destinations, also Azure App Services), with total traffic of about 150 requests/s
- Requests to a few specific endpoints may take over 5-10 seconds, but those are limited in number.
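
For context, the typed client mentioned above is registered roughly as follows (a minimal sketch; the client type name and endpoint are placeholders, only the SetHandlerLifetime value reflects our actual setting):

```csharp
// Sketch of the typed-client registration described above. "AfdServiceClient" and the
// base address are placeholders; the relevant part is SetHandlerLifetime, which rotates
// the underlying handler (and its connection pool) every 90 seconds instead of the
// IHttpClientFactory default of 2 minutes.
builder.Services.AddHttpClient<AfdServiceClient>(client =>
    {
        client.BaseAddress = new Uri("https://example-afd-endpoint.azurefd.net/");
    })
    .SetHandlerLifetime(TimeSpan.FromSeconds(90));
```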
Could there be an HttpClient setting we can optimize to release / close connections earlier?
You can control most of the timeouts on the HttpClient via YARP's HTTP Client Configuration. There are options on SocketsHttpHandler for the connection pool's idle timeout, web sockets, etc.; it all depends on exactly which scenario is holding the connections open.
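
For example, if the proxy is set up in code, the SocketsHttpHandler that YARP creates per cluster can be adjusted via ConfigureHttpClient. A minimal sketch, assuming a standard LoadFromConfig setup; the timeout values are purely illustrative, not recommendations:

```csharp
// Illustrative only: shorten how long idle connections linger in the pool and cap the
// total lifetime of pooled connections. This applies to every cluster's outbound client.
builder.Services.AddReverseProxy()
    .LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"))
    .ConfigureHttpClient((context, handler) =>
    {
        handler.PooledConnectionIdleTimeout = TimeSpan.FromSeconds(60);
        handler.PooledConnectionLifetime = TimeSpan.FromMinutes(5);
    });
```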
While YARP and your infrastructure have not changed, there are implicit proxies in the chain for App Service and possibly AFD, and that Azure infrastructure can be updated without any changes on your end. If you have more details on the state of the connections that are being held open too long, we can probably start to track down what changed.
@samsp-msft Thanks for your reply. We've also opened an Azure support request, but so far no new insights have come from it; Azure apparently does not have information about outgoing connections / TCP stats. We are prohibited from running netstat in the Kudu console, so we'll have to find another way of identifying the outgoing calls responsible. The other public endpoint we're connecting to is App Insights for telemetry data, but I find it unlikely to be the cause.
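
One workaround we may try, since netstat is unavailable: read the runtime's built-in "System.Net.Sockets" / "System.Net.Http" EventCounters in-process and push them into our existing logging. A rough sketch (the exact counter names and payload fields depend on the runtime version, so this just dumps whatever it receives):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;
using System.Linq;

// Rough sketch: subscribe to the runtime's networking EventCounters and log them.
// Keep a reference to an instance alive (e.g. as a singleton) so it isn't collected.
public sealed class NetCountersListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name is "System.Net.Sockets" or "System.Net.Http")
        {
            // Ask the event source to publish its counters every 30 seconds.
            EnableEvents(source, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string?> { ["EventCounterIntervalSec"] = "30" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName != "EventCounters" || eventData.Payload is null)
            return;

        foreach (var counter in eventData.Payload.OfType<IDictionary<string, object>>())
        {
            // Gauge-style counters report "Mean"; rate-style counters report "Increment".
            var value = counter.TryGetValue("Mean", out var mean) ? mean
                      : counter.TryGetValue("Increment", out var inc) ? inc : null;
            Console.WriteLine($"{eventData.EventSource.Name}/{counter["Name"]}: {value}");
        }
    }
}
```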