Pathological behavior triggered by "slow" DNS
**Describe the bug**
In environments where YARP has to (re-)establish many connections to the same domain and DNS resolution latency is non-zero, resolution times can become excessive, leading to pathological behavior.
This is triggered by dotnet serializing all DNS requests for a domain, under the assumption that a local resolver cache will make them fast after the first query completes. That assumption generally does not hold on Linux/Kubernetes.
**To Reproduce**
- Get a DNS resolver with some latency (~5 ms is sufficient)
- Add a backend that forces reconnects
- Put sufficient load on the instance (e.g. >200 rps @ 5 ms)
- Observe the DNS metrics: the resolution time for the domain will keep climbing.
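For what it's worth, the per-host serialization can also be observed outside YARP with a small standalone sketch (the host name and lookup count below are arbitrary): with ~5 ms resolver latency, the slowest of 100 concurrent lookups for the same name comes out near 100 × 5 ms instead of ~5 ms.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Net;
using System.Threading.Tasks;

// Fire 100 concurrent async lookups for the same name and time each one.
// Because the runtime serializes per-host lookups, the measured times grow
// roughly linearly instead of all completing in ~one resolver round trip.
var host = "example.com"; // placeholder; use a name your (non-caching) resolver answers with some latency

var timings = await Task.WhenAll(Enumerable.Range(0, 100).Select(async _ =>
{
    var sw = Stopwatch.StartNew();
    await Dns.GetHostAddressesAsync(host);
    return sw.ElapsedMilliseconds;
}));

Console.WriteLine($"min: {timings.Min()} ms, max: {timings.Max()} ms, avg: {timings.Average():F1} ms");
```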
**Further technical details**
Found with YARP 2.3.0, dotnet 8 on AKS (Kubernetes, Linux). Probably not a problem on systems with built-in resolver caches, but it should hold for any system where DNS has some latency. In Kubernetes especially, many queries are "slow" because of the way the default search behavior for local domains is set up. I checked, and the serialization behavior still seems to be in place in the dotnet 10 preview.
In our case the pathological behavior was triggered by a relatively short-term backend overload that caused lots of reconnects. Due to the high rps, this was sufficient to drive DNS resolution times over the connection timeout, making it pretty much impossible for the instance to recover.
Imo it might make sense to provide some form of mitigation, workaround, or guidance in YARP for this. I am also considering reporting this to dotnet runtime, as I think it was written under the assumption of a Windows environment with very fast local caching.
This is being tracked under https://github.com/dotnet/runtime/issues/81023.
I'm not sure there's anything that'd be YARP-specific to this problem.
The general workaround when possible is to use ConnectCallback and avoid calling into Dns (e.g. do manual caching in process / use a managed DNS implementation).
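For illustration, a rough sketch of what that could look like, assuming YARP's `ConfigureHttpClient` extension and a naive in-process cache (the 5-second TTL and the cache shape are made up for the example; concurrent callers can still occasionally duplicate a lookup when an entry expires, and this is not production-hardened):

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Sockets;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Tiny in-process cache so only an occasional connection attempt pays for a Dns call;
// everything else reuses the last resolved addresses. 5s TTL is an arbitrary example value.
var dnsCache = new ConcurrentDictionary<string, (IPAddress[] Addresses, DateTime Expires)>();

builder.Services.AddReverseProxy()
    .LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"))
    .ConfigureHttpClient((context, handler) =>
    {
        handler.ConnectCallback = async (ctx, ct) =>
        {
            var host = ctx.DnsEndPoint.Host;

            // Refresh the entry at most once per TTL; other callers reuse the cached result.
            if (!dnsCache.TryGetValue(host, out var entry) || entry.Expires < DateTime.UtcNow)
            {
                var addresses = await Dns.GetHostAddressesAsync(host, ct);
                entry = (addresses, DateTime.UtcNow.AddSeconds(5));
                dnsCache[host] = entry;
            }

            var socket = new Socket(SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
            try
            {
                await socket.ConnectAsync(entry.Addresses, ctx.DnsEndPoint.Port, ct);
                return new NetworkStream(socket, ownsSocket: true);
            }
            catch
            {
                socket.Dispose();
                throw;
            }
        };
    });

var app = builder.Build();
app.MapReverseProxy();
app.Run();
```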
Sorry, failed to find the upstream issue in my search.
Aren't most YARP users going to be at least somewhat affected by bad DNS resolution times due to this? Even in "normal" operation we regularly see notably elevated DNS resolution times because of this serialization behavior. The upstream issue has been known since Jan 2023, so waiting for a fix from there seems like accepting quite a bit of not-very-obvious-to-debug pain for YARP users.
To me it feels like this should at least be a "known issue" in the documentation or something like that. Even better would be a safe, default-enabled workaround/mitigation, but I agree that is likely too hard to do right.
I wonder whether one could use the DnsDestinationResolver with a low refresh period - like 1s, to make sure changes are caught quickly - as an alternative to the ConnectCallback approach. Would that cause any concern regarding overhead / endpoint churn or something? If that works, it would seem like a lower-burden workaround to offer YARP users encountering this, compared to custom DNS caching / resolver logic.
> Aren't most YARP users going to be at least somewhat affected by bad DNS resolution times due to this?
To a small degree? Probably. Not sure it's categorically different from other services where you're making lots of API calls behind the scenes though. Please note that I'm not trying to say that this isn't a real or serious issue, just that YARP isn't necessarily special in this case.
Note that the DNS cost only shows up when you're establishing many new connections. E.g. you can easily sustain 200 rps on a single HTTP/2 connection for hours, with the DNS cost being ~0. Of course there might be scenarios where you do end up going through lots of connections, like proxying many short-lived HTTP/1 WebSockets.
> I wonder whether one could use the DnsDestinationResolver with a low refresh period
Should be possible, yes.
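A rough sketch of the wiring, assuming the `AddDnsDestinationResolver` extension and its `RefreshPeriod` option (the 1-second value is taken from the question above, not a recommendation):

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddReverseProxy()
    .LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"))
    // Resolve destination host names up front and refresh them on a short period,
    // so individual connection attempts get already-resolved addresses.
    .AddDnsDestinationResolver(options =>
    {
        options.RefreshPeriod = TimeSpan.FromSeconds(1); // short period to pick up changes quickly
    });

var app = builder.Build();
app.MapReverseProxy();
app.Run();
```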
Thanks for your prompt responses. Really appreciated.
I might be overestimating the general impact because it is quite visible in our specific setup. We can have quite spiky request rates requiring lots of new connections in a short amount of time. We also have limits on the number of requests per connection and on the keep-alive lifetime of connections.
Will give the DnsDestinationResolver approach a go.
Really hope upstream will fix this. Trying to prevent thread exhaustion is nice and all, but having your supposedly highly parallel dotnet service blocked on effectively a single thread doing one DNS query after another is not ideal, especially because it is such unexpected behavior. Will have to see if I can think of something more useful to contribute to the upstream discussion than a "me too" though :)