trio icon indicating copy to clipboard operation
trio copied to clipboard

trio.socket.getaddrinfo may invoke a worker thread when it doesn't need to

Open zackw opened this issue 4 years ago • 3 comments

trio.socket.getaddrinfo tries to avoid invoking a worker thread when it knows that stdlib getaddrinfo won't block: the host is already an IP address and the port is already numeric. It does this by calling stdlib getaddrinfo with AI_NUMERICHOST|AI_NUMERICSERV added to the flags. If that call succeeds, it returns immediately, otherwise it calls getaddrinfo in a worker thread.

The problem is that trio assumes that a failure of this initial call is because we need to do a host and/or service lookup. This is not necessarily true. The most important case is when the caller of trio.socket.getaddrinfo already set AI_NUMERICHOST|AI_NUMERICSERV in the flags; in that case the call in the worker thread is going to fail in the exact same way that the call in the main thread failed, and there's no point going through the thread. Failures for unrelated reasons (e.g. inappropriate address family) will also fail the same way in the worker thread Failures are only retried in the worker thread if the getaddrinfo error code is EAI_NONAME, so we only need to worry about the cases where that can happen, but there are still several types of invalid "host" and/or "port" argument that will produce EAI_NONAME and there's no point retrying, even if caller didn't set AI_NUMERICHOST|AI_NUMERICSERV.

To fix this, getaddrinfo just needs to pay a little more attention to why the initial call failed and what the caller's flags were, and not retry calls that will provably fail the same way in the worker. This will also let us tell the authors of custom HostResolver implementations that they will not be called when both the host and the service were already numeric (but it's still possible for them to get an IP address with a service keyword, or a domain name with a numeric port) and they can rely on the rest of the arguments to have been validated already -- in particular they do not need to worry about the address family being something other than AF_INET, AF_INET6, or AF_UNSPEC.

I plan to work on this myself but I am filing this issue in advance to get advice about testing. Is there a good way to test "this operation does not do anything in a worker thread" already? I suppose I could define a test HostResolver that bombs out if called, since the logic is "call custom HR or else call stdlib getaddrinfo in a worker" if we get to that point, but maybe someone has a better idea?

zackw avatar Oct 14 '19 18:10 zackw

Huh. That's some impressive attention to edge cases. I approve :-). Is also appreciate hearing more about what kind of use case makes this detail important!

Our getaddrinfo tests already monkeypatch the stdlib socket.getaddrinfo in order to check that our trio.socket.getaddrinfo is invoking it the way we expect. We could use that strategy to test this too, I think. (For example, install a fake socket.getaddrinfo that records which thread it was called in.)

njsmith avatar Oct 14 '19 19:10 njsmith

Is also appreciate hearing more about what kind of use case makes this detail important!

It's not actually about the threads, for me. I was writing a HostResolver implementation and I was surprised to get called when I had passed AI_NUMERICHOST|AI_NUMERICSERV along with a domain name (from the tests).

The larger goal is to substitute dnspython and trio-native networking for the stdlib getaddrinfo, in order to log exact details of the DNS traffic as it comes on and off the wire, and also to send every query to a bunch of different DNS servers at the same time and compare the responses. (The even larger goal is Internet censorship monitoring, see https://arxiv.org/abs/1907.04245 .)

zackw avatar Oct 14 '19 20:10 zackw

Not sure if this is related, but I am seeing Trio freeze up completely in "await socket.getnameinfo((ip, 0), 0)" until it eventually throws a "transient failure" exception. (By freeze up completely, I mean--presumably--something that should have gone to a worker thread hasn't: My entire application hangs for a second or two until the call bails.) Note this call normally succeeds and everything works fine but once in a blue moon it hangs and ultimately fails.

[Update: More like 5-10 seconds. And the exact error is "[Errno -3] Temporary failure in name resolution"]

brandyn avatar May 07 '23 21:05 brandyn