asyncio
Add an optional cache for loop.getaddrinfo()
Hi,
I tried the crawl.py example, and I noticed that it resolves the host name for each
connection. For example, on my PC the script calls getaddrinfo() 160 times per
second. It looks like each call sends a DNS request (a real UDP packet) to the
DNS server. With DNSSEC enabled, it may even need to open a new TCP connection
for each DNS resolution.
Would it make sense to write an optional cache for DNS resolution in
BaseEventLoop? Or at least in crawl.py?
The usual problem with a cache is how to configure it: number of cached results?
timeout? The DNS protocol provides the timeout: the TTL field of a resource
record (RR), which is a number of seconds. But the getaddrinfo() API doesn't
expose this value.
For example, Firefox caches 20 DNS results for 60 seconds by default.
http://kb.mozillazine.org/Network.dnsCacheExpiration
http://kb.mozillazine.org/Network.dnsCacheEntries
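To make the idea concrete, here is a rough sketch of what such an optional cache could look like. It is not an existing asyncio API: the class name GetaddrinfoCache and its max_entries, ttl, resolve and clock parameters are hypothetical, and the defaults just mirror the Firefox numbers above (20 entries, 60 seconds). Since getaddrinfo() does not report the TTL, a fixed expiration time is assumed.

```python
import asyncio
import time
from collections import OrderedDict


class GetaddrinfoCache:
    """Hypothetical wrapper around loop.getaddrinfo() with a fixed-TTL cache."""

    def __init__(self, max_entries=20, ttl=60.0, resolve=None,
                 clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl                  # seconds; fixed, since the real TTL is unknown
        self._resolve = resolve         # optional replacement resolver (for tests)
        self._clock = clock
        self._entries = OrderedDict()   # key -> (expiry time, result)

    async def getaddrinfo(self, host, port, *, family=0, type=0, proto=0,
                          flags=0):
        key = (host, port, family, type, proto, flags)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Fresh hit: mark as recently used and skip the real lookup.
            self._entries.move_to_end(key)
            return entry[1]

        # Miss or expired entry: do the real lookup.
        # Note: concurrent misses for the same key each trigger a lookup.
        if self._resolve is not None:
            result = await self._resolve(host, port, family=family, type=type,
                                         proto=proto, flags=flags)
        else:
            loop = asyncio.get_running_loop()
            result = await loop.getaddrinfo(host, port, family=family,
                                            type=type, proto=proto,
                                            flags=flags)

        self._entries[key] = (now + self.ttl, result)
        self._entries.move_to_end(key)
        # Evict the least recently used entries beyond the size limit.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
        return result
```

In crawl.py, something like this could be created once and shared by all connections, so that repeated lookups of the same host within the expiration window do not hit the DNS server again.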
Info on DNS resolution in Chromium:
http://www.chromium.org/developers/design-documents/dns-prefetching
An old article (2011) says that Internet Explorer used a timeout of 24 hours,
and it now uses a timeout of 30 minutes:
http://support.microsoft.com/kb/263558/en
See also issue #160 (Asynchronous DNS client). It is not directly related
because I don't see any option to cache results in these async DNS clients.
Original issue reported on code.google.com by [email protected] on 6 Mar 2014 at 5:01
I've always felt uncomfortable with that, but on the systems where I've tried
it, I believe getaddrinfo() has a cache of its own, because there's no
noticeable delay. Feel free to contribute something!
Original comment by [email protected] on 6 Mar 2014 at 5:13
"I've always felt uncomfortable with that, but on the systems where I've tried
it, I believe getaddrinfo() has a cache of its own, because there's no
noticeable delay."
Indeed, it looks fast. But using strace, I was surprised to see that it really
sends UDP packets to the DNS server of my ISP. I suppose that crawl.py could be
even faster with a DNS cache.
Original comment by [email protected] on 6 Mar 2014 at 5:29
On my OSX laptop, using timeit, I find that getaddrinfo("xkcd.com", 80) takes
an average of 156 usec. I don't believe that's enough for a UDP round trip
anywhere. Maybe it depends on how your OS resolver library is configured?
(I don't doubt that you're seeing what you're reporting -- I'm just doubting
that when *I* run crawl.py xkcd.com and it takes 7 seconds to complete, the DNS
lookups are slowing me down.)
Original comment by [email protected] on 6 Mar 2014 at 5:53
On my Fedora 20:
$ python3 -m timeit -s 'import socket' 'socket.getaddrinfo("xkcd.com", 80)'
10 loops, best of 3: 28.9 msec per loop
Obviously, your OSX has a cache :-) According to the Chromium doc, Windows also
has a DNS cache:
http://www.chromium.org/developers/design-documents/dns-prefetching
Original comment by [email protected] on 7 Mar 2014 at 2:32
FYI there are various DNS cache daemons on Linux: dnsmasq, unbound, nscd, sssd.
On Fedora, Network Manager configures dnsmasq. On my setup, dnsmasq is not running
because I disabled Network Manager to set up a bridge to run virtual machines.
Original comment by [email protected] on 16 Jun 2014 at 10:05