
Add an optional cache for loop.getaddrinfo()

Open GoogleCodeExporter opened this issue 10 years ago • 5 comments

Hi,

I tried the crawl.py example, and I noticed that it resolves the host for each 
connection. For example, on my PC the script calls getaddrinfo() 160 times per 
second. It looks like each call sends a DNS request (a real UDP packet) to the 
DNS server. With DNSSEC enabled, it may even need to open a new TCP connection 
for each DNS resolution.

Would it make sense to write an optional cache for DNS resolution in 
BaseEventLoop? Or at least in crawl.py?

The common problem with a cache is configuring it: how many cached results? 
What timeout? The DNS protocol provides the timeout: the TTL field of a 
resource record (RR), which is a number of seconds. But the getaddrinfo() API 
doesn't expose this value.
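
To make the idea concrete, here is a rough sketch of what such an optional cache could look like as a wrapper around loop.getaddrinfo(). Everything here (the class name, the maxsize/ttl parameters) is made up for illustration; since getaddrinfo() doesn't expose the TTL, a fixed timeout is used instead, similar to what Firefox does:

```python
# Hypothetical sketch of a TTL-based cache on top of loop.getaddrinfo().
# Defaults mirror Firefox's (20 entries, 60 seconds); asyncio itself
# provides no such option.
import asyncio
import time


class CachedResolver:
    def __init__(self, loop, maxsize=20, ttl=60.0):
        self._loop = loop
        self._maxsize = maxsize
        self._ttl = ttl
        self._cache = {}  # key -> (expiry timestamp, result)

    async def getaddrinfo(self, host, port, **kwargs):
        key = (host, port, tuple(sorted(kwargs.items())))
        now = time.monotonic()
        entry = self._cache.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit, still fresh
        result = await self._loop.getaddrinfo(host, port, **kwargs)
        if len(self._cache) >= self._maxsize:
            # Evict the entry closest to expiry to make room.
            oldest = min(self._cache, key=lambda k: self._cache[k][0])
            del self._cache[oldest]
        self._cache[key] = (now + self._ttl, result)
        return result
```

A real version would also need to handle concurrent lookups of the same host (so that two simultaneous misses don't both hit the network), but this shows the shape of it.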

For example, Firefox caches 20 DNS results for 60 seconds by default.
http://kb.mozillazine.org/Network.dnsCacheExpiration
http://kb.mozillazine.org/Network.dnsCacheEntries

Info on DNS resolution in Chromium:
http://www.chromium.org/developers/design-documents/dns-prefetching

An old article (2011) says that Internet Explorer used a timeout of 24 hours, 
and it now uses a timeout of 30 minutes:
http://support.microsoft.com/kb/263558/en

See also issue #160 (Asynchronous DNS client). It is not directly related, 
because I don't see any option to cache results in those async DNS clients.

Original issue reported on code.google.com by [email protected] on 6 Mar 2014 at 5:01

GoogleCodeExporter avatar Apr 10 '15 16:04 GoogleCodeExporter

I've always felt uncomfortable with that, but on the systems where I've tried 
it, I believe getaddrinfo() has a cache of its own, because there's no 
noticeable delay. Feel free to contribute something!

Original comment by [email protected] on 6 Mar 2014 at 5:13

"I've always felt uncomfortable with that, but on the systems where I've tried 
it, I believe getaddrinfo() has a cache of its own, because there's no 
noticeable delay."

In fact, it looks fast. But using strace, I was surprised to see that it really 
sends UDP packets to the DNS server of my ISP. I suppose that crawl.py could be 
even faster with a DNS cache.

Original comment by [email protected] on 6 Mar 2014 at 5:29

On my OSX laptop, using timeit, I find that getaddrinfo("xkcd.com", 80) takes 
an average of 156 usec. I don't believe that's enough for a UDP round trip 
anywhere. Maybe it depends on how your OS resolver library is configured?

(I don't doubt that you're seeing what you're reporting -- I'm just doubting 
that when *I* run crawl.py xkcd.com and it takes 7 seconds to complete, the 
DNS lookups are slowing me down.)

Original comment by [email protected] on 6 Mar 2014 at 5:53

On my Fedora 20:

$ python3 -m timeit -s 'import socket' 'socket.getaddrinfo("xkcd.com", 80)'
10 loops, best of 3: 28.9 msec per loop

Obviously, your OSX has a cache :-) According to the Chromium doc, Windows 
also has a DNS cache:
http://www.chromium.org/developers/design-documents/dns-prefetching
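
A quick way to check for such an OS-level cache from Python is to time the first lookup against an immediate repeat. This is only a sketch: localhost is used here so it runs without the network (it usually resolves via the hosts file, not DNS), and the absolute numbers depend entirely on how the resolver is set up. Substitute a real hostname to see the actual DNS round trip:

```python
# Sketch: time a first getaddrinfo() call against an immediate repeat.
# On a system with an OS or daemon cache, the repeat should be much
# faster; without one, both calls pay the full DNS round trip.
import socket
import time


def timed_lookup(host, port=80):
    start = time.perf_counter()
    result = socket.getaddrinfo(host, port)
    return result, time.perf_counter() - start


addrs, first = timed_lookup("localhost")
_, repeat = timed_lookup("localhost")
print(f"first: {first * 1e6:.0f} usec, repeat: {repeat * 1e6:.0f} usec")
```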

Original comment by [email protected] on 7 Mar 2014 at 2:32

FYI, there are various DNS cache daemons on Linux: dnsmasq, unbound, nscd, sssd. 
On Fedora, NetworkManager configures dnsmasq. On my setup, dnsmasq is not 
running because I disabled NetworkManager to set up a bridge to run virtual 
machines.

Original comment by [email protected] on 16 Jun 2014 at 10:05
