High CPU usage when idle, possibly DNS related
Describe the issue: I'm seeing high CPU usage (4-6% on an 18 core/36 thread CPU) by "Service Host: DNS Client" associated with dask workers, even when they're not running tasks. Wireshark doesn't show any actual DNS traffic, even on loopback, but I can make the CPU usage start and stop reliably by starting or stopping the workers.
It seems to scale with the number of workers; if I start 100 workers it goes even higher.
Minimal Complete Verifiable Example:
```shell
mkdir testing
cd testing
uv init --python 3.12
uv add dask[distributed]
uv run dask scheduler
# In a separate window
uv run dask worker tcp://127.0.0.1:8786 --nworkers 36 --nthreads 1
```
Then observe CPU usage in task manager.
Environment:
- Dask version: 2025.11.0
- Python version: 3.12.10
- Operating System: Microsoft Windows [Version 10.0.19045.6456]
- Install method: uv
EDIT: it's about 160/second, not 16/second, so the once-per-second theory below doesn't explain the load.
I tried profiling an idle worker and found it spends a fair fraction of its running time calling psutil.net_io_counters().
On a hunch, I tried running psutil.net_io_counters() in a tight loop. It looks like, on my machine, net_io_counters() can run about 16 times per second on a single core, and this induces 2.4% CPU load in "DNS Client". I don't know how often this runs in a worker, but 36 workers each calling net_io_counters() once per second would roughly match the load I was seeing.
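For reference, this is roughly the kind of timing loop used above; a minimal sketch, assuming a plain one-second counting window rather than anything more sophisticated:

```python
# Count how many times psutil.net_io_counters() completes in one second.
import time

import psutil

deadline = time.perf_counter() + 1.0
calls = 0
while time.perf_counter() < deadline:
    psutil.net_io_counters()
    calls += 1

print(f"net_io_counters(): ~{calls} calls/second on this machine")
```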
Thanks for looking further into this. All Dask components (client, scheduler, workers) track their system metrics in order to make various decisions. So I wouldn't be surprised at all about this kind of information being looked up every second.
Specifically, psutil.net_io_counters() is used here:
https://github.com/dask/distributed/blob/c26f4cab5a3346a9972818dfe5c7479ff11aba94/distributed/system_monitor.py#L168-L176
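For context, this is the general shape of what a periodic system monitor does with those counters; a rough sketch only, not the linked code, and the metric names here are illustrative:

```python
# Illustrative sketch only -- not the SystemMonitor code -- of how cumulative
# net_io_counters() totals get turned into per-interval byte rates.
import time

import psutil


class NetIOSampler:
    def __init__(self):
        self._last = psutil.net_io_counters()
        self._last_time = time.monotonic()

    def sample(self):
        now = time.monotonic()
        counters = psutil.net_io_counters()
        duration = now - self._last_time
        rates = {
            "host_net_io.read_bps": (counters.bytes_recv - self._last.bytes_recv) / duration,
            "host_net_io.write_bps": (counters.bytes_sent - self._last.bytes_sent) / duration,
        }
        self._last, self._last_time = counters, now
        return rates


# Each component makes roughly one such sample per second, so the cost of a
# single net_io_counters() call is what matters here.
sampler = NetIOSampler()
time.sleep(1.0)
print(sampler.sample())
```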
> high CPU usage (4-6% on an 18 core/36 thread CPU)
You're seeing 4-6% of all cores of the CPU, or 4-6% of one thread?
Dask is always going to introduce some amount of overhead, but if there's a more performant way to query the network bytes read/write stats then we can definitely make that change.
This may also be a Windows specific issue, maybe this psutil operation is slow on Windows for some reason. So we could potentially find an alternative way to get this information just on Windows.
4-6% across all CPUs, not of a single thread. (It's not literally pegging a single thread, it's spread across all of them.)
That's definitely a lot of unnecessary overhead.
If someone has time to look at this I would start by exploring whether there is a more performant alternative to psutil.net_io_counters() on Windows to query the network read/write metrics.
I don't know if I'll chase this all the way down, but I can poke a bit deeper.
On Windows, psutil.net_io_counters() calls GetAdaptersAddresses twice, once to get a buffer size and once to get data on the available interfaces. Then it calls GetIfEntry2 for each interface. I wrote a command-line utility that makes the same sequence of calls, but without any Python object manipulation. It takes on the order of 6.5 msec to execute the call sequence, whether I do it from C or in a Python loop.
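For anyone who wants to poke at this without building a C utility, here is a rough ctypes sketch of the GetAdaptersAddresses half of that sequence. GetIfEntry2 is omitted (it needs a large MIB_IF_ROW2 struct definition), so the numbers undercount the full sequence:

```python
# Rough ctypes reproduction of the double GetAdaptersAddresses pattern,
# for timing purposes only.
import ctypes
import time

iphlpapi = ctypes.WinDLL("Iphlpapi")

AF_UNSPEC = 0
ERROR_BUFFER_OVERFLOW = 111


def get_adapters_two_calls(flags=0):
    # First call with a NULL buffer fails with ERROR_BUFFER_OVERFLOW and
    # writes the required buffer size into `size`.
    size = ctypes.c_ulong(0)
    ret = iphlpapi.GetAdaptersAddresses(AF_UNSPEC, flags, None, None, ctypes.byref(size))
    assert ret == ERROR_BUFFER_OVERFLOW, ret
    # Second call with a buffer of exactly that size fetches the data.
    buf = ctypes.create_string_buffer(size.value)
    ret = iphlpapi.GetAdaptersAddresses(AF_UNSPEC, flags, None, buf, ctypes.byref(size))
    assert ret == 0, ret
    return buf


n = 100
start = time.perf_counter()
for _ in range(n):
    get_adapters_two_calls()
elapsed = time.perf_counter() - start
print(f"two-call GetAdaptersAddresses: {1000 * elapsed / n:.2f} ms per iteration")
```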
About 80% of the time is spent in GetAdaptersAddresses, whose documentation (https://learn.microsoft.com/en-us/windows/win32/api/iphlpapi/nf-iphlpapi-getadaptersaddresses) includes the following note:
> The GetAdaptersAddresses function requires a significant amount of network resources and time to complete since all of the low-level network interface tables must be traversed.
It goes on to recommend pre-allocating a 15 KB buffer for the initial call rather than calling the function twice every time. (I do have a lot of network interfaces on this machine, which may be exacerbating the issue generally; here a roughly 25 KB buffer is needed, not 15 KB.) I changed my CLI to pre-allocate a buffer large enough for my machine, and the execution time dropped from ~6.5 msec to ~3.9 msec.
A smarter net_io_counters implementation could remember the previous buffer size and cut out one of the (expensive) calls to GetAdaptersAddresses the vast majority of the time.
Currently the code walks the list from GetAdaptersAddresses to get the interface indexes to pass to each call to GetIfEntry2. The only other information it uses is the friendly name, to set a dict key. But psutil.net_io_counters() doesn't even expose that dict unless you ask for it with pernic=True, which the code linked above doesn't do.
I was hoping it might be possible to just walk the interfaces by "guessing" the index numbers, but it looks like they're not sequential. If we didn't care about the exact displayed name for the interface, AND we had another way to get the list of interfaces, we could bypass GetAdaptersAddresses entirely. That may be a dead end.
However! I just found a family of GAA_FLAG_SKIP_* flags (GAA_FLAG_SKIP_DNS_INFO etc.) for GetAdaptersAddresses. Setting almost* all of those SKIP flags, in conjunction with calling GetAdaptersAddresses only once, gets it down to ~1.7 msec. (*I didn't skip the friendly name, for compatibility with the pernic=True case, and in any event skipping it had a negligible effect on runtime.)
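Putting the two ideas together, here is a hedged sketch of what the cheaper pattern could look like in ctypes: a remembered buffer size (32 KiB is my guess at "comfortably above the ~25 KB this machine needs"), a single call in the common case, and the SKIP flags set except for the friendly name. Flag values are from iptypes.h; GAA_FLAG_SKIP_DNS_INFO needs a reasonably recent Windows.

```python
# Hedged sketch of the cheaper pattern: remember a buffer size across calls,
# make a single GetAdaptersAddresses call in the common case, and pass the
# SKIP flags (everything except the friendly name).
import ctypes
import time

iphlpapi = ctypes.WinDLL("Iphlpapi")

AF_UNSPEC = 0
ERROR_BUFFER_OVERFLOW = 111

GAA_FLAG_SKIP_UNICAST = 0x0001
GAA_FLAG_SKIP_ANYCAST = 0x0002
GAA_FLAG_SKIP_MULTICAST = 0x0004
GAA_FLAG_SKIP_DNS_SERVER = 0x0008
GAA_FLAG_SKIP_DNS_INFO = 0x0800

# GAA_FLAG_SKIP_FRIENDLY_NAME (0x0020) is deliberately NOT set, so the
# interface names needed for pernic=True would still be available.
SKIP_FLAGS = (
    GAA_FLAG_SKIP_UNICAST
    | GAA_FLAG_SKIP_ANYCAST
    | GAA_FLAG_SKIP_MULTICAST
    | GAA_FLAG_SKIP_DNS_SERVER
    | GAA_FLAG_SKIP_DNS_INFO
)

_buf_size = 32 * 1024  # starting guess; remembered (and grown) across calls


def get_adapters_single_call(flags=SKIP_FLAGS):
    global _buf_size
    while True:
        size = ctypes.c_ulong(_buf_size)
        buf = ctypes.create_string_buffer(_buf_size)
        ret = iphlpapi.GetAdaptersAddresses(AF_UNSPEC, flags, None, buf, ctypes.byref(size))
        if ret == ERROR_BUFFER_OVERFLOW:
            _buf_size = size.value  # buffer was too small: remember and retry
            continue
        assert ret == 0, ret
        return buf


n = 100
start = time.perf_counter()
for _ in range(n):
    get_adapters_single_call()
elapsed = time.perf_counter() - start
print(f"single-call + SKIP flags: {1000 * elapsed / n:.2f} ms per iteration")
```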
Hm. I'm definitely seeing the CPU load, but I think I must have typoed my initial finding of 16/sec in a tight loop; it's more like 160/sec, which is more consistent with the other numbers. That undercuts the once-per-second hypothesis, though: if ~160 calls/sec in a tight loop induces ~2.4% DNS Client load, then 36 workers polling once per second (~36 calls/sec) should induce well under 1%, assuming the load scales with call rate, not the 4-6% I'm seeing. So I'm still failing to understand some aspect of this. Anyway, I forwarded the speedup ideas to psutil.