
inotify leaks

Open · justtempusername opened this issue 3 months ago • 12 comments

Still seeing "Failed to create DNS resolver channel…" spam on aiodns==3.5.0 with pycares 4.10.0

After upgrading from aiodns==3.2.0 / aiohttp==3.11.18 to aiodns==3.5.0 + aiohttp==3.12.15 + pycares==4.10.0, the logs fill with:

    Failed to create DNS resolver channel with automatic monitoring of resolver configuration changes. This usually means the system ran out of inotify watches. Falling back to socket state callback. Consider increasing the system inotify watch limit: Failed to initialize c-ares channel

Environment:

Ubuntu 22.04.5 LTS, Python 3.13.1
uvloop==0.21.0
aiohttp[speedups]==3.12.15
aiodns==3.5.0
pycares==4.10.0
aiohttp-socks==0.10.1

Workload: many short-lived ClientSessions; some requests via ProxyConnector(..., rdns=False)

Problem starts at aiodns==3.3.0

I don't yet understand how to reproduce the problem locally, but in the production environment the error appears every time

I use this code for requests:

      async with aiohttp.ClientSession(
          json_serialize=orjson.dumps,
          timeout=aiohttp.ClientTimeout(total=9, connect=1.5, sock_connect=1.5),
          connector=ProxyConnector.from_url(f"http://{proxy}", rdns=False),
      ) as sess:
          async with sess.get(

justtempusername, Sep 06 '25 06:09

What inotify limits do you have?

saghul, Sep 06 '25 06:09

cat /proc/sys/fs/inotify/max_user_watches
1048576
cat /proc/sys/fs/inotify/max_user_instances
128
cat /proc/sys/fs/inotify/max_queued_events
32768

I tried increasing these values, and the spam disappears for a while, but then reappears

justtempusername, Sep 06 '25 06:09

import asyncio
import traceback
from concurrent.futures import ProcessPoolExecutor

from loguru import logger


def wrapper_for_run_cpu_bound(func, *args, **kwargs):
    new_loop = asyncio.new_event_loop()
    new_loop.slow_callback_duration = 10
    asyncio.set_event_loop(new_loop)

    try:
        new_loop.run_until_complete(func(*args, **kwargs))
    except Exception:
        logger.error(traceback.format_exc())


pool_process = ProcessPoolExecutor(max_workers=230)

# Elsewhere, from the main event loop, CPU-bound work is offloaded like this
# (func is the coroutine function to run, args its arguments):
asyncio.get_running_loop().run_in_executor(
    pool_process,
    wrapper_for_run_cpu_bound,
    func,
    args,
)

I have a separate process for each user with its own event loop. Requests are made in these processes via aiohttp, as well as in the main event loop.

justtempusername, Sep 06 '25 06:09

Can you please create a standalone test case that exhibits the problem?

To be clear, these are not leaks; you are just creating too many resolvers. I'm not sure how much sense it makes to have separate process pools which then each run things in a thread pool; that's a lot of resources...

saghul, Sep 06 '25 08:09

I'm trying to figure out how to create a test case. I need this implementation because I have a lot of CPU-bound work. So why does everything work in version 3.2.0?

Is there a way to see what exactly creates the inotify watches and how many are currently active, ideally from Python code?

I set sudo sysctl fs.inotify.max_user_instances=15000. After a while, the overflow messages reappear. I doubt I could have that many; all aiohttp sessions are closed.
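
For reference, a minimal sketch of one way to inspect this on Linux (an assumption about the standard /proc layout, not something from the thread; it counts inotify instances and watches per PID and may need root to see other users' processes):

import glob
import os

def inotify_usage():
    # Map pid -> (inotify instances, total watches) by scanning /proc.
    usage = {}
    for fd_path in glob.glob("/proc/[0-9]*/fd/*"):
        try:
            if os.readlink(fd_path) != "anon_inode:inotify":
                continue
            _, _, pid, _, fd = fd_path.split("/")
            with open(f"/proc/{pid}/fdinfo/{fd}") as f:
                watches = sum(1 for line in f if line.startswith("inotify wd:"))
            instances, total = usage.get(pid, (0, 0))
            usage[pid] = (instances + 1, total + watches)
        except OSError:
            continue  # process exited or permission denied
    return usage

if __name__ == "__main__":
    for pid, (instances, watches) in sorted(inotify_usage().items()):
        print(f"pid={pid} instances={instances} watches={watches}")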

justtempusername, Sep 06 '25 09:09

I use this code for requests:

This has been discussed in other issues, and it seems there's no safe way to work around this kind of code. The expected usage for aiohttp is to create a ClientSession and then reuse it for all (or at least many) requests. If you create a new ClientSession for every request, then you're creating a new resolver for every request. My rough understanding from the existing issues on this is that this creates a race condition where you end up opening more before the last ones are fully closed (and c-ares doesn't have an interface that can guarantee us that it's closed at the correct time).
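
As a sketch of that expected pattern (a minimal example, not taken from the thread: one long-lived session reused for many requests, with the proxy connector and JSON serializer from the original snippet omitted and placeholder URLs):

import asyncio

import aiohttp

urls = ["https://example.com/"] * 10  # placeholder URLs

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        return await resp.text()

async def main():
    # One ClientSession (and therefore one resolver) shared by all requests,
    # instead of a new ClientSession per request.
    async with aiohttp.ClientSession(
        timeout=aiohttp.ClientTimeout(total=9, connect=1.5, sock_connect=1.5),
    ) as session:
        bodies = await asyncio.gather(*(fetch(session, url) for url in urls))
        print(len(bodies), "responses")

asyncio.run(main())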

Dreamsorcerer, Sep 06 '25 13:09

import asyncio
import traceback
from concurrent.futures import ProcessPoolExecutor

import aiodns
from loguru import logger

pool_process = ProcessPoolExecutor(max_workers=230)


def wrapper_for_run_cpu_bound(func, *args, **kwargs):
    new_loop = asyncio.new_event_loop()
    new_loop.slow_callback_duration = 10
    asyncio.set_event_loop(new_loop)

    try:
        new_loop.run_until_complete(func(*args, **kwargs))
    except Exception:
        logger.error(traceback.format_exc())


async def create_res():
    resolver = aiodns.DNSResolver()
    await resolver.close()


async def run_chunks(coros, chunk_size, return_exceptions=False):
    if not coros:
        return []

    chunks = [coros[i : i + chunk_size] for i in range(0, len(coros), chunk_size)]
    results = []

    if not chunks:
        return []

    for chunk in chunks:
        try:
            chunk_results = await asyncio.gather(
                *chunk, return_exceptions=return_exceptions
            )
            results.extend(chunk_results)
        except AttributeError:
            logger.debug(traceback.format_exc())

    return results


async def test(num):
    coroutines = [create_res() for _ in range(300)]
    await run_chunks(coroutines, chunk_size=10)


async def main():
    for x in range(9):
        asyncio.get_running_loop().run_in_executor(
            pool_process, wrapper_for_run_cpu_bound, test, x
        )


if __name__ == "__main__":
    asyncio.run(main())

I tried to roughly recreate the logic of the application, and I got the warning with this code.

The limit is 128, but in theory only 90 (9 executor tasks × 10 concurrent resolvers per chunk) can be in use at the same time. Apparently the error is somehow related to a race condition?

justtempusername, Sep 06 '25 13:09

My rough understanding from the existing issues on this is that this creates a race condition where you end up opening more before the last ones are fully closed (and c-ares doesn't have an interface that can guarantee us that it's closed at the correct time).

That's no longer the case. The problem is simply that if you create resources faster than they can be deallocated, you'll eventually run out of resources.

To be blunt, I'm not really inclined to look at the sample code above, since this is not representative of a realistic codebase:

async def create_res():
    resolver = aiodns.DNSResolver()
    await resolver.close()

As I said above, creating the resolver is cheaper than destroying it, so by doing that you'll reach the file descriptor limit.
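
For contrast, a minimal sketch (not from the thread) of the same kind of work done with a single shared resolver per batch instead of a create/close pair per lookup; the hostnames and query type are placeholders:

import asyncio

import aiodns

async def resolve_many(names):
    resolver = aiodns.DNSResolver()
    try:
        # One resolver (one c-ares channel) handles the whole batch.
        return await asyncio.gather(
            *(resolver.query(name, "A") for name in names),
            return_exceptions=True,
        )
    finally:
        await resolver.close()

asyncio.run(resolve_many(["example.com", "example.org"]))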

saghul, Sep 22 '25 08:09

The problem is simply that if you create resources faster than they can be deallocated, you'll eventually run out of resources.

Yes, but the race condition is that the deallocation isn't complete when the .close() method returns, right? I think bdraco mentioned that there's an unfortunate number of projects that create short-lived aiohttp.ClientSession objects and may hit this problem.

Dreamsorcerer, Sep 22 '25 12:09

That's no longer the case; I fixed that in pycares by waiting for all queries in the channel to end. The actual destruction happens in a different thread, which is why it takes a bit longer to destroy than to create, but there is no race condition that I'm aware of.

saghul, Sep 22 '25 14:09

TLDR: after saghul's fix, I can't reproduce the inotify leaks at high volume anymore unless the system is already in a bad state.

The only time I've seen the problem happen after saghul's fix is now at very high volume, when the system is CPU/resource starved and the destruction thread can't keep up. However, in that case the system was already experiencing other failures due to being overloaded, so it's an expected failure mode.

bdraco, Sep 22 '25 14:09

Here is a possible idea, though I'm not a huge fan of it because it hides the real problem (people creating tons of resolvers for no reason): cache the resolver when it is created with default values.

If no nameservers or kwargs are given, we could store that resolver as the default resolver for a given loop, and return that, rather than return a new one.

In the pathological case, this would solve a problem.

I'm not going to work on this myself, but if someone is so inclined, I'd review a PR :-)
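
A rough sketch of what such a per-loop cache could look like (this is not existing aiodns API; the function name and cleanup strategy are hypothetical):

import asyncio

import aiodns

# Hypothetical cache: one default resolver per running event loop.
_default_resolvers: dict[asyncio.AbstractEventLoop, aiodns.DNSResolver] = {}

def get_default_resolver() -> aiodns.DNSResolver:
    loop = asyncio.get_running_loop()
    resolver = _default_resolvers.get(loop)
    if resolver is None:
        # Only resolvers created with no nameservers/kwargs would be cached;
        # anything customized would still get its own instance.
        resolver = aiodns.DNSResolver()
        _default_resolvers[loop] = resolver
    return resolver

A real implementation would also need to drop the cache entry and close the cached resolver when its loop shuts down.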

saghul, Sep 22 '25 14:09