Fix for unexpected socket closures and data leakage under heavy load
This addresses issue #645, as well as aiohttp/aiohappyeyeballs#93 and aiohttp/aiohappyeyeballs#112.
@todddialpad very nice! Do you know whether it helps with the other issue, https://github.com/MagicStack/uvloop/issues/506, which also seems to be related to incorrect sharing of sockets?
Any possibility of adding a test here?
I am trying to get a stable test. It is tricky because, if my guess is correct, it is a race condition: if TLS negotiation during a call to loop.create_connection with an explicit socket is cancelled, and a subsequent incoming connection is accepted before the CancelledError is propagated, then both libuv (or uvloop) first and aiohttp second end up closing the same underlying file descriptor.
So if this is the case, I don't think this will fix issue https://github.com/MagicStack/uvloop/issues/506, which could be similar but have a different root cause.
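To make the suspected sequence concrete, here is a rough sketch of the client-side pattern involved; the protocol factory, names, and timing below are invented for illustration, and this is not a reproduction test, just the shape of the code path:

import asyncio
import socket
import ssl

async def tls_connect_with_timeout(loop, host, port, timeout=1.0):
    # aiohttp-style pattern: create the socket ourselves, then hand it to
    # loop.create_connection(sock=...) to run the TLS handshake.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    await loop.sock_connect(sock, (host, port))
    ctx = ssl.create_default_context()
    try:
        return await asyncio.wait_for(
            loop.create_connection(
                asyncio.Protocol, sock=sock, ssl=ctx, server_hostname=host
            ),
            timeout,  # cancellation can land in the middle of the handshake
        )
    except (asyncio.TimeoutError, asyncio.CancelledError):
        # Suspected race: uvloop may already have closed the file descriptor
        # while tearing down the cancelled transport, and by the time this
        # cleanup runs the same fd number can belong to a freshly accepted
        # connection, so the second close() hits the wrong socket.
        sock.close()
        raise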
OK, I see. The linked issue was also concerning, as it looked as if data was being written into the wrong socket. We observed that error at roughly the same times as we observed response data leaking to the wrong requests, but we don't know whether that issue is actually related to the data leakage or is something else entirely. (These RuntimeErrors don't happen with vanilla asyncio.)
I still haven't been able to isolate a standalone, self-contained test. The test environment in which I generated the same error we see in production involves 2 VMs with significant network latency between them. The first of the VMs is just a web server; the second is a web server that accepts requests and then makes outgoing client requests (using aiohttp) to the first web server over TLS with a short timeout (around 1 second).
With this setup, I quite reliably get a failure within 250 connections. When I run with this patch applied, I have never had a failure in 20,000 connections.
We have also run this in our production environment. When we first encountered this failure, we hit it within 1 hour of moving to aiohttp >= 3.10. Since applying this patch we have been running for 5 days with no failures.
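A simplified sketch of the kind of client loop involved is below; the URL and attempt count are hypothetical, and in the real setup these requests are driven through the second web server rather than a standalone script:

import asyncio
import aiohttp
import uvloop

async def hammer(url, attempts=250):
    # A short total timeout makes it likely that the TLS handshake gets
    # cancelled mid-flight, which appears to be what triggers the race.
    timeout = aiohttp.ClientTimeout(total=1)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        for _ in range(attempts):
            try:
                async with session.get(url) as resp:
                    await resp.read()
            except (aiohttp.ClientError, asyncio.TimeoutError):
                # Timeouts are expected; the bug shows up as a RuntimeError
                # about a file descriptor already used by a transport, or as
                # response data leaking between requests.
                pass

if __name__ == "__main__":
    uvloop.install()
    asyncio.run(hammer("https://first-vm.example.com/"))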
Is accepting this blocked on the tests that are failing? I don't think those failures are related to this change, as they are also failing for PR #644, which is solely a documentation change.
I looked at the test logs and I would guess that a dependency is causing the changed results. Related to this, I notice that in the failing tests an alpha release of Cython 3.1 is being used (Using cached Cython-3.1.0a1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata). Is this intentional?
In the last test run that passed, the release version was used (Using cached Cython-3.0.11-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata).
Hello everyone. Am I right in thinking that this MR fixes the issue below?
"RuntimeError: File descriptor 2877 is used by transport <TCPTransport closed=False reading=True 0x55b8dc9baa90>"
Hello everyone :)
Like many other users of this library, I would be happy to see this fix land in one of the upcoming releases. Could you say roughly when to expect it?
@1st1 @fantix Is anyone available to merge and release this? We're getting asked to put workarounds into aiohttp to deal with this; it would be nice to have it fixed here instead.
We added a workaround for this issue in https://github.com/aio-libs/aiohttp/pull/10464, but it's causing issues when used with the asyncio SelectorEventLoop (https://github.com/aio-libs/aiohttp/issues/10617), so we will likely revert it and wait for this PR instead.
Hey @elprans @1st1, do you think you'll be able to spare a minute to get this in?
Sorry, @fantix and I will be going through this PR and others this week.
Hey! I noticed that aiohttp 3.11.14 has been yanked. For those of us using uvloop and aiohttp and running into the File descriptor 91 is used by transport error, do you happen to know if there’s a temporary workaround or a specific combination of versions we can pin to in the meantime? Totally understand if we need to wait for this to be merged, just trying to keep things running smoothly in the short term. Thanks a lot!
You can pin to a yanked version.
Hi folks, we are also running into the issue below:
File descriptor 91 is used by transport
Wondering if there is a fix.
Our setup:
aiohappyeyeballs==2.6.1
aiohttp==3.11.18
aiohttp-cors==0.8.1
uvloop==0.21.0
Pinning to aiohttp==3.11.14 solved it for me
Just a heads-up: if you're using the default asyncio event loop (typically SelectorEventLoop), pinning to aiohttp==3.11.14 may introduce other issues due to some side effects in that version. If you're using uvloop exclusively, it's likely fine.
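For anyone unsure which event loop implementation a service is actually running on, a small diagnostic like the following (not part of this PR, and independent of aiohttp) prints the loop class in use:

import asyncio

async def report_loop():
    loop = asyncio.get_running_loop()
    # Prints e.g. "uvloop Loop" when uvloop is active, or an asyncio
    # selector/proactor loop class when the stock implementation is in use.
    print(type(loop).__module__, type(loop).__qualname__)

if __name__ == "__main__":
    # Start this the same way your application starts its loop; with
    # uvloop.install() (or uvloop.run) in place you should see uvloop.Loop.
    asyncio.run(report_loop())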
Ideally, we were hoping this PR would be merged to avoid relying on workarounds in aiohttp as we've already been down that road, had to revert, and don’t want a repeat. Unfortunately, this PR seems to have stalled.
Can we please prioritize this PR? It seems to be impacting many users.
@fantix thanks for having a look at this. I am not sure what to make of the test failures. They all seem to be related to Unix transports and subprocess transports, while this PR should only affect TCP transports. I'm trying to repro. My dev environment is Ubuntu / py3.12, and the tests that are failing here are passing there. For example:
test_process_send_signal_1 (test_process.Test_UV_Process.test_process_send_signal_1) ... ok
test_process_streams_basic_1 (test_process.Test_UV_Process.test_process_streams_basic_1) ... ok
test_process_streams_devnull (test_process.Test_UV_Process.test_process_streams_devnull) ... ok
test_process_streams_pass_fds (test_process.Test_UV_Process.test_process_streams_pass_fds) ... ok
Do you have any ideas on how to proceed?
They are breaking in the debug build, maybe try this:
https://github.com/MagicStack/uvloop/blob/96b7ed31afaf02800d779a395591da6a2c8c50e1/.github/workflows/tests.yml#L68-L71
- Tests / test (3.12, macos-latest) (pull_request)
Yes, I get the failures with the debug build, good eye. Thanks.
I have instrumented the changed code, and in a failing test, the modifications never even run (which makes sense since the test isn't creating any TCP connections).
I have built without this patch, and still see the failures with the debug build.
git clone --recursive https://github.com/magicstack/uvloop.git uvloop.official
cd uvloop.official/
python3 -m venv uvloop-dev
source uvloop-dev/bin/activate
pip install -e .[dev]
pip install psutil
make debug
make test
======================================================================
FAIL: test_process_streams_pass_fds (test_process.Test_UV_Process.test_process_streams_pass_fds) [Alive handle after test] (handle_name='UVProcessTransport')
----------------------------------------------------------------------
Traceback (most recent call last):
File "uvloop.official/uvloop/_testbase.py", line 142, in tearDown
self.assertEqual(
AssertionError: 1 != 0 : alive UVProcessTransport after test
So, could an upstream dependency have broken the debug build?
Since the last successful test run, the following upstream dependencies have changed:
Cython-3.1.0 (was 3.0.12)
aiohttp-3.11.18 (was 3.11.16)
frozenlist-1.6.0 (was 1.5.0)
mypy_extensions-1.1.0 (was 1.0.0)
setuptools-80.3.1 (was 78.1.0)
I rebuilt using Cython-3.0.12 and the tests passed.
Would a manual execution of the tests on the main branch still pass (assuming it will grab Cython 3.1.0)?
I forked the main branch and tried running the tests. It fails with Cython 3.1.0. I pinned Cython to < 3.1.0 and the tests pass. I included this PR, and with the pinned Cython, all tests pass.
So I believe this PR could be merged. I created an issue for Cython 3.1.0: #677.
Hi, checking back on this: is there any ETA on when it will be merged?
Following this PR, waiting for the fix.
Hi folks, any ETA on this?
Hi folks, can someone provide an update on this? We are facing an issue where we suddenly start seeing File descriptor 91 is used by transport on a running process, and while the error is occurring the process is not able to serve traffic. This is impacting our service stability.
We stopped using uvloop and didn't really observe any performance impact. It is probably better to stop using it until the issue is fixed, especially as we also observed information getting leaked under heavy load (that issue is hard to reproduce locally).
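For services that want to keep the door open to switching back once a fixed release is out, one low-churn option is to gate uvloop behind a flag; a minimal sketch, where the USE_UVLOOP variable name is invented for this example:

import asyncio
import os

def run(main_coro):
    # Opt into uvloop only when explicitly requested, so it can be disabled
    # now and re-enabled later without touching application code.
    if os.environ.get("USE_UVLOOP") == "1":
        import uvloop
        uvloop.install()
    return asyncio.run(main_coro)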
Hello, I also have this problem. It occurs when a call to a third-party interface times out in a high-concurrency scenario. Have you solved it? Please advise.
Uninstall uvloop? Several users have reported that the performance difference is small today, so if it's breaking your application...
Yes, we uninstalled it and did not observe any change in performance. (We process a non-trivial number of requests, 30K+ RPS, on highly concurrent servers.)