aiohttp icon indicating copy to clipboard operation
aiohttp copied to clipboard

Task exception was never retrieved on v3.8.3

Open w0rmr1d3r opened this issue 3 years ago • 42 comments

Describe the bug

We bumped from v3.8.1 to v3.8.3 this morning and started seeing the following errors. No other changes were done to our project:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_protocol.py", line 505, in start
    request = self._request_factory(message, payload, self, writer, handler)
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_app.py", line 446, in _make_request
    return _cls(
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_request.py", line 811, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_request.py", line 188, in __init__
    assert transport is not None
AssertionError

And

Task exception was never retrieved
future: <Task finished name='Task-30396' coro=<RequestHandler.start() done, defined at /usr/local/lib/python3.9/site-packages/aiohttp/web_protocol.py:462> exception=AssertionError()>

To Reproduce

  1. We were using v3.8.1
  2. Bump to v3.8.3

No other changes introduced from our side, those errors started appearing.

Expected behavior

For that exception to be handled.

Logs/tracebacks

Added them on the description, adding them here as well:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_protocol.py", line 505, in start
    request = self._request_factory(message, payload, self, writer, handler)
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_app.py", line 446, in _make_request
    return _cls(
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_request.py", line 811, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/aiohttp/web_request.py", line 188, in __init__
    assert transport is not None
AssertionError

And

Task exception was never retrieved
future: <Task finished name='Task-30396' coro=<RequestHandler.start() done, defined at /usr/local/lib/python3.9/site-packages/aiohttp/web_protocol.py:462> exception=AssertionError()>

Python Version

$ python --version
3.9.13

aiohttp Version

$ python -m pip show aiohttp
3.8.3

multidict Version

$ python -m pip show multidict
6.0.2

yarl Version

$ python -m pip show yarl
1.8.1

OS

Docker image

Related component

Server

Additional context

I have added as much context as I had, if I left out anything relevant, happy to look for it and add it to the issue.

Code of Conduct

  • [X] I agree to follow the aio-libs Code of Conduct

w0rmr1d3r avatar Sep 27 '22 13:09 w0rmr1d3r

Initial investigation of the exception not retrieved: The task is created here: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/web_protocol.py#L305 Then it looks like it should get awaited at: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/web_protocol.py#L288

If an exception is raised, then .done() would be True, so the await doesn't happen. My initial thought is that there should not be a .done() check on that code.

Not investigated anything about the assertion error though.

Dreamsorcerer avatar Sep 27 '22 16:09 Dreamsorcerer

I've also just recently seen the exact error described by @w0rmr1d3r pop up, seemingly at random. Will report back if we're able to get any more clarity or fixes on our end. Thanks everyone for looking :)

nrothGIT avatar Oct 05 '22 20:10 nrothGIT

Same issue after updating to v3.8.3

doublex avatar Oct 18 '22 20:10 doublex

Same issue after updating to v3.8.3

Indeed @doublex . This started appearing in our services when upgrading to v3.8.3 It doesn't happen with v3.8.1

w0rmr1d3r avatar Oct 19 '22 08:10 w0rmr1d3r

There aren't too many changes (if you ignore formatting/typing changes), so if anyone can take a look and figure out which change is the issue, that'd be great.

These 2 removed lines are the only ones that stand out to me, so maybe try adding them back in first: https://github.com/aio-libs/aiohttp/compare/v3.8.1...v3.8.3#diff-2126b277e07e3fdd05e7a81da456accf24e5515a46c78c48a79db4eb166043e4L302-L303

Dreamsorcerer avatar Oct 19 '22 13:10 Dreamsorcerer

Hi @Dreamsorcerer ,

Thank you for providing the diff between the affected versions.

I'm reading through the code and trying to understand. But you will have more context and knowledge on how this library works. My guessings:

File web_protocol.py on line 505, will always pass the handler as None. Since it's set on line 472 to be the self._task_handler. Which on line 305 is set as None, and the change on the highlighted lines never cancels the task_handler.

Still, I lack complete context of the code, just writing what I see.

Then, in the web_app.py file, line 437 -> 454, protocol will be a RequestHandler, but is received as None, since as seen in this line: https://github.com/aio-libs/aiohttp/blob/8cf01adc8c8dbf706e4cd33bf89fd5195f638715/aiohttp/web_protocol.py#L51 it can be "asyncio.Task[None]".

I guess, that is setting it to None, somehow?

Again, just tried to read through the code, but have 0 context on how the library works.

My guesses might be completely wrong, apologies for that.

w0rmr1d3r avatar Nov 16 '22 21:11 w0rmr1d3r

File web_protocol.py on line 505, will always pass the handler as None. Since it's set on line 472 to be the self._task_handler.

Line 473 asserts it to be not None, so that's not possible.

My suggestion was to try adding back in the lines 302-303 that were deleted and see if the problem is still reproducible. If it is, then maybe checkout the code at different commits until you find the commit that introduces the problem (you could use git bisect for this). Also remember to do this on the 3.9 branch, not master.

Dreamsorcerer avatar Nov 17 '22 11:11 Dreamsorcerer

Hello @Dreamsorcerer , Sorry for my late response.

So, what you mean is, adding back those lines in the next release and see if the issue is fixed? I understand that to be version 3.9.0? And if those lines make the issue go away, re-review the commits on branch 3.9?

Just wanted to clarify and get right the steps on this.

Thank you for providing an answer and maintain this project 🚀

w0rmr1d3r avatar Dec 05 '22 11:12 w0rmr1d3r

Yes, just clone the 3.9 branch, add those lines in, then test if the issue is fixed (also check the issue is reproducible before adding the line in). If so, make a PR and we'll look at releasing it in 3.9.

Dreamsorcerer avatar Dec 05 '22 17:12 Dreamsorcerer

Hello @Dreamsorcerer , I haven't had time to put effort on this. We received the dependabot bump PR for 3.8.4 and saw the same error.

Hopefully in the future I might be able to put some effort on this to debug it or something.

w0rmr1d3r avatar Feb 14 '23 13:02 w0rmr1d3r

3.8.4 only added a moderate severity fix. If you can figure out this issue, we'll put it into the 3.9 release.

Dreamsorcerer avatar Feb 14 '23 16:02 Dreamsorcerer

Just here to confirm that we can see this issue happening in python 3.9, 3.10 and 3.11 with aiohttp 3.8.4

jsoucheiron avatar Apr 17 '23 15:04 jsoucheiron

I can also confirm that this issue occurs in our project with aiohttp 3.8.3 and Python 3.11.

odesenfans avatar Apr 21 '23 14:04 odesenfans

There aren't too many changes (if you ignore formatting/typing changes), so if anyone can take a look and figure out which change is the issue, that'd be great. These 2 removed lines are the only ones that stand out to me, so maybe try adding them back in first: https://github.com/aio-libs/aiohttp/compare/v3.8.1...v3.8.3#diff-2126b277e07e3fdd05e7a81da456accf24e5515a46c78c48a79db4eb166043e4L302-L303

@Dreamsorcerer I tried reverting these lines on top of v3.8.3 and this does not fix the issue. They were also the only lines standing out to me in the git history, so I'll need to take a deeper look.

Looking at the logic, it seems like this originates from the connection_lost path: the transport is set to None early in BaseProtocol.connection_lost(). I could not reproduce this issue locally though so no way to tell for sure. I didn't see anything out of the ordinary in the code, so maybe the issue is somewhere else, ex: some level of concurrency happening between connection_lost() and the request handler?

My setup:

  • Python 3.11
  • aiohttp 3.8.3
  • gunicorn + uvloop workers.

odesenfans avatar Apr 23 '23 17:04 odesenfans

@Dreamsorcerer I tried reverting these lines on top of v3.8.3 and this does not fix the issue. They were also the only lines standing out to me in the git history, so I'll need to take a deeper look.

If you can run it from the git repo, then a git bisect or something could help find the commit which introduced the issue.

Dreamsorcerer avatar Apr 23 '23 17:04 Dreamsorcerer

@Dreamsorcerer I tried reverting these lines on top of v3.8.3 and this does not fix the issue. They were also the only lines standing out to me in the git history, so I'll need to take a deeper look.

If you can run it from the git repo, then a git bisect or something could help find the commit which introduced the issue.

I'd love to, but I have no idea how to build the package properly to test inside a Docker image (only solution I have to reproduce the issue at the moment). I've looked at the docs but I couldn't find a tip, I always end up with "aiohttp/_websocket.c: File not found". Any idea?

odesenfans avatar Apr 24 '23 10:04 odesenfans

In the meantime, I realized I hadn't installed my patched version correctly. It appears that applying the patch below as suggested fixes the issue.

diff --git a/aiohttp/web_protocol.py b/aiohttp/web_protocol.py
index 10a96080..f103b167 100644
--- a/aiohttp/web_protocol.py
+++ b/aiohttp/web_protocol.py
@@ -299,6 +299,8 @@ class RequestHandler(BaseProtocol):
                 exc = ConnectionResetError("Connection lost")
             self._current_request._cancel(exc)
 
+        if self._task_handler is not None:
+            self._task_handler.cancel()
         if self._waiter is not None:
             self._waiter.cancel()

Now, I saw in the commit history that this patch has already been applied, removed and reapplied (cf. the discussion here: https://github.com/aio-libs/aiohttp/issues/6719). What's the best course of action? I don't see cancellation exceptions in my setup, maybe the whole issue discussed in #6719 has been fixed in the meantime.

odesenfans avatar Apr 24 '23 10:04 odesenfans

Now, I saw in the commit history that this patch has already been applied, removed and reapplied (cf. the discussion here: #6719). What's the best course of action?

OK, I've got a vague idea on what the cause may be now. Can you try removing the and not self._task_handler.done() (not on a production server though, I expect it to break): https://github.com/aio-libs/aiohttp/blob/master/aiohttp/web_protocol.py#L287

My theory is that on connection_lost on older releases (and optionally in 3.9) we cancel the handler. Now we are not cancelling, I suspect there may be an exception occurring on self.transport.close() or something similar, because the connection was already closed: https://github.com/aio-libs/aiohttp/blob/master/aiohttp/web_protocol.py#L610

So, I'm hoping that by making that change, instead of an exception was not retrieved warning, you'll get the actual exception and then we can catch that error in future.

Dreamsorcerer avatar Apr 24 '23 18:04 Dreamsorcerer

This also suggests that we may be able to build an actual test, if anyone wants to give that a go? Hopefully, it just means we need an endpoint that takes a couple of seconds and then have a client disconnect prematurely to trigger the warning.

Dreamsorcerer avatar Apr 24 '23 18:04 Dreamsorcerer

I've tried replicating this issue in a test, but it passes just fine on the 3.8 branch. Am I approaching this in a wrong way?

async def test_client_connection_lost(aiohttp_client: Any) -> None:
    async def handler(_):
        await asyncio.sleep(2)
        return web.Response()

    app = web.Application()
    app.router.add_route("GET", "/", handler)
    client = await aiohttp_client(app)

    with pytest.raises(TimeoutError):
        await asyncio.wait_for(
            client.get("/"),
            timeout=1,
        )
    await client.close()

I've set a breakpoint in BaseProtocol.connection_lost and saw it being called twice, once for the client, then another time for the server.

kjagiello avatar May 12 '23 15:05 kjagiello

I've never managed to reproduce it, which is why I keep asking others to test/debug.

Dreamsorcerer avatar May 12 '23 21:05 Dreamsorcerer

I am currently facing this issue, I do not think it's easily reproduced in a test as you need to send some sort of client/server disconnection into the request/response handler.

Task exception was never retrieved
future: <Task finished name='Task-70887' coro=<RequestHandler.start() done, defined at /opt/venv/lib/python3.10/site-packages/aiohttp/web_protocol.py:462> exception=AssertionError()>

I will go down back to 3.8.1.

nosahama avatar Jun 19 '23 09:06 nosahama

If my theory is right, then we should be able to write a test. But, I'm still waiting on someone to confirm the cause by hopefully getting the traceback as described in https://github.com/aio-libs/aiohttp/issues/6978#issuecomment-1520627194

Dreamsorcerer avatar Jun 19 '23 15:06 Dreamsorcerer

I have been pretty busy since April but I'll try to reproduce the issue in the weeks to come.

odesenfans avatar Jun 19 '23 16:06 odesenfans

OK, I played around with the code, manually adding in exceptions etc. and I couldn't figure out any way to reproduce this. Hopefully, #7333 will help, but I don't think there's anything else I can do unless someone figures something else out.

Dreamsorcerer avatar Jul 09 '23 20:07 Dreamsorcerer

Once #7333 is released I can definitely give it a try.

jsoucheiron avatar Jul 10 '23 08:07 jsoucheiron

Still happening for v3.8.5 which includes #7333

ignaciobolonio avatar Jul 27 '23 09:07 ignaciobolonio

Continuing the conversation that started here.

I think this ticket is the correct one for my case given that the traceback is the same here.

  1. The issue reproduces relatively rare (up to 10 times per day) in the prod env, it could happen with any end point randomly.
  2. To test @Dreamsorcerer your suggestions, I need to deploy a custom version of aiohttp to the cloud (GAE), but whatever I tried I couldn't make GAE to use my modified version of the library and it either continued to use the version from PyPI or fail to run.
  3. So instead I ran a local env with the modified version of the library. Then wrote a browser script that from 4 separate tabs triggered 4 different end points every 10 ms. Unfortunately even after about 40K requests, not a single time I got the exception and accordingly self._force_close, self._close were always False and nothing broke from doing this comment neither.

Is there any way that I can add additional logs to test this in the current version of the library that I use in prod (3.8.5)? Maybe call some aiohttp method and print something upon any end point being hit?

Appreciate your help with trying to figure this out.

sedrakk avatar Jul 31 '23 21:07 sedrakk

2. To test @Dreamsorcerer your suggestions, I need to deploy a custom version of aiohttp to the cloud (GAE), but whatever I tried I couldn't make GAE to use my modified version of the library and it either continued to use the version from PyPI or fail to run.

I'm not failiar with GAE, but can you just drop aiohttp into the project directory, where it loads your app from?

Is there any way that I can add additional logs to test this in the current version of the library that I use in prod (3.8.5)? Maybe call some aiohttp method and print something upon any end point being hit?

I can't think of an easy way. You could try and monkey patch the library, but you'd probably have to copy the entire function (e.g. .start() method), add the print in, and then assign it to the aiohttp class.

Dreamsorcerer avatar Jul 31 '23 21:07 Dreamsorcerer

I'm not failiar with GAE, but can you just drop aiohttp into the project directory, where it loads your app from?

Or failing that, copy into the app's directory as a module, and then change the imports to from myapp import aiohttp, if necessary (assuming that doesn't break aiohttp's imports).

Dreamsorcerer avatar Jul 31 '23 21:07 Dreamsorcerer