tornado Uncaught exception - After upgrading python version

Hi,

I'm using Tornado 5.1.1 Recently I upgraded my python from 2.7.15 to 3.6.8.

Yesterday I got a strange exceptions for a few seconds (tons of that):

2020-04-04 21:04:39.153 ERROR [4592 140100505929536 tornado.general] Uncaught exception
Traceback (most recent call last):
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/http1connection.py", line 735, in _server_request_loop
    ret = yield conn.read_response(request_delegate)
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/http1connection.py", line 245, in _read_message
    self.stream.set_close_callback(self._on_connection_close)
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/iostream.py", line 606, in set_close_callback
    self._maybe_add_error_listener()
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/iostream.py", line 1128, in _maybe_add_error_listener
    self._add_io_state(ioloop.IOLoop.READ)
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/iostream.py", line 1158, in _add_io_state
    self.fileno(), self._handle_events, self._state)
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/platform/asyncio.py", line 76, in add_handler
    raise ValueError("fd %s added twice" % fd)
ValueError: fd 65 added twice
2020-04-04 21:04:39.417 ERROR [4118 140423800596288 tornado.general] Uncaught exception
Traceback (most recent call last):
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/http1connection.py", line 735, in _server_request_loop
    ret = yield conn.read_response(request_delegate)
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/http1connection.py", line 245, in _read_message
    self.stream.set_close_callback(self._on_connection_close)
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/iostream.py", line 606, in set_close_callback
    self._maybe_add_error_listener()
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/iostream.py", line 1128, in _maybe_add_error_listener
    self._add_io_state(ioloop.IOLoop.READ)
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/iostream.py", line 1158, in _add_io_state
    self.fileno(), self._handle_events, self._state)
  File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/platform/asyncio.py", line 76, in add_handler
    raise ValueError("fd %s added twice" % fd)
ValueError: fd 63 added twice

Any idea why it happened? or how to debug/reproduce? not sure if related to my upgrade or not but I never saw that before..

Apr 06 '20 12:04 itayB

Adding more info: I think that the issue is not the python upgrade, but a more interesting scenario: We start using the with_timeout (2 seconds timeout) and in the future coroutine argument, we are doing some another async_http_client.fetc calls and adding some callbacks (calling to grpc services). I suspect that it all started then. Are we doing something wrong?

Apr 09 '20 10:04 itayB

I'm not aware of anything that would cause this error. gen.with_timeout should be fine (if you were using the asyncio equivalent then it's possible there are cancellation-related bugs). Nothing you've said so far sounds like a problem.

Apr 26 '20 13:04 bdarnell

Found this ticket while trying to debug an issue showing up in our testing environment. Would appreciate any suggestions on how I might further debug the problem. I'm not sure exactly what is the cause of this issue - possibly network related - but it started randomly yesterday on one server (works on multiple other servers).

Environment: Solaris, Python 3.5.7 Tornado: 6.0.3

xxxx-v.v.v[7637]:  ERROR Uncaught exception
Traceback (most recent call last):
  File "/opt/xxxx/2020-07-21.0806/lib/python3.5/site-packages/tornado/http1connection.py", line 817, in _server_request_loop
    ret = await conn.read_response(request_delegate)
  File "/opt/xxxx/2020-07-21.0806/lib/python3.5/site-packages/tornado/http1connection.py", line 184, in _read_message
    b"\r?\n\r?\n", max_bytes=self.params.max_header_size
  File "/opt/xxxx/2020-07-21.0806/lib/python3.5/site-packages/tornado/iostream.py", line 373, in read_until_regex
    self._try_inline_read()
  File "/opt/xxxx/2020-07-21.0806/lib/python3.5/site-packages/tornado/iostream.py", line 838, in _try_inline_read
    self._add_io_state(ioloop.IOLoop.READ)
  File "/opt/gt/2020-07-21.0806/lib/python3.5/site-packages/tornado/iostream.py", line 1050, in _add_io_state
    self.io_loop.add_handler(self.fileno(), self._handle_events, self._state)
  File "/opt/xxxx/2020-07-21.0806/lib/python3.5/site-packages/tornado/platform/asyncio.py", line 96, in add_handler
    raise ValueError("fd %s added twice" % fd)
ValueError: fd 9 added twice

I'll note that I can fix with a stupid hack like calling update_handler instead of raise ValueError... ie

    def add_handler(
        self, fd: Union[int, _Selectable], handler: Callable[..., None], events: int
    ) -> None:
        fd, fileobj = self.split_fd(fd)
        if fd in self.handlers:
            #raise ValueError("fd %s added twice" % fd)
            self.update_handler(fd, events)
        self.handlers[fd] = (fileobj, handler)
        if events & IOLoop.READ:
            self.asyncio_loop.add_reader(fd, self._handle_events, fd, IOLoop.READ)
            self.readers.add(fd)
        if events & IOLoop.WRITE:
            self.asyncio_loop.add_writer(fd, self._handle_events, fd, IOLoop.WRITE)
            self.writers.add(fd)

Edit: Note at this time I believe the issue is actually somehow a CPU bound related problem and not a network related issue. Still investigating.

Sep 23 '20 19:09 petriborg

Hmm. I'm not aware of many people running Tornado on Solaris, and I've never tested there. First thing I'd do is run python3 -m tornado.test to see if the test suite is reporting any errors.

If the test suite doesn't turn up anything, the next thing that comes to mind is that maybe removing the fd is failing somewhere but the error is getting swallowed (I know we've got some broad except statements for removing file descriptors in the curl_httpclient code and we used to have them in the old IOLoop implementation, but I can't remember any particular trouble spots that still exist and are relevant to this stack trace. I'd look at any remove_handler calls and see if any errors could have gone missing.

Note that your stack trace and the first one in this thread aren't quite the same. Both start with conn.read_response but yours is in _try_inline_read and the first one is in set_close_callback. I'd still assume they have the same root cause, but it's possible there are two distinct issues here.

Sep 24 '20 00:09 bdarnell

I'm seeing similar, only very occasionally in a high-load environment, running under ubuntu-20.04 / linux-5.4 / python-3.7 / tornado-6.0.4:

ts=2020-09-23T19:48:17.176 level=ERROR loc=(tornado/http1connection.py:829) Uncaught exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tornado/http1connection.py", line 817, in _server_request_loop
    ret = await conn.read_response(request_delegate)
  File "/usr/local/lib/python3.7/dist-packages/tornado/http1connection.py", line 282, in _read_message
    self.stream.set_close_callback(self._on_connection_close)
  File "/usr/local/lib/python3.7/dist-packages/tornado/iostream.py", line 578, in set_close_callback
    self._maybe_add_error_listener()
  File "/usr/local/lib/python3.7/dist-packages/tornado/iostream.py", line 1050, in _maybe_add_error_listener
    self._add_io_state(ioloop.IOLoop.READ)
  File "/usr/local/lib/python3.7/dist-packages/tornado/iostream.py", line 1076, in _add_io_state
    self.io_loop.add_handler(self.fileno(), self._handle_events, self._state)
  File "/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py", line 97, in add_handler
    raise ValueError("fd %s added twice" % fd)
ValueError: fd 32 added twice

(everything seems fine though, the service continues to successfully service requests, as far as I can tell ...)

Sep 24 '20 01:09 ploxiln

Can confirm that the tornado tests all pass on Solaris.

I agree the traces aren't entirely similar, only that the failure point is the same. I have a pcap of the issue and here is what I'm seeing from a network perspective on the failed transaction:

apache -> tornado [syn]
apache <- tornado [syn, ack]
apache -> tornado [ack]
apache -> tornado [psh, ack]
apache <- tornado [ack]
apache -> tornado [psh, ack] POST completed (151 byte body)
apache <- tornado [ack]
apache <- tornado [fin, ack]
apache -> tornado [fin, ack]
apache <- tornado [ack]

From a code perspective, it appears that the iostream self._state is None when executing _try_inline_read, and so it calls the add_handler. Getting a debug trace of the issue is tricky - I'll see what I can do. Might be a few days.

EDIT - Unfortunately I had to switch away from this topic to work on a near deadline. Hopefully I'll be able to look into it sometime later.

Sep 24 '20 13:09 petriborg

If the test suite doesn't turn up anything, the next thing that comes to mind is that maybe removing the fd is failing somewhere but the error is getting swallowed (I know we've got some broad except statements for removing file descriptors in the curl_httpclient code and we used to have them in the old IOLoop implementation, but I can't remember any particular trouble spots that still exist and are relevant to this stack trace. I'd look at any remove_handler calls and see if any errors could have gone missing.

@bdarnell I'm also seeing curl_httpclient in the traceback if you take a look at the linked issue above. We were able to reproduce this issue by invoking the same function asynchronously at the same time. It might be related to thread safety.

Since https://github.com/tornadoweb/tornado/blob/master/tornado/curl_httpclient.py#L129-L131 needs to be run atomically, do you think adding a lock before these lines be a potential fix (for curl_httpclient)?

I'm thinking some step to reproduce the error could look like:

# Oversimplifed Pseudo-code

global_client = CurlAsyncHTTPClient()

t1 = Thread(lambda: global_client.do_something())
t2 = Thread(lambda: global_client.do_something())
t1.start()
t2.start()

# Hopefully some errors occur here

t1.join()
t2.join()

Mar 01 '21 17:03 tonyx93

do you think adding a lock before these lines be a potential fix (for curl_httpclient)?

It is a general requirement that all calling of tornado object methods must be done from the event loop thread. Tornado is generally not thread safe.

References: https://www.tornadoweb.org/en/stable/web.html#thread-safety-notes https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop.add_callback

Mar 01 '21 17:03 ploxiln

tornado tornado copied to clipboard

Uncaught exception - After upgrading python version

tornado
tornado copied to clipboard