tornado
tornado copied to clipboard
Uncaught exception - After upgrading python version
Hi,
I'm using Tornado 5.1.1 Recently I upgraded my python from 2.7.15 to 3.6.8.
Yesterday I got a strange exceptions for a few seconds (tons of that):
2020-04-04 21:04:39.153 ERROR [4592 140100505929536 tornado.general] Uncaught exception
Traceback (most recent call last):
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/http1connection.py", line 735, in _server_request_loop
ret = yield conn.read_response(request_delegate)
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/gen.py", line 1147, in run
yielded = self.gen.send(value)
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/http1connection.py", line 245, in _read_message
self.stream.set_close_callback(self._on_connection_close)
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/iostream.py", line 606, in set_close_callback
self._maybe_add_error_listener()
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/iostream.py", line 1128, in _maybe_add_error_listener
self._add_io_state(ioloop.IOLoop.READ)
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/iostream.py", line 1158, in _add_io_state
self.fileno(), self._handle_events, self._state)
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/platform/asyncio.py", line 76, in add_handler
raise ValueError("fd %s added twice" % fd)
ValueError: fd 65 added twice
2020-04-04 21:04:39.417 ERROR [4118 140423800596288 tornado.general] Uncaught exception
Traceback (most recent call last):
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/http1connection.py", line 735, in _server_request_loop
ret = yield conn.read_response(request_delegate)
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/gen.py", line 1147, in run
yielded = self.gen.send(value)
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/http1connection.py", line 245, in _read_message
self.stream.set_close_callback(self._on_connection_close)
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/iostream.py", line 606, in set_close_callback
self._maybe_add_error_listener()
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/iostream.py", line 1128, in _maybe_add_error_listener
self._add_io_state(ioloop.IOLoop.READ)
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/iostream.py", line 1158, in _add_io_state
self.fileno(), self._handle_events, self._state)
File "/srv/dy/rcomserver/venv/lib64/python3.6/site-packages/tornado/platform/asyncio.py", line 76, in add_handler
raise ValueError("fd %s added twice" % fd)
ValueError: fd 63 added twice
Any idea why it happened? or how to debug/reproduce? not sure if related to my upgrade or not but I never saw that before..
Adding more info:
I think that the issue is not the python upgrade, but a more interesting scenario:
We start using the with_timeout (2 seconds timeout) and in the future coroutine argument, we are doing some another async_http_client.fetc calls and adding some callbacks (calling to grpc services).
I suspect that it all started then. Are we doing something wrong?
I'm not aware of anything that would cause this error. gen.with_timeout should be fine (if you were using the asyncio equivalent then it's possible there are cancellation-related bugs). Nothing you've said so far sounds like a problem.
Found this ticket while trying to debug an issue showing up in our testing environment. Would appreciate any suggestions on how I might further debug the problem. I'm not sure exactly what is the cause of this issue - possibly network related - but it started randomly yesterday on one server (works on multiple other servers).
Environment: Solaris, Python 3.5.7 Tornado: 6.0.3
xxxx-v.v.v[7637]: ERROR Uncaught exception
Traceback (most recent call last):
File "/opt/xxxx/2020-07-21.0806/lib/python3.5/site-packages/tornado/http1connection.py", line 817, in _server_request_loop
ret = await conn.read_response(request_delegate)
File "/opt/xxxx/2020-07-21.0806/lib/python3.5/site-packages/tornado/http1connection.py", line 184, in _read_message
b"\r?\n\r?\n", max_bytes=self.params.max_header_size
File "/opt/xxxx/2020-07-21.0806/lib/python3.5/site-packages/tornado/iostream.py", line 373, in read_until_regex
self._try_inline_read()
File "/opt/xxxx/2020-07-21.0806/lib/python3.5/site-packages/tornado/iostream.py", line 838, in _try_inline_read
self._add_io_state(ioloop.IOLoop.READ)
File "/opt/gt/2020-07-21.0806/lib/python3.5/site-packages/tornado/iostream.py", line 1050, in _add_io_state
self.io_loop.add_handler(self.fileno(), self._handle_events, self._state)
File "/opt/xxxx/2020-07-21.0806/lib/python3.5/site-packages/tornado/platform/asyncio.py", line 96, in add_handler
raise ValueError("fd %s added twice" % fd)
ValueError: fd 9 added twice
I'll note that I can fix with a stupid hack like calling update_handler instead of raise ValueError... ie
def add_handler(
self, fd: Union[int, _Selectable], handler: Callable[..., None], events: int
) -> None:
fd, fileobj = self.split_fd(fd)
if fd in self.handlers:
#raise ValueError("fd %s added twice" % fd)
self.update_handler(fd, events)
self.handlers[fd] = (fileobj, handler)
if events & IOLoop.READ:
self.asyncio_loop.add_reader(fd, self._handle_events, fd, IOLoop.READ)
self.readers.add(fd)
if events & IOLoop.WRITE:
self.asyncio_loop.add_writer(fd, self._handle_events, fd, IOLoop.WRITE)
self.writers.add(fd)
Edit: Note at this time I believe the issue is actually somehow a CPU bound related problem and not a network related issue. Still investigating.
Hmm. I'm not aware of many people running Tornado on Solaris, and I've never tested there. First thing I'd do is run python3 -m tornado.test to see if the test suite is reporting any errors.
If the test suite doesn't turn up anything, the next thing that comes to mind is that maybe removing the fd is failing somewhere but the error is getting swallowed (I know we've got some broad except statements for removing file descriptors in the curl_httpclient code and we used to have them in the old IOLoop implementation, but I can't remember any particular trouble spots that still exist and are relevant to this stack trace. I'd look at any remove_handler calls and see if any errors could have gone missing.
Note that your stack trace and the first one in this thread aren't quite the same. Both start with conn.read_response but yours is in _try_inline_read and the first one is in set_close_callback. I'd still assume they have the same root cause, but it's possible there are two distinct issues here.
I'm seeing similar, only very occasionally in a high-load environment, running under ubuntu-20.04 / linux-5.4 / python-3.7 / tornado-6.0.4:
ts=2020-09-23T19:48:17.176 level=ERROR loc=(tornado/http1connection.py:829) Uncaught exception
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/tornado/http1connection.py", line 817, in _server_request_loop
ret = await conn.read_response(request_delegate)
File "/usr/local/lib/python3.7/dist-packages/tornado/http1connection.py", line 282, in _read_message
self.stream.set_close_callback(self._on_connection_close)
File "/usr/local/lib/python3.7/dist-packages/tornado/iostream.py", line 578, in set_close_callback
self._maybe_add_error_listener()
File "/usr/local/lib/python3.7/dist-packages/tornado/iostream.py", line 1050, in _maybe_add_error_listener
self._add_io_state(ioloop.IOLoop.READ)
File "/usr/local/lib/python3.7/dist-packages/tornado/iostream.py", line 1076, in _add_io_state
self.io_loop.add_handler(self.fileno(), self._handle_events, self._state)
File "/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py", line 97, in add_handler
raise ValueError("fd %s added twice" % fd)
ValueError: fd 32 added twice
(everything seems fine though, the service continues to successfully service requests, as far as I can tell ...)
Can confirm that the tornado tests all pass on Solaris.
I agree the traces aren't entirely similar, only that the failure point is the same. I have a pcap of the issue and here is what I'm seeing from a network perspective on the failed transaction:
apache -> tornado [syn]
apache <- tornado [syn, ack]
apache -> tornado [ack]
apache -> tornado [psh, ack]
apache <- tornado [ack]
apache -> tornado [psh, ack] POST completed (151 byte body)
apache <- tornado [ack]
apache <- tornado [fin, ack]
apache -> tornado [fin, ack]
apache <- tornado [ack]
From a code perspective, it appears that the iostream self._state is None when executing _try_inline_read, and so it calls the add_handler. Getting a debug trace of the issue is tricky - I'll see what I can do. Might be a few days.
EDIT - Unfortunately I had to switch away from this topic to work on a near deadline. Hopefully I'll be able to look into it sometime later.
If the test suite doesn't turn up anything, the next thing that comes to mind is that maybe removing the
fdis failing somewhere but the error is getting swallowed (I know we've got some broadexceptstatements for removing file descriptors in thecurl_httpclientcode and we used to have them in the old IOLoop implementation, but I can't remember any particular trouble spots that still exist and are relevant to this stack trace. I'd look at anyremove_handlercalls and see if any errors could have gone missing.
@bdarnell
I'm also seeing curl_httpclient in the traceback if you take a look at the linked issue above.
We were able to reproduce this issue by invoking the same function asynchronously at the same time.
It might be related to thread safety.
Since https://github.com/tornadoweb/tornado/blob/master/tornado/curl_httpclient.py#L129-L131 needs to be run atomically, do you think adding a lock before these lines be a potential fix (for curl_httpclient)?
I'm thinking some step to reproduce the error could look like:
# Oversimplifed Pseudo-code
global_client = CurlAsyncHTTPClient()
t1 = Thread(lambda: global_client.do_something())
t2 = Thread(lambda: global_client.do_something())
t1.start()
t2.start()
# Hopefully some errors occur here
t1.join()
t2.join()
do you think adding a lock before these lines be a potential fix (for curl_httpclient)?
It is a general requirement that all calling of tornado object methods must be done from the event loop thread. Tornado is generally not thread safe.
References: https://www.tornadoweb.org/en/stable/web.html#thread-safety-notes https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop.add_callback