Connection issue with Temporal 1.9.x
I get a connection issue when running my worker against the latest version of Temporal:
```
2021-05-10 08:37:04,083 | DEBUG | protocol.py:data_received:714 | Protocol error
Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 224, in process_input
    func, target_state = self._transitions[(self.state, input_)]
KeyError: (<ConnectionState.CLOSED: 3>, <ConnectionInputs.RECV_PING: 14>)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/protocol.py", line 712, in data_received
    events = self.connection.feed(data)
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/protocol.py", line 189, in feed
    return self._connection.receive_data(data)  # type: ignore
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 1463, in receive_data
    events.extend(self._receive_frame(frame))
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 1486, in _receive_frame
    frames, events = self._frame_dispatch_table[frame.__class__](frame)
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 1759, in _receive_ping_frame
    events = self.state_machine.process_input(
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 228, in process_input
    raise ProtocolError(
h2.exceptions.ProtocolError: Invalid input ConnectionInputs.RECV_PING in state ConnectionState.CLOSED

2021-05-10 08:37:04,103 | ERROR | retry.py:retry_loop:29 | run failed: Connection lost, retrying in 6 seconds
Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/client.py", line 360, in recv_initial_metadata
    headers = await self._stream.recv_headers()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/protocol.py", line 342, in recv_headers
    await self.headers_received.wait()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/asyncio/locks.py", line 309, in wait
    await fut
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/retry.py", line 17, in retry_loop
    await fp(*args, **kwargs)
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/decision_loop.py", line 934, in run
    decision_task: PollWorkflowTaskQueueResponse = await self.poll()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/decision_loop.py", line 983, in poll
    task = await self.service.poll_workflow_task_queue(request=poll_decision_request)
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/api/workflowservice/v1.py", line 844, in poll_workflow_task_queue
    return await self._unary_unary(
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/betterproto/__init__.py", line 1133, in _unary_unary
    response = await stream.recv_message()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/client.py", line 408, in recv_message
    await self.recv_initial_metadata()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/client.py", line 380, in recv_initial_metadata
    self.initial_metadata = im
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/utils.py", line 70, in __exit__
    raise self._error
grpclib.exceptions.StreamTerminatedError: Connection lost
```
Here's what I've tracked down:
- Temporal added keep-alive options in 1.9.0.
- The new config option `frontend.keepAliveMaxConnectionAge` defaults to 5 minutes (it wasn't set prior to being added, so it defaulted to infinity in the gRPC Go SDK):
  > MaxConnectionAge is a duration for the maximum amount of time a connection may exist before it will be closed by sending a GoAway.
- This Python SDK uses `grpclib`, which uses `h2` (an HTTP/2 library). h2 doesn't support gracefully closing connections; I believe that explains this issue. (A minimal repro is sketched below.)
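For anyone who wants to see the failure mode without a Temporal server: the sketch below feeds the same frame sequence the frontend produces (SETTINGS, then a GOAWAY when MaxConnectionAge expires, then an in-flight keep-alive PING) straight into a bare `h2` client connection. `hyperframe` is already a dependency of `h2`, and the frame field values here are illustrative; behavior is as of the h2 version in the traceback above:

```python
from h2.config import H2Configuration
from h2.connection import H2Connection
from hyperframe.frame import GoAwayFrame, PingFrame, SettingsFrame

conn = H2Connection(H2Configuration(client_side=True))
conn.initiate_connection()  # sends the client preface; state is now CLIENT_OPEN

# Normal server preface, then the server announces connection shutdown.
conn.receive_data(SettingsFrame(stream_id=0).serialize())
conn.receive_data(GoAwayFrame(stream_id=0, last_stream_id=0, error_code=0).serialize())
# h2 has now moved its connection state machine to CLOSED.

# A keep-alive PING that was already in flight arrives after the GOAWAY.
# There is no (CLOSED, RECV_PING) transition, so h2 raises
# "ProtocolError: Invalid input ConnectionInputs.RECV_PING in state
# ConnectionState.CLOSED" -- the first traceback above.
conn.receive_data(PingFrame(stream_id=0, opaque_data=b"\x00" * 8).serialize())
```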
WORKAROUND
Set `frontend.keepAliveMaxConnectionAge` in the dynamic config to a large value (I'm not sure how to set it to infinity, which was the effective default before the 1.9.0 release):

```yaml
frontend.keepAliveMaxConnectionAge:
  - value: 87600h # 10 years
```
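If you run the server via the official docker-compose setup, the auto-setup image reads the dynamic config location from `DYNAMIC_CONFIG_FILE_PATH`; a sketch of the wiring (image tag, file names, and mount paths here are illustrative):

```yaml
# docker-compose.yml (excerpt)
services:
  temporal:
    image: temporalio/auto-setup:1.9.2
    environment:
      # Path is relative to the server's config root inside the container.
      - DYNAMIC_CONFIG_FILE_PATH=config/dynamicconfig/development.yaml
    volumes:
      # Mount the directory containing the override above.
      - ./dynamicconfig:/etc/temporal/config/dynamicconfig
```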
Chad, you are correct in your investigation; however, the fact that you are seeing errors on the client side is unexpected. We added this configuration to initiate healthy rotation of connections and to better re-balance them over time.
According to the gRPC documentation for keep-alive, when MaxConnectionAge is reached, a GoAway signal is sent to the client; only if the client doesn't close the connection within the MaxConnectionAgeGrace period will the connection be forcibly closed. The idea behind this is that the grace period should be longer than the long-poll request time, which should result in a clean connection closure.
Is it possible that the Python implementation of gRPC you are using is not handling the GoAway signal properly? (this issue?) My expectation is that the client should finish ongoing requests over the old connection and route new requests to sub-channels that use a new connection.
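For reference, the same knobs are exposed as standard gRPC channel arguments; a minimal sketch using grpcio's server options (Temporal's frontend is Go, not Python, so this only illustrates the MaxConnectionAge / MaxConnectionAgeGrace semantics described above, with illustrative values):

```python
from concurrent import futures

import grpc

server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=4),
    options=[
        # After 5 minutes the server sends a GoAway on the connection...
        ("grpc.max_connection_age_ms", 5 * 60 * 1000),
        # ...and only force-closes it if the client hasn't wrapped up within
        # the grace period (which should exceed the long-poll request time).
        ("grpc.max_connection_age_grace_ms", 70 * 1000),
    ],
)
```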
@vitarb this project uses `grpclib`, which uses `h2`. You're spot on that it isn't handling the GoAway signal properly (issue)
What do you think is the right thing to do here? My opinion is that we should aim for an actual fix in Python, because the server defaults seem reasonable given our usage profile and the gRPC spec, and I don't think we should be changing them. Meanwhile, overriding MaxConnectionAge on the server to a large or unlimited value sounds like a good mitigation strategy.
I agree with you. A couple of options are:
- Implement GoAway handling in the h2 library.
- Update to use a grpcio channel instead of the grpclib channel; maybe grpcio's channel handles GoAway signals?
  - This one might require a huge refactor, because the protobufs are compiled using python-betterproto, and it looks like it uses grpclib underneath.
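Until one of those lands, a worker-side band-aid is also conceivable: rebuild the channel whenever the server rotates the connection. A rough sketch only, where `make_service` is a hypothetical factory returning a fresh service stub on a new grpclib channel (the exception types and the poll call are the real ones from the traceback above):

```python
import asyncio

from grpclib.exceptions import StreamTerminatedError
from h2.exceptions import ProtocolError


async def poll_with_reconnect(make_service, request, backoff_seconds=6):
    """Poll the task queue, rebuilding the channel when the server closes
    the connection (GoAway) instead of surfacing 'Connection lost'.

    `make_service` is a hypothetical factory returning a fresh stub bound
    to a new grpclib channel.
    """
    service = make_service()
    while True:
        try:
            return await service.poll_workflow_task_queue(request=request)
        except (StreamTerminatedError, ProtocolError):
            # The old connection is gone; build a new one and retry.
            service = make_service()
            await asyncio.sleep(backoff_seconds)
```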
As we've continued migrating existing applications to Temporal using the python-sdk, it looks like we also need to override the config option `frontend.keepAliveTime`. The Temporal server sets it to 1 minute, some of our activities take longer than 60 seconds, and we get the same "Connection lost" error. Setting it to a longer duration fixes the issue.
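For completeness, here's what our dynamic config ends up looking like with both overrides from this thread (the exact values are illustrative; pick ones that comfortably exceed your longest activity/long-poll duration):

```yaml
frontend.keepAliveTime:
  - value: 5m # server default is 1m; must exceed the longest request
frontend.keepAliveMaxConnectionAge:
  - value: 87600h # 10 years
```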