Connection issue with Temporal 1.9.x
I get a connection issue when running my worker against the latest version of Temporal:
```
2021-05-10 08:37:04,083 | DEBUG | protocol.py:data_received:714 | Protocol error
Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 224, in process_input
    func, target_state = self._transitions[(self.state, input_)]
KeyError: (<ConnectionState.CLOSED: 3>, <ConnectionInputs.RECV_PING: 14>)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/protocol.py", line 712, in data_received
    events = self.connection.feed(data)
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/protocol.py", line 189, in feed
    return self._connection.receive_data(data)  # type: ignore
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 1463, in receive_data
    events.extend(self._receive_frame(frame))
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 1486, in _receive_frame
    frames, events = self._frame_dispatch_table[frame.__class__](frame)
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 1759, in _receive_ping_frame
    events = self.state_machine.process_input(
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/h2/connection.py", line 228, in process_input
    raise ProtocolError(
h2.exceptions.ProtocolError: Invalid input ConnectionInputs.RECV_PING in state ConnectionState.CLOSED

2021-05-10 08:37:04,103 | ERROR | retry.py:retry_loop:29 | run failed: Connection lost, retrying in 6 seconds
Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/client.py", line 360, in recv_initial_metadata
    headers = await self._stream.recv_headers()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/protocol.py", line 342, in recv_headers
    await self.headers_received.wait()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/asyncio/locks.py", line 309, in wait
    await fut
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/retry.py", line 17, in retry_loop
    await fp(*args, **kwargs)
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/decision_loop.py", line 934, in run
    decision_task: PollWorkflowTaskQueueResponse = await self.poll()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/decision_loop.py", line 983, in poll
    task = await self.service.poll_workflow_task_queue(request=poll_decision_request)
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/temporal/api/workflowservice/v1.py", line 844, in poll_workflow_task_queue
    return await self._unary_unary(
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/betterproto/__init__.py", line 1133, in _unary_unary
    response = await stream.recv_message()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/client.py", line 408, in recv_message
    await self.recv_initial_metadata()
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/client.py", line 380, in recv_initial_metadata
    self.initial_metadata = im
  File "/Users/chad.greenburg/.pyenv/versions/3.8.5/lib/python3.8/site-packages/grpclib/utils.py", line 70, in __exit__
    raise self._error
grpclib.exceptions.StreamTerminatedError: Connection lost
```
Here's what I've tracked down:
- Temporal added keep-alive options in 1.9.0.
- The new config option `frontend.keepAliveMaxConnectionAge` defaults to 5 minutes (it wasn't set prior to being added, so it defaulted to infinity in the gRPC Go SDK):
  > MaxConnectionAge is a duration for the maximum amount of time a connection may exist before it will be closed by sending a GoAway.
- This Python SDK uses `grpclib`, which uses `h2` (an HTTP/2 library). h2 doesn't support gracefully closing connections; I believe that explains this issue. (A minimal repro is sketched below.)
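For anyone who wants to see the failure mode without a Temporal server: the sketch below feeds the same frame sequence the frontend produces (SETTINGS, then a GOAWAY when MaxConnectionAge expires, then an in-flight keep-alive PING) straight into a bare `h2` client connection. `hyperframe` is already a dependency of `h2`, and the frame field values here are illustrative; behavior is as of the h2 version in the traceback above:

```python
from h2.config import H2Configuration
from h2.connection import H2Connection
from hyperframe.frame import GoAwayFrame, PingFrame, SettingsFrame

conn = H2Connection(H2Configuration(client_side=True))
conn.initiate_connection()  # sends the client preface; state is now CLIENT_OPEN

# Normal server preface, then the server announces connection shutdown.
conn.receive_data(SettingsFrame(stream_id=0).serialize())
conn.receive_data(GoAwayFrame(stream_id=0, last_stream_id=0, error_code=0).serialize())
# h2 has now moved its connection state machine to CLOSED.

# A keep-alive PING that was already in flight arrives after the GOAWAY.
# There is no (CLOSED, RECV_PING) transition, so h2 raises
# "ProtocolError: Invalid input ConnectionInputs.RECV_PING in state
# ConnectionState.CLOSED" -- the first traceback above.
conn.receive_data(PingFrame(stream_id=0, opaque_data=b"\x00" * 8).serialize())
```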
WORKAROUND
Set `frontend.keepAliveMaxConnectionAge` in the dynamic config to a large value (I'm not sure how to set it to infinity, which was the effective default before the 1.9.0 release):

```yaml
frontend.keepAliveMaxConnectionAge:
  - value: 87600h # 10 years
```
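If you run the server via the official docker-compose setup, the auto-setup image reads the dynamic config location from `DYNAMIC_CONFIG_FILE_PATH`; a sketch of the wiring (image tag, file names, and mount paths here are illustrative):

```yaml
# docker-compose.yml (excerpt)
services:
  temporal:
    image: temporalio/auto-setup:1.9.2
    environment:
      # Path is relative to the server's config root inside the container.
      - DYNAMIC_CONFIG_FILE_PATH=config/dynamicconfig/development.yaml
    volumes:
      # Mount the directory containing the override above.
      - ./dynamicconfig:/etc/temporal/config/dynamicconfig
```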
Chad, you are correct in your investigation; however, the fact that you are seeing errors on the client side is unexpected. We added this configuration to initiate healthy rotation of connections and to better re-balance them over time.
According to the gRPC documentation for keep-alive, when MaxConnectionAge is reached, a GoAway signal is sent to the client; only if the client doesn't close the connection within the MaxConnectionAgeGrace period will the connection be forcibly closed. The idea behind this is that the grace period should be longer than the long-poll request time, which should result in a clean connection closure.
Is it possible that the Python implementation of gRPC you are using is not handling the GoAway signal properly? (this issue?) My expectation is that the client should finish ongoing requests over the old connection and route new requests to sub-channels that use a new connection.
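For reference, the same knobs are exposed as standard gRPC channel arguments; a minimal sketch using grpcio's server options (Temporal's frontend is Go, not Python, so this only illustrates the MaxConnectionAge / MaxConnectionAgeGrace semantics described above, with illustrative values):

```python
from concurrent import futures

import grpc

server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=4),
    options=[
        # After 5 minutes the server sends a GoAway on the connection...
        ("grpc.max_connection_age_ms", 5 * 60 * 1000),
        # ...and only force-closes it if the client hasn't wrapped up within
        # the grace period (which should exceed the long-poll request time).
        ("grpc.max_connection_age_grace_ms", 70 * 1000),
    ],
)
```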
@vitarb this project uses `grpclib`, which uses `h2`. You're spot on that it isn't handling the GoAway signal properly (issue)
What do you think is the right thing to do here? My opinion is that we should aim for an actual fix in Python, because the server defaults seem reasonable given our usage profile and the gRPC spec, and I don't think we should be changing them. Meanwhile, overriding MaxConnectionAge on the server to a large or unlimited value sounds like a good mitigation strategy.
I agree with you. A couple of options are:
- Implement GoAway handling in the h2 library.
- Update to use a grpcio channel instead of the grpclib channel; maybe grpcio's channel handles GoAway signals?
  - This one might require a huge refactor, because the protobufs are compiled using python-betterproto, and it looks like it uses grpclib underneath.
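Until one of those lands, a worker-side band-aid is also conceivable: rebuild the channel whenever the server rotates the connection. A rough sketch only, where `make_service` is a hypothetical factory returning a fresh service stub on a new grpclib channel (the exception types and the poll call are the real ones from the traceback above):

```python
import asyncio

from grpclib.exceptions import StreamTerminatedError
from h2.exceptions import ProtocolError


async def poll_with_reconnect(make_service, request, backoff_seconds=6):
    """Poll the task queue, rebuilding the channel when the server closes
    the connection (GoAway) instead of surfacing 'Connection lost'.

    `make_service` is a hypothetical factory returning a fresh stub bound
    to a new grpclib channel.
    """
    service = make_service()
    while True:
        try:
            return await service.poll_workflow_task_queue(request=request)
        except (StreamTerminatedError, ProtocolError):
            # The old connection is gone; build a new one and retry.
            service = make_service()
            await asyncio.sleep(backoff_seconds)
```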
As we've continued migrating existing applications to Temporal using the python-sdk, it looks like we also need to override the config option `frontend.keepAliveTime`. The Temporal server sets it to 1 minute, some of our activities take longer than 60 seconds, and we get the same "Connection lost" error. Setting it to a longer duration fixes the issue.
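For completeness, here's what our dynamic config ends up looking like with both overrides from this thread (the exact values are illustrative; pick ones that comfortably exceed your longest activity/long-poll duration):

```yaml
frontend.keepAliveTime:
  - value: 5m # server default is 1m; must exceed the longest request
frontend.keepAliveMaxConnectionAge:
  - value: 87600h # 10 years
```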