Linkerd closing connections of long-running gRPC requests
What is the issue?
We're using Temporal.io with Linkerd. The Temporal worker uses long polling: it makes a poll call that blocks for up to 60 seconds, and if no task is received in that time the call returns an empty result and a new call is created immediately.
It seems that Linkerd is closing the connection before Temporal does, and when Temporal then tries to close the connection, it throws the exception shown in the logs below.
It seems most requests are capped at 50 seconds (see the attached image below).
How can it be reproduced?
This can probably be reproduced by sending a gRPC request and having the server wait for a minute before responding. If that doesn't work, install Temporal and inject the proxies; you don't even have to implement any workloads with the SDK, since Temporal's own internal polling (Worker -> Frontend) will hit the error.
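For the first approach, here is a minimal sketch (not the actual Temporal workload) of a Go gRPC server whose handler blocks for about a minute before answering. It reuses the stock `grpc_health_v1` service so no custom protobufs are needed; the port and sleep duration are arbitrary assumptions.

```go
// Minimal sketch of a "long-poll" style server: the health Check handler
// blocks for ~60s before responding, approximating Temporal's poll calls.
package main

import (
	"context"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

type slowHealth struct {
	healthpb.UnimplementedHealthServer
}

// Check waits a minute (or until the client gives up) before responding,
// mimicking a poll that returns an empty result at its deadline.
func (s *slowHealth) Check(ctx context.Context, _ *healthpb.HealthCheckRequest) (*healthpb.HealthCheckResponse, error) {
	select {
	case <-time.After(60 * time.Second):
		return &healthpb.HealthCheckResponse{Status: healthpb.HealthCheckResponse_SERVING}, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

func main() {
	lis, err := net.Listen("tcp", ":7233")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer()
	healthpb.RegisterHealthServer(srv, &slowHealth{})
	log.Fatal(srv.Serve(lis))
}
```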
Logs, error output, etc
Another user who had the same issue posted these logs (they run Java):
io.grpc.StatusRuntimeException: INTERNAL: unexpected error
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271) ~[grpc-stub-1.54.1.jar!/:1.54.1]
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252) ~[grpc-stub-1.54.1.jar!/:1.54.1]
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165) ~[grpc-stub-1.54.1.jar!/:1.54.1]
at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.pollWorkflowTaskQueue(WorkflowServiceGrpc.java:3910) ~[temporal-serviceclient-1.21.1.jar!/:na]
at io.temporal.internal.worker.WorkflowPollTask.doPoll(WorkflowPollTask.java:155) ~[temporal-sdk-1.21.1.jar!/:na]
at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:137) ~[temporal-sdk-1.21.1.jar!/:na]
at io.temporal.internal.worker.WorkflowPollTask.poll(WorkflowPollTask.java:46) ~[temporal-sdk-1.21.1.jar!/:na]
at io.temporal.internal.worker.Poller$PollExecutionTask.run(Poller.java:298) ~[temporal-sdk-1.21.1.jar!/:na]
at io.temporal.internal.worker.Poller$PollLoopTask.run(Poller.java:258) ~[temporal-sdk-1.21.1.jar!/:na]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
at java.base/java.lang.Thread.run(Thread.java:840) ~[na:na]
We are on version 2.14.7, but this issue persists on the latest edge versions as well.
Proxy logs:
[ 1081.455684s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_stack_tracing: service request=Request { method: POST, uri: http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue, version: HTTP/2.0, headers: {"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}, body: BoxBody }
[ 1081.455695s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Grpc(Codes({2, 4, 7, 13, 14, 15})), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7f71b37da900, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 10000 }, bound: 10000 }, rx_waker: AtomicWaker, tx_count: 4, rx_fields: "..." } } } })
[ 1081.455709s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: method=POST uri=http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue version=HTTP/2.0
[ 1081.455714s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: headers={"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}
[ 1085.005443s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
[ 1085.005508s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_detect: Connection completed
[ 1085.005519s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}: linkerd_app_core::serve: Connection closed
[ 1085.005530s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_stack_tracing: drop
[ 1121.833118s] TRACE ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_proxy_balance::discover::from_resolve: poll
[ 1141.459115s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error
Output of `linkerd check -o short`
Status check results are √
Environment
- Kubernetes Version: Server Version: v1.28.11-eks-db838b0
- Cluster Environment: EKS
- Host OS: Mac M2
- Linkerd version: Client version: edge-24.8.2 Server version: edge-24.8.2
Possible solution
It's possible there is a timeout or some other setting that causes Linkerd to disrupt long-running gRPC connections. We tried increasing the overall timeouts (on the proxy and also specifically with GRPCRoutes), with no success.
Additional context
Relevant Slack Conversations: https://linkerd.slack.com/archives/C89RTCWJF/p1723451335174499 https://linkerd.slack.com/archives/C89RTCWJF/p1723658831519119 https://linkerd.slack.com/archives/C89RTCWJF/p1723659323180089 https://linkerd.slack.com/archives/C89RTCWJF/p1724061511236179
My email / Slack user: [email protected] Another user experiencing this (Slack): @Prajith
Would you like to work on fixing this bug?
None
Linkerd doesn't have any default request timeouts. The log messages you include do not indicate any sort of timeout behavior.
[ 1081.455684s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_stack_tracing: service request=Request { method: POST, uri: http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue, version: HTTP/2.0, headers: {"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}, body: BoxBody }
[ 1081.455695s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_proxy_http::classify::channel: state=Some(State { classify: Grpc(Codes({2, 4, 7, 13, 14, 15})), tx: Sender { chan: Tx { inner: Chan { tx: Tx { block_tail: 0x7f71b37da900, tail_position: 0 }, semaphore: Semaphore { semaphore: Semaphore { permits: 10000 }, bound: 10000 }, rx_waker: AtomicWaker, tx_count: 4, rx_fields: "..." } } } })
[ 1081.455709s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: method=POST uri=http://temporal-frontend.temporal.svc.cluster.local:7233/temporal.api.workflowservice.v1.WorkflowService/PollWorkflowTaskQueue version=HTTP/2.0
[ 1081.455714s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}:h2: linkerd_proxy_http::client: headers={"content-type": "application/grpc", "te": "trailers", "user-agent": "grpc-java-netty/1.54.1", "client-version": "1.21.1", "supported-server-versions": ">=0.31.0 <2.0.0", "client-name": "temporal-java", "grpc-accept-encoding": "gzip", "grpc-timeout": "69997880u"}
Four seconds after the request is dispatched, the caller closes the connection:
[ 1085.005443s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http: linkerd_proxy_http::server: The client is shutting down the connection res=Ok(())
[ 1085.005508s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_detect: Connection completed
[ 1085.005519s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}: linkerd_app_core::serve: Connection closed
[ 1085.005530s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}: linkerd_stack_tracing: drop
Later, there is a log indicating that a response includes the l5d-proxy-error header, which means an error was handled by the server-side proxy:
[ 1141.459115s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.109.171:33118 server.addr=172.20.117.227:7233}:proxy{addr=172.20.117.227:7233}:http:service{ns= name=service port=0}:endpoint{addr=10.3.140.90:7233}: linkerd_app_outbound::http::handle_proxy_error_headers: Remote proxy error
You may want to look at the upstream proxy's logs to understand what error is being encountered.
You may also want to use the debug container to understand the TCP flows on both the client and server pods.
I'm unable to reproduce this with an example application that sleeps for 60s; clients successfully observe responses. And in the screenshot you include, we see other endpoints with response times higher than 50s. I'm quite confident that there is no timeout being enforced by Linkerd.
It's possible that idle connections are being lost by the conntrack table. You may want to configure the server to enable HTTP/2 keepalives so that the server actively communicates on connections with sparse request-response traffic. And, as I mentioned, you probably want to understand what error is actually arising that is causing an INTERNAL grpc-status to be set.
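For a Go gRPC server, a minimal sketch of what enabling server-side HTTP/2 keepalives could look like is below. The intervals are illustrative assumptions, not recommendations; adapt them to your workload.

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":7233")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer(
		// Send an HTTP/2 PING when the connection has been idle for 30s so
		// that intermediaries (and conntrack) keep seeing traffic even on
		// sparse long-poll connections.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    30 * time.Second,
			Timeout: 10 * time.Second,
		}),
		// Allow clients to send their own keepalive pings while no RPC is
		// in flight without being treated as abusive.
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             20 * time.Second,
			PermitWithoutStream: true,
		}),
	)
	// Register services here, then serve.
	log.Fatal(srv.Serve(lis))
}
```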
I'm going to close this out since there is nothing actionable for us here, but please let us know if you are able to identify a problem with Linkerd.
@olix0r I just enabled HTTP/2 debug logging and this is what I can see.
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_HEADERS stream=39 len=26
2024/10/11 03:06:45 http2: decoded hpack field header field ":status" = "200"
2024/10/11 03:06:45 http2: decoded hpack field header field "content-type" = "application/grpc"
2024/10/11 03:06:45 http2: decoded hpack field header field "date" = "Fri, 11 Oct 2024 03:06:44 GMT"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read DATA stream=39 len=5 data="\x00\x00\x00\x00\x00"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_STREAM|END_HEADERS stream=39 len=2
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-status" = "0"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-message" = ""
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote WINDOW_UPDATE len=4 (conn) incr=5
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote HEADERS flags=END_HEADERS stream=47 len=11
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read PING flags=ACK len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote DATA flags=END_STREAM stream=47 len=138 data="\x00\x00\x00\x00\x85\n\adefault\x12\x0f\n\vhello-world\x10\x01\x1a87313@vendor-service-sandbox-vendor-comm-workflow-worker@\"\v\n\t\t\x00\x00\x00\x00\x00j\xf8@*\"\n 17f8a4c649c37ff7ffc26b6c67cc30df"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_STREAM|END_HEADERS stream=47 len=54
2024/10/11 03:06:45 http2: decoded hpack field header field ":status" = "200"
2024/10/11 03:06:45 http2: decoded hpack field header field "content-length" = "0"
2024/10/11 03:06:45 http2: decoded hpack field header field "content-type" = "application/grpc"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-status" = "13"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-message" = "unexpected error"
2024/10/11 03:06:45 http2: decoded hpack field header field "l5d-proxy-error" = "unexpected error"
2024/10/11 03:06:45 http2: decoded hpack field header field "date" = "Fri, 11 Oct 2024 03:06:44 GMT"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_HEADERS stream=41 len=3
2024/10/11 03:06:45 http2: decoded hpack field header field ":status" = "200"
2024/10/11 03:06:45 http2: decoded hpack field header field "content-type" = "application/grpc"
2024/10/11 03:06:45 http2: decoded hpack field header field "date" = "Fri, 11 Oct 2024 03:06:44 GMT"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read DATA stream=41 len=5 data="\x00\x00\x00\x00\x00"
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: read HEADERS flags=END_STREAM|END_HEADERS stream=41 len=2
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-status" = "0"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-message" = ""
2024/10/11 03:06:45 http2: Framer 0xc0002002a0: wrote WINDOW_UPDATE len=4 (conn) incr=5
time=2024-10-11T03:06:45.632Z level=WARN source=/home/prajithp/dev/go/pkg/mod/go.temporal.io/[email protected]/internal/internal_worker_base.go:458 msg="Failed to poll for task." Namespace=default TaskQueue=hello-world WorkerID=7313@workflow-worker@ WorkerType=ActivityWorker Error="unexpected error"
@prajithp13 Unfortunately this doesn't really get us any closer to a repro, though it does indicate that your application is seeing an error that originates in linkerd:
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-status" = "13"
2024/10/11 03:06:45 http2: decoded hpack field header field "grpc-message" = "unexpected error"
2024/10/11 03:06:45 http2: decoded hpack field header field "l5d-proxy-error" = "unexpected error"
Do you have the accompanying proxy logs? I would expect the proxy to log a warning when emitting an unexpected error. Note that it's possible that this error originates in the server's proxy pod. It may also be appropriate to configure the relevant workloads with a pod annotation like config.linkerd.io/proxy-log-level: linkerd=debug,h2=trace,info.
Is this issue reliably reproducible? If so, it might be most expedient to try to reduce a minimal reproduction that you can share with us so we can debug this hands-on.
I can only see the following logs in the proxy.
[ 429.553649s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll:FramedRead::poll_next:FramedRead::decode_frame{offset=13}: h2::codec::framed_read: decoding frame from 13B
[ 429.553653s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll:FramedRead::poll_next:FramedRead::decode_frame{offset=13}: h2::codec::framed_read: frame.kind=Reset
[ 429.553658s] DEBUG ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll:FramedRead::poll_next: h2::codec::framed_read: received frame=Reset { stream_id: StreamId(7), error_code: CANCEL }
[ 429.553667s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll: h2::proto::connection: recv RST_STREAM frame=Reset { stream_id: StreamId(7), error_code: CANCEL }
[ 429.553672s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll: h2::proto::streams::state: recv_reset; frame=Reset { stream_id: StreamId(7), error_code: CANCEL }; state=HalfClosedRemote(AwaitingHeaders); queued=false
[ 429.553680s] TRACE ThreadId(01) outbound:accept{client.addr=10.3.113.251:42754 server.addr=172.20.253.28:7233}:proxy{addr=172.20.253.28:7233}:http:server_handshake:Connection{peer=Server}:poll: h2::proto::streams::counts: transition_after; stream=StreamId(7); state=State { inner: Closed(Error(Reset(StreamId(7), CANCEL, Remote))) }; is_closed=true; pending_send_empty=true; buffered_send_data=0; num_recv=4; num_send=0
I can set up a sample application along with steps to reproduce the error. I'll share a sample repository here shortly.
@olix0r Here is the deployment YAML for both the Temporal server and worker: https://raw.githubusercontent.com/Prajithp/hello-temporal-go/refs/heads/main/deployment.yaml. You can apply this and, after waiting for at least 20 minutes, you should see the following error message in the worker pod logs: msg="Failed to poll for task."
Let me know if this works for you!
I'm also working on extracting the logic between the Temporal Worker and Frontend into a simple Go app that reproduces this; I'll share it in a couple of days.
@olix0r did you get a chance to look into it?
Thanks for sharing the repro. I haven’t had a chance to look into it yet, but I’ll update the issue once I have more information.
While I'm still plumbing through some of the internals, I noticed something pretty glaring: these errors reliably manifest around the 5-minute mark. I did a quick search through the Temporal codebase and found this setting:
KeepAliveMaxConnectionAge = NewGlobalDurationSetting(
	"frontend.keepAliveMaxConnectionAge",
	5*time.Minute,
	`KeepAliveMaxConnectionAge is a duration for the maximum amount of time a
connection may exist before it will be closed by sending a GoAway. A
random jitter of +/-10% will be added to MaxConnectionAge to spread out
connection storms.`,
)
I think there's some sort of race condition being encountered that causes us to dispatch requests on connections that have already been terminated by the server, causing the error you see propagated back to the worker.
This will probably help us narrow down a repro independent of Temporal. In the meantime, you may see improvements by increasing the max connection age.
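For reference, Temporal's `frontend.keepAliveMaxConnectionAge` presumably maps to grpc-go's `keepalive.ServerParameters.MaxConnectionAge`, which makes the server send a GOAWAY once a connection reaches that age. Below is a hedged sketch of a plain Go gRPC server configured to recycle connections aggressively, useful for exercising this code path; the port and durations are arbitrary assumptions.

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	lis, err := net.Listen("tcp", ":7233")
	if err != nil {
		log.Fatal(err)
	}
	srv := grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			// Send a GOAWAY and start draining once a connection is 30s old,
			// forcing frequent connection teardown while long-running
			// requests are still in flight.
			MaxConnectionAge:      30 * time.Second,
			MaxConnectionAgeGrace: 10 * time.Second,
		}),
	)
	// Register services (e.g. the slow health server sketched earlier),
	// then serve. Behind Linkerd, clients may then hit the INTERNAL
	// "unexpected error" described in this issue.
	log.Fatal(srv.Serve(lis))
}
```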
You may also benefit from configuring retries on a GRPCRoute so that clients are able to retry these failures.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
I can confirm that setting an arbitrarily long (like 1 year) max connection age seems to work around the issue. Even though it's in the dynamic config, it did require a restart of the Temporal frontend server for it to take effect. I guess the risk of doing this is that load can become unevenly distributed across frontends.
I think I have a repro using the Go gRPC keepalive example.
https://github.com/teejaded/linkerd-grpc-repro/tree/main
It seems that if you start a bunch of long queries with a short MaxConnectionAge, it eventually fails.
It works if you disable Linkerd.
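For context, here is a rough sketch of that kind of load pattern (not the linked repro itself): several concurrent pollers issuing long-running calls over one client connection, against a server like the slow health sketch above, assumed to be listening on localhost:7233 with a short MaxConnectionAge configured.

```go
package main

import (
	"context"
	"log"
	"sync"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	conn, err := grpc.Dial("localhost:7233",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := healthpb.NewHealthClient(conn)

	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				// Each call blocks server-side for ~60s, well past a short
				// MaxConnectionAge, so many streams outlive their connection.
				ctx, cancel := context.WithTimeout(context.Background(), 70*time.Second)
				_, err := client.Check(ctx, &healthpb.HealthCheckRequest{})
				cancel()
				if err != nil {
					log.Printf("poller %d: call failed: %v", id, err)
				}
			}
		}(i)
	}
	wg.Wait()
}
```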
Thanks for the repro! That should help us quite a bit
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.