Persistent fetch operation Failure
Expected Behavior
The worker process picks up workflow and activity tasks once the server is running.
Actual Behavior
The worker process does not pick up any tasks when Temporal is installed via Helm with AWS RDS for PostgreSQL as the database. The Temporal server runs on an on-prem Kubernetes cluster, while the RDS instance is hosted in us-east-1.
Error log from the history deployment:
{"level":"error","ts":"2025-06-11T11:29:11.444Z","msg":"Persistent fetch operation Failure","shard-id":228,"address":"10.233.69.127:7234","wf-namespace-id":"163abb2f-35b8-47c2-b7b3-8648e43a3c70","wf-id":"test_KnowledgeBotWorkflow_2287f9b676b84293ae17549f20557bc9","wf-run-id":"701cde3a-5f11-4a9f-9f6c-99b3d0b45384","store-operation":"get-wf-execution","error":"GetWorkflowExecution: failed to get signal info. Error: Failed to get signal info. Error: context canceled","logging-call-at":"/home/runner/work/docker-builds/docker-builds/temporal/service/history/workflow/transaction_impl.go:477","stacktrace":
Specifications
- Version: 0.52
- Platform: Kubernetes
Related issue:
- https://community.temporal.io/t/worker-process-does-not-pick-workflow-and-activity-tasks/8084
It seems that the Temporal pod is highly sensitive to delays and timeouts when interacting with the database. However, I couldn’t find any environment configuration that could mitigate this issue.
From https://github.com/airbytehq/airbyte/issues/59730
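If the cross-region hop between the on-prem cluster and us-east-1 is the suspect, a quick probe run from inside the cluster can show how slow database round trips actually are. The sketch below is hypothetical: the host, credentials, and database name are placeholders (not values from this issue), and the stock postgres image's psql client is used only for its \timing output.

```yaml
# Hypothetical one-off pod for measuring round-trip latency from the
# on-prem cluster to the RDS endpoint. Host, user, password, and
# database name are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: rds-latency-check
spec:
  restartPolicy: Never
  containers:
    - name: psql
      image: postgres:16
      env:
        - name: PGPASSWORD
          value: "changeme"  # placeholder; use a Secret in practice
      command:
        - psql
        - '--host=mydb.xxxxxxxx.us-east-1.rds.amazonaws.com'  # placeholder
        - '--username=temporal'
        - '--dbname=temporal'
        - '-c'
        - '\timing on'
        - '-c'
        - 'SELECT 1;'
```

Apply the manifest and read the `Time:` line with `kubectl logs rds-latency-check`. Single-digit milliseconds is typical in-region; tens of milliseconds per query is common across regions, and that cost multiplies over the many persistence queries a single workflow task involves.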
Assuming the delays and timeouts are caused by load from the Temporal server, I would suggest tuning some dynamic configuration values to rate-limit the number of requests sent to the database and keep it in a healthy state. For example, history.persistenceGlobalMaxQPS controls the total RPS of persistence requests that all history service hosts combined can send.
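A minimal sketch of where that knob could live, assuming the Helm chart's server.dynamicConfig values key renders into the standard Temporal dynamic config file (the figure 3000 is purely illustrative, not a recommendation):

```yaml
# values.yaml excerpt (sketch): cap the aggregate persistence request
# rate from the history service. 3000 is an illustrative starting
# point; tune against what the RDS instance can actually sustain.
server:
  dynamicConfig:
    history.persistenceGlobalMaxQPS:
      - value: 3000
        constraints: {}
```

The global cap spreads the budget across all history hosts; a per-host history.persistenceMaxQPS knob also exists if finer control is needed. A reasonable approach is to start from a rate the RDS instance handles comfortably and adjust from there while watching its load metrics.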
Please also consider joining our open-source Slack channel and/or forum (https://temporal.io/community). Our support team has a lot of experience helping with self-hosting issues.
Closing the issue.