Persistent fetch operation Failure
Expected Behavior
The worker process picks up workflow and activity tasks once the server is running.
Actual Behavior
The worker process does not pick up any tasks when Temporal is installed via Helm with AWS RDS for PostgreSQL as the database. The Temporal server runs on an on-prem Kubernetes cluster, while the RDS instance is hosted in us-east-1.
Error log from the history deployment:
{"level":"error","ts":"2025-06-11T11:29:11.444Z","msg":"Persistent fetch operation Failure","shard-id":228,"address":"10.233.69.127:7234","wf-namespace-id":"163abb2f-35b8-47c2-b7b3-8648e43a3c70","wf-id":"test_KnowledgeBotWorkflow_2287f9b676b84293ae17549f20557bc9","wf-run-id":"701cde3a-5f11-4a9f-9f6c-99b3d0b45384","store-operation":"get-wf-execution","error":"GetWorkflowExecution: failed to get signal info. Error: Failed to get signal info. Error: context canceled","logging-call-at":"/home/runner/work/docker-builds/docker-builds/temporal/service/history/workflow/transaction_impl.go:477","stacktrace":
Specifications
- Version: 0.52
- Platform: Kubernetes
Related issue:
- https://community.temporal.io/t/worker-process-does-not-pick-workflow-and-activity-tasks/8084
It seems that the Temporal pod is highly sensitive to delays and timeouts when interacting with the database. However, I couldn’t find any environment configuration that could mitigate this issue.
From https://github.com/airbytehq/airbyte/issues/59730
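If the cross-region hop between the on-prem cluster and us-east-1 is the suspect, a quick probe run from inside the cluster can show how slow database round trips actually are. The sketch below is hypothetical: the host, credentials, and database name are placeholders (not values from this issue), and the stock postgres image's psql client is used only for its \timing output.

```yaml
# Hypothetical one-off pod for measuring round-trip latency from the
# on-prem cluster to the RDS endpoint. Host, user, password, and
# database name are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: rds-latency-check
spec:
  restartPolicy: Never
  containers:
    - name: psql
      image: postgres:16
      env:
        - name: PGPASSWORD
          value: "changeme"  # placeholder; use a Secret in practice
      command:
        - psql
        - '--host=mydb.xxxxxxxx.us-east-1.rds.amazonaws.com'  # placeholder
        - '--username=temporal'
        - '--dbname=temporal'
        - '-c'
        - '\timing on'
        - '-c'
        - 'SELECT 1;'
```

Apply the manifest and read the `Time:` line with `kubectl logs rds-latency-check`. Single-digit milliseconds is typical in-region; tens of milliseconds per query is common across regions, and that cost multiplies over the many persistence queries a single workflow task involves.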
Assuming the delays and timeouts are caused by load from the Temporal server, I would suggest tuning some dynamic configuration values to rate-limit the number of requests sent to the database and keep it in a healthy state. For example, history.persistenceGlobalMaxQPS controls the total RPS of persistence requests that all history service hosts combined can send.
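A minimal sketch of where that knob could live, assuming the Helm chart's server.dynamicConfig values key renders into the standard Temporal dynamic config file (the figure 3000 is purely illustrative, not a recommendation):

```yaml
# values.yaml excerpt (sketch): cap the aggregate persistence request
# rate from the history service. 3000 is an illustrative starting
# point; tune against what the RDS instance can actually sustain.
server:
  dynamicConfig:
    history.persistenceGlobalMaxQPS:
      - value: 3000
        constraints: {}
```

The global cap spreads the budget across all history hosts; a per-host history.persistenceMaxQPS knob also exists if finer control is needed. A reasonable approach is to start from a rate the RDS instance handles comfortably and adjust from there while watching its load metrics.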
Please also consider joining our open-source Slack channel and/or forum (https://temporal.io/community). Our support team has a lot of experience helping with self-hosting issues.
Closing the issue.