`airbyte-cron`: RESOURCE_EXHAUSTED namespace rate limit exceeded
Topic: Temporal issue
Relevant information:
Airbyte version: 0.50.21
We are observing an abnormal number of rate limit errors from airbyte-cron. We are not using Airbyte schedulers; only one cron job is set up in the Airbyte UI.
The following error message is emitted every few seconds as soon as we start docker compose:
```
io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: namespace rate limit exceeded
	at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271) ~[grpc-stub-1.54.0.jar:1.54.0]
	at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252) ~[grpc-stub-1.54.0.jar:1.54.0]
	at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165) ~[grpc-stub-1.54.0.jar:1.54.0]
	at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.listClosedWorkflowExecutions(WorkflowServiceGrpc.java:4011) ~[temporal-serviceclient-1.17.0.jar:?]
	at io.airbyte.commons.temporal.TemporalClient.fetchClosedWorkflowsByStatus(TemporalClient.java:127) ~[io.airbyte-airbyte-commons-temporal-0.50.21.jar:?]
	at io.airbyte.commons.temporal.TemporalClient.restartClosedWorkflowByStatus(TemporalClient.java:105) ~[io.airbyte-airbyte-commons-temporal-0.50.21.jar:?]
	at io.airbyte.cron.jobs.SelfHealTemporalWorkflows.cleanTemporal(SelfHealTemporalWorkflows.java:40) ~[io.airbyte-airbyte-cron-0.50.21.jar:?]
	at io.airbyte.cron.jobs.$SelfHealTemporalWorkflows$Definition$Exec.dispatch(Unknown Source) ~[io.airbyte-airbyte-cron-0.50.21.jar:?]
	at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invoke(AbstractExecutableMethodsDefinition.java:371) ~[micronaut-inject-3.9.4.jar:3.9.4]
	at io.micronaut.inject.DelegatingExecutableMethod.invoke(DelegatingExecutableMethod.java:76) ~[micronaut-inject-3.9.4.jar:3.9.4]
	at io.micronaut.scheduling.processor.ScheduledMethodProcessor.lambda$process$5(ScheduledMethodProcessor.java:127) ~[micronaut-context-3.9.4.jar:3.9.4]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577) ~[?:?]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1589) ~[?:?]
```
Somehow this problem goes away after deleting the temporal and temporal_visibility databases in the Postgres instance created by the Airbyte deployment and restarting with the run-ab-platform.sh script.
I'm not sure it's a definitive fix, but it's worth a try if you run into the same problem.
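For anyone wanting to try it, here is a rough sketch of those steps for a docker compose deployment. The container name and credentials (`airbyte-db`, user/password `docker`) are the compose defaults and may differ in your setup:

```bash
# Sketch only: container name and credentials are the docker compose
# defaults and may differ in your deployment.
docker compose stop airbyte-cron airbyte-worker airbyte-temporal

# Drop the two Temporal databases; temporal auto-setup recreates them
# on the next start.
docker exec -it airbyte-db psql -U docker -d postgres \
  -c 'DROP DATABASE temporal;' \
  -c 'DROP DATABASE temporal_visibility;'

# Restart the platform.
./run-ab-platform.sh
```

Note the warning later in this thread: running syncs can end up stuck after this, so pause all connections first.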
Similar discussion: https://github.com/airbytehq/airbyte/discussions/30472
We experienced this issue as well with Helm chart version 0.50.20 in multiple environments. Completing these steps resolved it for us:
- Helm uninstall Airbyte
- Delete the Airbyte namespace
- Delete the temporal and temporal_visibility databases (external Postgres)
- Reinstall Airbyte
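A sketch of those steps, assuming a release and namespace both named `airbyte` and placeholder connection details for the external Postgres:

```bash
# Sketch only: release/namespace names and Postgres connection details
# are assumptions; substitute your own values.
helm uninstall airbyte -n airbyte
kubectl delete namespace airbyte

# Against the external Postgres instance; the databases are recreated
# by temporal auto-setup on the next start.
psql -h <postgres-host> -U <admin-user> -d postgres \
  -c 'DROP DATABASE temporal;' \
  -c 'DROP DATABASE temporal_visibility;'

helm install airbyte airbyte/airbyte -n airbyte --create-namespace
```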
@marcosmarxm we're seeing this ever since we updated from 0.44.0 to 0.57.1, OSS. The Airbyte installation is unstable and I think this is connected:
- The airbyte-worker service went into a reboot loop last Friday
- The logs are never rotated and quickly use up all of the disk

What might be the downsides of @joeybenamy's approach with deleting the temporal databases?
After deleting the temporal databases, there is a chance of some running sync jobs getting stuck. More specifically, cannot be run or canceled. AFAIK you will have to reset the connector to fix it.
@TimothyZhang7 thanks! Actually, it went mostly ok. I saw a few log entries about a mismatch for some of the running sync job statuses, but it's been running smoothly ever since.
That said, we still have the same problem with log rotation; it didn't go away.
Yes, I should have mentioned that we don't do maintenance like this in Airbyte without stopping and pausing all syncs.
Hello all 👋 I reported this to the eng team. @joeybenamy are you still experiencing the issue?
We have not encountered this issue in quite some time. Thanks for checking!
@marcosmarxm What was the final recommendation/solution for fixing this issue? Or will an official solution be included in the next release?
@marcosmarxm I have upgraded to 0.60.0, but I am still facing the rate limit error.
I increased some Temporal config values, which I got from the Temporal community, and reduced the number of workers (10 → 3). The error disappeared.
https://community.temporal.io/t/resource-exhausted-namespace-rate-limit-exceeded-for-cron-job/7583
```yaml
# when modifying, remember to update the docker-compose version of this file in temporal/dynamicconfig/development.yaml
frontend.namespaceCount:
  - value: 4096
    constraints: {}
frontend.namespaceRPS.visibility:
  - value: 100
    constraints: {}
frontend.namespaceBurst.visibility:
  - value: 150
    constraints: {}
frontend.namespaceRPS:
  - value: 76800
    constraints: {}
```
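For docker compose deployments, the comment at the top of that snippet points at where the file lives. As a rough sketch of how it is wired up (service name, env var, and mount paths follow the stock Airbyte docker-compose.yaml and the temporalio auto-setup image, so verify against your own file):

```yaml
# Sketch only: based on the stock Airbyte docker-compose.yaml;
# verify the paths and variable names against your deployment.
airbyte-temporal:
  image: airbyte/temporal:${VERSION}
  environment:
    - DYNAMIC_CONFIG_FILE_PATH=config/dynamicconfig/development.yaml
  volumes:
    - ./temporal/dynamicconfig:/etc/temporal/config/dynamicconfig
```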
@sivankumar86 did you add these values to the ./temporal/dynamicconfig/development.yaml file? When I add them, Airbyte fails to start correctly, throwing a ton of "Failed to resolve name" errors.
After upgrading to 0.60.0 we still encounter this. If it's related to the number of workers, here is our config: `MAX_SYNC_WORKERS=10 MAX_SPEC_WORKERS=10 MAX_CHECK_WORKERS=10 MAX_DISCOVER_WORKERS=10 MAX_NOTIFY_WORKERS=5 SHOULD_RUN_NOTIFY_WORKFLOWS=true`
@walker-philips I meant the replica count. Find my config below for reference if it helps. Verify using:

```bash
k describe cm airbyte-oss-temporal-dynamicconfig  # airbyte-oss is the name of the deployment
```
```yaml
worker:
  enabled: true
  replicaCount: 3
```
```yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "common.names.fullname" . }}-dynamicconfig
  labels:
    {{- include "airbyte.labels" . | nindent 4 }}
data:
  "development.yaml": |
    # when modifying, remember to update the docker-compose version of this file in temporal/dynamicconfig/development.yaml
    frontend.namespaceCount:
      - value: 4096
        constraints: {}
    frontend.namespaceRPS.visibility:
      - value: 100
        constraints: {}
    frontend.namespaceBurst.visibility:
      - value: 150
        constraints: {}
    frontend.namespaceRPS:
      - value: 76800
        constraints: {}
    frontend.enableClientVersionCheck:
      - value: true
        constraints: {}
    history.persistenceMaxQPS:
      - value: 3000
        constraints: {}
    frontend.persistenceMaxQPS:
      - value: 5000
        constraints: {}
    frontend.historyMgrNumConns:
      - value: 30
        constraints: {}
    frontend.throttledLogRPS:
      - value: 200
        constraints: {}
    history.historyMgrNumConns:
      - value: 50
        constraints: {}
    system.advancedVisibilityWritingMode:
      - value: "off"
        constraints: {}
    history.defaultActivityRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    history.defaultWorkflowRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    # Limit for responses. This mostly impacts discovery jobs since they have the largest responses.
    limit.blobSize.error:
      - value: 15728640 # 15MB
        constraints: {}
    limit.blobSize.warn:
      - value: 10485760 # 10MB
        constraints: {}
```
@walker-philips Could you restart the temporal pod after applying the changes, if you haven't already?
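If it helps, a minimal sketch of that restart (the deployment names are assumptions; check `kubectl get deploy` in your namespace):

```bash
# Sketch only: deployment and namespace names may differ per install.
kubectl rollout restart deployment airbyte-temporal -n airbyte
kubectl rollout restart deployment airbyte-cron -n airbyte
```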
@sivankumar86 Could you please explain how to inject new key-value pairs into the Temporal dynamicconfig ConfigMap via the Helm chart? I don't think it is supported by the Helm chart.
@msenmurugan I download the Helm chart and modify it before deploying it in the CI/CD pipeline.
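A rough sketch of that workflow (the repo URL and chart name are the public Airbyte chart defaults; adjust to your setup):

```bash
# Sketch only: assumes the public Airbyte Helm repo and a release
# named "airbyte" in namespace "airbyte".
helm repo add airbyte https://airbytehq.github.io/helm-charts
helm pull airbyte/airbyte --untar
# Edit the temporal dynamicconfig template inside the unpacked chart,
# then deploy the local copy:
helm upgrade --install airbyte ./airbyte -n airbyte
```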
@marcosmarxm any update on this issue? We have a similar issue each time we upgrade the Airbyte version. For now, I have to:
- delete the temporal db
- delete the temporal_visibility db
- manually re-run all connections
Just noticed this error again. @marcosmarxm could you clarify what issues this might cause?
@airbytehq/platform-move can someone take a look into this issue and check if it is possible to include it in the next sprint?
This is happening because the out-of-the-box temporal deployment is overloaded. The auto-setup bundled with OSS Airbyte only has 1 temporal pod. Unfortunately we aren't going to work on tuning these defaults anytime soon.
There are several reasons:
- Docker will be deprecated in ~a month.
- We are also transitioning to a different worker model, which should alleviate some of the worker restart loop concerns here.
- Since there are a lot of changes scheduled, we want to wait for the dust to settle before revisiting these defaults.
I'd recommend @sivankumar86's fix for now.
One clarification: we do not recommend deleting the temporal and temporal_visibility databases, as doing so will cause some disruption to jobs. It should be sufficient to modify the Temporal ConfigMap as @sivankumar86 suggested here.
Is this still an issue with v1.1?
EDIT: still an issue
No more issues since we migrated to v1 with abctl.
It is still an issue if someone upgrades with Helm from an earlier non-v1 version. Editing the proposed cm airbyte-oss-temporal-dynamicconfig and recreating the cron and temporal pods does not help.
Any suggestions on what to do when using Helm to install Airbyte?