`airbyte-cron`: RESOURCE_EXHAUSTED namespace rate limit exceeded

Open TimothyZhang7 opened this issue 2 years ago • 25 comments

Topic

Temporal issue

Relevant information

Airbyte version: 0.50.21

We are observing an abnormal number of rate limit errors from airbyte-cron. We are not using Airbyte schedulers; only one cron job is set up in the Airbyte UI.

The following error message is emitted every few seconds, as soon as we start the docker compose stack.

io.grpc.StatusRuntimeException: RESOURCE_EXHAUSTED: namespace rate limit exceeded
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271) ~[grpc-stub-1.54.0.jar:1.54.0]
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252) ~[grpc-stub-1.54.0.jar:1.54.0]
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165) ~[grpc-stub-1.54.0.jar:1.54.0]
    at io.temporal.api.workflowservice.v1.WorkflowServiceGrpc$WorkflowServiceBlockingStub.listClosedWorkflowExecutions(WorkflowServiceGrpc.java:4011) ~[temporal-serviceclient-1.17.0.jar:?]
    at io.airbyte.commons.temporal.TemporalClient.fetchClosedWorkflowsByStatus(TemporalClient.java:127) ~[io.airbyte-airbyte-commons-temporal-0.50.21.jar:?]
    at io.airbyte.commons.temporal.TemporalClient.restartClosedWorkflowByStatus(TemporalClient.java:105) ~[io.airbyte-airbyte-commons-temporal-0.50.21.jar:?]
    at io.airbyte.cron.jobs.SelfHealTemporalWorkflows.cleanTemporal(SelfHealTemporalWorkflows.java:40) ~[io.airbyte-airbyte-cron-0.50.21.jar:?]
    at io.airbyte.cron.jobs.$SelfHealTemporalWorkflows$Definition$Exec.dispatch(Unknown Source) ~[io.airbyte-airbyte-cron-0.50.21.jar:?]
    at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invoke(AbstractExecutableMethodsDefinition.java:371) ~[micronaut-inject-3.9.4.jar:3.9.4]
    at io.micronaut.inject.DelegatingExecutableMethod.invoke(DelegatingExecutableMethod.java:76) ~[micronaut-inject-3.9.4.jar:3.9.4]
    at io.micronaut.scheduling.processor.ScheduledMethodProcessor.lambda$process$5(ScheduledMethodProcessor.java:127) ~[micronaut-context-3.9.4.jar:3.9.4]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:577) ~[?:?]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:358) ~[?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
    at java.lang.Thread.run(Thread.java:1589) ~[?:?]

TimothyZhang7 avatar Sep 22 '23 16:09 TimothyZhang7

Somehow this problem goes away after deleting the temporal and temporal_visibility databases in the Postgres instance created by the Airbyte deployment and restarting the instance with the run-ab-platform.sh script. Not sure if it is a definitive fix, but worth a try if you run into the same problem.
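
If you want to script this workaround, here is a minimal sketch for a docker compose deployment. The db service, airbyte-db container name, and docker superuser are assumptions based on the stock docker-compose.yaml, so verify them against yours, and pause all syncs first (running jobs can end up stuck, see below):

    # Stop the stack so nothing holds connections to the Temporal databases,
    # then bring only Postgres back up (service name "db" in the stock compose file)
    docker compose stop
    docker compose start db

    # Drop both Temporal databases; Temporal's auto-setup recreates them on boot
    docker exec airbyte-db psql -U docker -d postgres \
      -c "DROP DATABASE temporal;" \
      -c "DROP DATABASE temporal_visibility;"

    # Restart the instance
    ./run-ab-platform.sh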

TimothyZhang7 avatar Oct 03 '23 16:10 TimothyZhang7

Similar discussion: https://github.com/airbytehq/airbyte/discussions/30472

marcosmarxm avatar Nov 10 '23 14:11 marcosmarxm

We experienced this issue as well with Helm chart version 0.50.20 in multiple environments. Completing these steps (sketched below) resolved it for us:

  1. Helm uninstall Airbyte
  2. Delete the Airbyte namespace
  3. Delete the temporal and temporal_visibility databases (external Postgres)
  4. Reinstall Airbyte
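
A sketch of those steps, assuming a release and namespace both named airbyte and admin access to the external Postgres (adjust names, hosts, and values file to your environment):

    # Steps 1-2: uninstall the release and remove the namespace
    helm uninstall airbyte -n airbyte
    kubectl delete namespace airbyte

    # Step 3: drop the Temporal databases on the external Postgres
    psql -h <postgres-host> -U <admin-user> -d postgres \
      -c "DROP DATABASE temporal;" \
      -c "DROP DATABASE temporal_visibility;"

    # Step 4: reinstall; Temporal recreates both databases on first boot
    helm install airbyte airbyte/airbyte -n airbyte --create-namespace -f values.yaml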

joeybenamy avatar Feb 13 '24 14:02 joeybenamy

@marcosmarxm we're seeing this ever since we updated from 0.44.0 to 0.57.1, OSS. The Airbyte installation is unstable and I think this is connected:

  • The airbyte-worker service went into a reboot loop last Friday
  • The logs are never rotated and quickly use up all of the disk

What might be the downsides of @joeybenamy's approach with deleting the temporal databases?

killthekitten avatar Apr 22 '24 07:04 killthekitten

> What might be the downsides of @joeybenamy's approach with deleting the temporal databases?

After deleting the temporal databases, there is a chance of some running sync jobs getting stuck; more specifically, they can no longer be run or canceled. AFAIK you will have to reset the connection to fix it.
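
If that happens, the reset can be triggered from the connection's "Reset your data" action in the UI or, as a sketch, against the internal configuration API; the port, path, and payload below are assumptions based on a stock docker compose install, so treat them as illustrative:

    # Hypothetical example: reset a stuck connection via the internal config API
    curl -X POST http://localhost:8001/api/v1/connections/reset \
      -H "Content-Type: application/json" \
      -d '{"connectionId": "<connection-uuid>"}'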

TimothyZhang7 avatar May 06 '24 03:05 TimothyZhang7

@TimothyZhang7 thanks! Actually, it went mostly ok. I saw a few log entries about a mismatch for some of the running sync job statuses, but it's been running smoothly ever since.

That said, we still have the same problem with log rotation; it didn't go away.

killthekitten avatar May 06 '24 09:05 killthekitten

> What might be the downsides of @joeybenamy's approach with deleting the temporal databases?
>
> After deleting the temporal databases, there is a chance of some running sync jobs getting stuck.

Yes, I should have mentioned that we don't do maintenance like this in Airbyte without stopping and pausing all syncs.

joeybenamy avatar May 06 '24 14:05 joeybenamy

Hello all 👋 I reported this to the eng team. @joeybenamy are you still experiencing the issue?

marcosmarxm avatar May 09 '24 14:05 marcosmarxm

> @joeybenamy are you still experiencing the issue?

We have not encountered this issue in quite some time. Thanks for checking!

joeybenamy avatar May 09 '24 15:05 joeybenamy

@marcosmarxm What was the final recommendation/solution for fixing this issue? Or will an official solution be included in the next release?

walker-philips avatar May 14 '24 13:05 walker-philips

@marcosmarxm I have upgraded to 0.60.0, but I am still facing the rate limit error.

sivankumar86 avatar May 21 '24 02:05 sivankumar86

I increased some Temporal config values, which I got from the Temporal community, and reduced the number of workers (10 → 3). The error disappeared.

https://community.temporal.io/t/resource-exhausted-namespace-rate-limit-exceeded-for-cron-job/7583

    # when modifying, remember to update the docker-compose version of this file in temporal/dynamicconfig/development.yaml
    frontend.namespaceCount:
      - value: 4096
        constraints: {}
    frontend.namespaceRPS.visibility:
      - value: 100
        constraints: {}
    frontend.namespaceBurst.visibility:
      - value: 150
        constraints: {}
    frontend.namespaceRPS:
      - value: 76800
        constraints: {}
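
For docker compose deployments, these values belong in the file named in the comment above; a sketch of applying them, with the Temporal service name assumed from the stock docker-compose.yaml:

    # Edit the dynamic config shipped with the platform files, then restart
    # Temporal so it reloads the limits
    vi temporal/dynamicconfig/development.yaml
    docker compose restart airbyte-temporal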

sivankumar86 avatar May 23 '24 03:05 sivankumar86

@sivankumar86 did you add these values to the ./temporal/dynamicconfig/development.yaml file? When I add these values, Airbyte fails to start correctly, throwing a ton of "Failed to resolve name" errors.

After upgrading to 0.60.0, we still encounter this. If it's related to the number of workers, here is our config:

    MAX_SYNC_WORKERS=10
    MAX_SPEC_WORKERS=10
    MAX_CHECK_WORKERS=10
    MAX_DISCOVER_WORKERS=10
    MAX_NOTIFY_WORKERS=5
    SHOULD_RUN_NOTIFY_WORKFLOWS=true

walker-philips avatar May 23 '24 15:05 walker-philips

@walker-philips I meant the replica count. Find my config below for reference, if it helps. Verify using:

    k describe cm airbyte-oss-temporal-dynamicconfig  # airbyte-oss is the name of the deployment

worker:
  enabled: true
  replicaCount: 3

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "common.names.fullname" . }}-dynamicconfig
  labels:
    {{- include "airbyte.labels" . | nindent 4 }}
data:
  "development.yaml": |
    # when modifying, remember to update the docker-compose version of this file in temporal/dynamicconfig/development.yaml
    frontend.namespaceCount:
      - value: 4096
        constraints: {}
    frontend.namespaceRPS.visibility:
      - value: 100
        constraints: {}
    frontend.namespaceBurst.visibility:
      - value: 150
        constraints: {}
    frontend.namespaceRPS:
      - value: 76800
        constraints: {}
    frontend.enableClientVersionCheck:
      - value: true
        constraints: {}
    history.persistenceMaxQPS:
      - value: 3000
        constraints: {}
    frontend.persistenceMaxQPS:
      - value: 5000
        constraints: {}
    frontend.historyMgrNumConns:
      - value: 30
        constraints: {}
    frontend.throttledLogRPS:
      - value: 200
        constraints: {}
    history.historyMgrNumConns:
      - value: 50
        constraints: {}
    system.advancedVisibilityWritingMode:
      - value: "off"
        constraints: {}
    history.defaultActivityRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    history.defaultWorkflowRetryPolicy:
      - value:
          InitialIntervalInSeconds: 1
          MaximumIntervalCoefficient: 100.0
          BackoffCoefficient: 2.0
          MaximumAttempts: 0
    # Limit for responses. This mostly impacts discovery jobs since they have the largest responses.
    limit.blobSize.error:
      - value: 15728640 # 15MB
        constraints: {}
    limit.blobSize.warn:
      - value: 10485760 # 10MB
        constraints: {}

sivankumar86 avatar May 23 '24 17:05 sivankumar86

@walker-philips Could you restart the temporal pod after applying the changes, if you have not done so yet?
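
For example (deployment name assumed from the airbyte-oss release naming used above):

    kubectl rollout restart deployment/airbyte-oss-temporal
    kubectl rollout status deployment/airbyte-oss-temporal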

sivankumar86 avatar May 23 '24 17:05 sivankumar86

@sivankumar86 Could you please explain how to inject new key-value pairs into the Temporal dynamicconfig ConfigMap via the Helm chart? I don't think it is supported by the Helm chart.

msenmurugan avatar Jun 01 '24 16:06 msenmurugan

@msenmurugan I download the Helm chart and modify it before deploying in our CI/CD pipeline.
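
A sketch of that flow, assuming the public Airbyte chart repo and a release named airbyte:

    # Fetch and unpack the chart so its templates can be edited locally
    helm repo add airbyte https://airbytehq.github.io/helm-charts
    helm pull airbyte/airbyte --untar

    # Edit the Temporal dynamicconfig template in the unpacked chart, then
    # install from the local directory instead of the remote repo
    helm upgrade --install airbyte ./airbyte -n airbyte -f values.yaml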

sivankumar86 avatar Jun 01 '24 22:06 sivankumar86

@marcosmarxm any update on this issue? We hit a similar issue each time we upgrade the Airbyte version. For now, I have to:

  • delete the temporal db
  • delete the temporal_visibility db
  • manually rerun all connections

lideke avatar Jul 01 '24 08:07 lideke

Just noticed this error again. @marcosmarxm could you clarify what issues might this cause?

killthekitten avatar Jul 17 '24 18:07 killthekitten

@airbytehq/platform-move can someone take a look into this issue and check if it is possible to include in next sprint?

marcosmarxm avatar Jul 17 '24 18:07 marcosmarxm

This is happening because the out-of-the-box temporal deployment is overloaded. The auto-setup bundled with OSS Airbyte only has 1 temporal pod. Unfortunately we aren't going to work on tuning these defaults anytime soon.

There are several reasons:

  • Docker will be deprecated in ~a month.
  • We are also transitioning to a different worker model, which should alleviate some of the worker restart loop concerns here.
  • Since there are a lot of changes scheduled, we want to wait for the dust to settle before revisiting these defaults.

I'd recommend @sivankumar86's fix for now.

davinchia avatar Jul 17 '24 22:07 davinchia

One clarification: we do not recommend deleting the temporal and temporal_visibility databases, as it will cause some disruption to jobs. It should be sufficient to modify the temporal ConfigMap as @sivankumar86 suggested here.

davinchia avatar Aug 28 '24 19:08 davinchia

Is this still an issue with v1.1?

EDIT: still an issue

dimisjim avatar Oct 11 '24 10:10 dimisjim

No more issues since we migrated to v1 with abctl.

lideke avatar Oct 14 '24 15:10 lideke

It is still an issue if someone upgrades with Helm from an earlier non-v1 version. Editing the proposed airbyte-oss-temporal-dynamicconfig ConfigMap and recreating the cron and temporal pods does not help.

Any suggestions on what to do when using Helm to install Airbyte?

dimisjim avatar Oct 14 '24 15:10 dimisjim