[helm] Pod creation fails with timeout

Open jnatten opened this issue 1 year ago • 4 comments

Helm Chart Version

0.445.3

What step the error happened?

During the Sync

Relevant information

When running a sync job, creating a new destination, or doing anything else that spawns a new pod, the frontend reports an unknown error (HTTP 504) and the log below appears.

I have a similar test cluster with the exact same configuration that works just fine. I have also attempted a completely fresh Airbyte install in a new namespace.

Running on AWS EKS if it matters.

Any suggestions on how to fix it or how I should continue debugging would be greatly appreciated!

Relevant log output

2024-08-20 09:24:38 ERROR i.a.w.l.p.h.FailureHandler(apply):39 - Pipeline Error
io.airbyte.workload.launcher.pipeline.stages.model.StageError: io.airbyte.workload.launcher.pods.KubeClientException: Failed to create pod source-file-check-b43bf659-7773-4cf5-b204-8c37bd657c20-0-izuis.
  at io.airbyte.workload.launcher.pipeline.stages.model.Stage.apply(Stage.kt:46) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.apply(LaunchPodStage.kt:38) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.$LaunchPodStage$Definition$Intercepted.$$access$$apply(Unknown Source) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.$LaunchPodStage$Definition$Exec.dispatch(Unknown Source) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.micronaut.context.AbstractExecutableMethodsDefinition$DispatchedExecutableMethod.invoke(AbstractExecutableMethodsDefinition.java:456) ~[micronaut-inject-4.5.4.jar:4.5.4]
  at io.micronaut.aop.chain.MethodInterceptorChain.proceed(MethodInterceptorChain.java:129) ~[micronaut-aop-4.5.4.jar:4.5.4]
  at io.airbyte.metrics.interceptors.InstrumentInterceptorBase.doIntercept(InstrumentInterceptorBase.kt:61) ~[io.airbyte.airbyte-metrics-metrics-lib-0.63.18.jar:?]
  at io.airbyte.metrics.interceptors.InstrumentInterceptorBase.intercept(InstrumentInterceptorBase.kt:44) ~[io.airbyte.airbyte-metrics-metrics-lib-0.63.18.jar:?]
  at io.micronaut.aop.chain.MethodInterceptorChain.proceed(MethodInterceptorChain.java:138) ~[micronaut-aop-4.5.4.jar:4.5.4]
  at io.airbyte.workload.launcher.pipeline.stages.$LaunchPodStage$Definition$Intercepted.apply(Unknown Source) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.apply(LaunchPodStage.kt:24) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:132) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:158) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:158) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:158) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.Operators$ScalarSubscription.request(Operators.java:2571) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.set(Operators.java:2367) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onSubscribe(FluxOnErrorResume.java:74) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap$FlatMapMain.onSubscribe(MonoFlatMap.java:117) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.FluxFlatMap.trySubscribeScalarMap(FluxFlatMap.java:193) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoFlatMap.subscribeOrReturn(MonoFlatMap.java:53) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.Mono.subscribe(Mono.java:4552) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoSubscribeOn$SubscribeOnSubscriber.run(MonoSubscribeOn.java:126) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.scheduler.ImmediateScheduler$ImmediateSchedulerWorker.schedule(ImmediateScheduler.java:84) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.MonoSubscribeOn.subscribeOrReturn(MonoSubscribeOn.java:55) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.Mono.subscribe(Mono.java:4552) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.Mono.subscribeWith(Mono.java:4634) ~[reactor-core-3.6.8.jar:3.6.8]
  at reactor.core.publisher.Mono.subscribe(Mono.java:4395) ~[reactor-core-3.6.8.jar:3.6.8]
  at io.airbyte.workload.launcher.pipeline.LaunchPipeline.accept(LaunchPipeline.kt:50) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.consumer.LauncherMessageConsumer.consume(LauncherMessageConsumer.kt:28) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.consumer.LauncherMessageConsumer.consume(LauncherMessageConsumer.kt:12) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.commons.temporal.queue.QueueActivityImpl.consume(Internal.kt:87) ~[io.airbyte-airbyte-commons-temporal-core-0.63.18.jar:?]
  at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103) ~[?:?]
  at java.base/java.lang.reflect.Method.invoke(Method.java:580) ~[?:?]
  at io.temporal.internal.activity.RootActivityInboundCallsInterceptor$POJOActivityInboundCallsInterceptor.executeActivity(RootActivityInboundCallsInterceptor.java:64) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.internal.activity.RootActivityInboundCallsInterceptor.execute(RootActivityInboundCallsInterceptor.java:43) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.common.interceptors.ActivityInboundCallsInterceptorBase.execute(ActivityInboundCallsInterceptorBase.java:39) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.opentracing.internal.OpenTracingActivityInboundCallsInterceptor.execute(OpenTracingActivityInboundCallsInterceptor.java:78) ~[temporal-opentracing-1.22.3.jar:?]
  at io.temporal.internal.activity.ActivityTaskExecutors$BaseActivityTaskExecutor.execute(ActivityTaskExecutors.java:107) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.internal.activity.ActivityTaskHandlerImpl.handle(ActivityTaskHandlerImpl.java:124) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handleActivity(ActivityWorker.java:278) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:243) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.internal.worker.ActivityWorker$TaskHandlerImpl.handle(ActivityWorker.java:216) ~[temporal-sdk-1.22.3.jar:?]
  at io.temporal.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:105) ~[temporal-sdk-1.22.3.jar:?]
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
  at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
Caused by: io.airbyte.workload.launcher.pods.KubeClientException: Failed to create pod source-file-check-b43bf659-7773-4cf5-b204-8c37bd657c20-0-izuis.
  at io.airbyte.workload.launcher.pods.KubePodClient.launchConnectorWithSidecar(KubePodClient.kt:287) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodClient.launchCheck(KubePodClient.kt:214) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:44) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:24) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.model.Stage.apply(Stage.kt:42) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  ... 53 more
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: [patch]  for kind: [Pod]  with name: [source-file-check-b43bf659-7773-4cf5-b204-8c37bd657c20-0-izuis]  in namespace: [airbyte]  failed.
  at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:159) ~[kubernetes-client-api-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$patch$2(HasMetadataOperation.java:233) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.patch(HasMetadataOperation.java:236) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.patch(HasMetadataOperation.java:251) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.serverSideApply(BaseOperation.java:1179) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.serverSideApply(BaseOperation.java:98) ~[kubernetes-client-6.12.1.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher$create$1.invoke(KubePodLauncher.kt:57) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher$create$1.invoke(KubePodLauncher.kt:52) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher.runKubeCommand$lambda$0(KubePodLauncher.kt:307) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.Functions.lambda$get$0(Functions.java:46) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:187) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112) ~[failsafe-3.3.2.jar:3.3.2]
  at io.airbyte.workload.launcher.pods.KubePodLauncher.runKubeCommand(KubePodLauncher.kt:307) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher.create(KubePodLauncher.kt:52) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodClient.launchConnectorWithSidecar(KubePodClient.kt:284) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodClient.launchCheck(KubePodClient.kt:214) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:44) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:24) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.model.Stage.apply(Stage.kt:42) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  ... 53 more
Caused by: java.io.IOException: timeout
  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:504) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handlePatch(OperationSupport.java:419) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handlePatch(OperationSupport.java:397) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handlePatch(BaseOperation.java:764) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.lambda$patch$2(HasMetadataOperation.java:231) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.patch(HasMetadataOperation.java:236) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.HasMetadataOperation.patch(HasMetadataOperation.java:251) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.serverSideApply(BaseOperation.java:1179) ~[kubernetes-client-6.12.1.jar:?]
  at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.serverSideApply(BaseOperation.java:98) ~[kubernetes-client-6.12.1.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher$create$1.invoke(KubePodLauncher.kt:57) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher$create$1.invoke(KubePodLauncher.kt:52) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher.runKubeCommand$lambda$0(KubePodLauncher.kt:307) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:243) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.Functions.lambda$get$0(Functions.java:46) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:74) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:187) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:376) ~[failsafe-3.3.2.jar:3.3.2]
  at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:112) ~[failsafe-3.3.2.jar:3.3.2]
  at io.airbyte.workload.launcher.pods.KubePodLauncher.runKubeCommand(KubePodLauncher.kt:307) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodLauncher.create(KubePodLauncher.kt:52) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodClient.launchConnectorWithSidecar(KubePodClient.kt:284) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pods.KubePodClient.launchCheck(KubePodClient.kt:214) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:44) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.LaunchPodStage.applyStage(LaunchPodStage.kt:24) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  at io.airbyte.workload.launcher.pipeline.stages.model.Stage.apply(Stage.kt:42) ~[io.airbyte-airbyte-workload-launcher-0.63.18.jar:?]
  ... 53 more
Caused by: java.io.InterruptedIOException: timeout
  at okhttp3.internal.connection.RealCall.timeoutExit(RealCall.kt:398) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.connection.RealCall.callDone(RealCall.kt:360) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.connection.RealCall.noMoreExchanges$okhttp(RealCall.kt:325) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:209) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517) ~[okhttp-4.12.0.jar:?]
  ... 3 more
Caused by: java.io.IOException: Canceled
  at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:72) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201) ~[okhttp-4.12.0.jar:?]
  at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:517) ~[okhttp-4.12.0.jar:?]
  ... 3 more
2024-08-20 09:24:38 INFO i.a.w.l.c.WorkloadApiClient(updateStatusToFailed):54 - Attempting to update workload: 778daa7c-feaf-4db6-96f3-70fd645acc77_b43bf659-7773-4cf5-b204-8c37bd657c20_0_check to FAILED.
2024-08-20 09:24:38 INFO i.a.w.l.p.h.FailureHandler(apply):62 - Pipeline aborted after error for workload: 778daa7c-feaf-4db6-96f3-70fd645acc77_b43bf659-7773-4cf5-b204-8c37bd657c20_0_check.
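For readers skimming the trace: the innermost causes are an okhttp call timeout raised while the fabric8 client waited for the API server to answer the Pod patch (server-side apply). A minimal Python sketch for pulling that "Caused by" chain out of a saved log follows. The function name `cause_chain` and the `SAMPLE` excerpt are illustrative, not part of Airbyte; the excerpt only reuses lines from the log above.

```python
def cause_chain(log_text: str) -> list[str]:
    """Return the text of every 'Caused by:' line, outermost first."""
    prefix = "Caused by: "
    chain = []
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            chain.append(stripped[len(prefix):])
    return chain


# Shortened excerpt of the log above (stack frames mostly omitted).
SAMPLE = """\
Caused by: java.io.IOException: timeout
  at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:504) ~[kubernetes-client-6.12.1.jar:?]
Caused by: java.io.InterruptedIOException: timeout
  at okhttp3.internal.connection.RealCall.timeoutExit(RealCall.kt:398) ~[okhttp-4.12.0.jar:?]
Caused by: java.io.IOException: Canceled
"""

for cause in cause_chain(SAMPLE):
    print(cause)
```

The last entry in the returned list is the root cause, which is usually the most useful line to search for.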

jnatten avatar Aug 20 '24 09:08 jnatten

After some investigation I figured out that the problem goes away if I add a rule to our security group that allows all TCP traffic from the control plane to the worker nodes.

Not sure why it is needed or why it worked without it previously, but this seems to solve the issue consistently for us for now. Is there a specific port that is needed?

jnatten avatar Aug 22 '24 12:08 jnatten

@davinchia can you check this issue?

marcosmarxm avatar Aug 22 '24 17:08 marcosmarxm

@jnatten strange. Does your cluster have special security rules set up? We run Airbyte Cloud on EKS and have never seen this issue.

davinchia avatar Aug 23 '24 04:08 davinchia

Not sure if they are special, but the previous security group setup was something like this:

Worker node -> Cluster: 443
Worker node -> Worker node: 53, 1025-65535
Cluster -> Worker node: 443, 4443, 6443, 8443, 9443, 10250
Worker node -> outside world: all open

I think all of it comes from the Terraform EKS module, but I could be wrong on that.

After allowing all ports from cluster -> worker nodes it started working. Not sure if we need all of them or just some, but I don't think it's an issue for us to keep them open.
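One way to narrow down which port is actually needed, sketched in Python under stated assumptions: the API server reaches into the worker nodes for things such as the kubelet and admission webhooks, so any such endpoint listening on a port outside the earlier allow-list would have been unreachable. This thread does not confirm that as the cause. The `candidates` list below is hypothetical and would be replaced with ports found in your own cluster; the allowed set is copied from the comment above, not verified against AWS.

```python
# Cluster -> worker node ports from the earlier security group setup
# described in this comment (copied as written, not verified).
CLUSTER_TO_NODE_PORTS = {443, 4443, 6443, 8443, 9443, 10250}


def blocked_ports(candidate_ports):
    """Return the candidate ports that the earlier rules did not allow."""
    return sorted(p for p in candidate_ports if p not in CLUSTER_TO_NODE_PORTS)


# Hypothetical candidates: ports that services called by the API server
# listen on in your cluster. 8080 is a made-up example value.
candidates = [443, 9443, 10250, 8080]
print(blocked_ports(candidates))  # prints [8080]
```

Any port this reports is a candidate for a narrower security group rule than opening all TCP ports.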

jnatten avatar Aug 23 '24 07:08 jnatten

Happened to me after trying to upgrade a cluster. Had to helm uninstall and re-install and then it worked fine.

Elsayed91 avatar Sep 16 '24 14:09 Elsayed91

At Airbyte, we seek to be clear about the project priorities and roadmap. This issue has not had any activity for 180 days, suggesting that it's not as critical as others. It's possible it has already been fixed. It is being marked as stale and will be closed in 20 days if there is no activity. To keep it open, please comment to let us know why it is important to you and if it is still reproducible on recent versions of Airbyte.

octavia-squidington-iii avatar Mar 19 '25 09:03 octavia-squidington-iii

This issue was closed because it has been inactive for 20 days since being marked as stale.

octavia-squidington-iii avatar Apr 09 '25 09:04 octavia-squidington-iii