strimzi-kafka-operator
Kafka Connect build not recognized as complete due to pod sidecar container
Describe the bug
Despite the Kafka Connect build container successfully completing, the operator's status check did not recognize the completion because a permanently running sidecar container was still in the running state. Even if that sidecar was changed to terminate when it's the only container left running in a pod, the code in src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaConnectAssemblyOperator.java incorrectly assumes there will only be one container in the pod, so it's possible it will instead check the final status of the sidecar container depending on the order of the list returned from getContainers().
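For illustration only, below is a minimal sketch of a completion check that targets the build container by name instead of assuming a single container, written against the Fabric8 Kubernetes client model classes the operator already uses. The class and method names are hypothetical; this is not the actual Strimzi implementation.

import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;

// Hypothetical helper, not the actual operator code.
public class BuildPodUtil {

    /**
     * Treats the build as complete when the container with the given name has
     * terminated with exit code 0, ignoring any other containers (such as an
     * injected istio-proxy sidecar) that may still be running.
     */
    public static boolean isBuildContainerComplete(Pod pod, String buildContainerName) {
        if (pod == null || pod.getStatus() == null || pod.getStatus().getContainerStatuses() == null) {
            return false;
        }
        for (ContainerStatus cs : pod.getStatus().getContainerStatuses()) {
            if (buildContainerName.equals(cs.getName())
                    && cs.getState() != null
                    && cs.getState().getTerminated() != null
                    && Integer.valueOf(0).equals(cs.getState().getTerminated().getExitCode())) {
                return true;
            }
        }
        return false;
    }
}

A check along these lines is unaffected by the ordering of the container list or by extra sidecar containers in the pod.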
To Reproduce
There's probably an easier way to reproduce this, but for us:
- Setup Istio with automatic sidecar injection
- Create KafkaConnect custom resource w/ build specified (a fuller sketch of the resource is shown after this list):
build:
  output:
    type: docker
    image: <image>
  plugins:
    - name: camel-salesforce
      artifacts:
        - type: jar
          url: https://repo.maven.apache.org/maven2/org/apache/camel/kafkaconnector/camel-salesforce-kafka-connector/0.8.0/camel-salesforce-kafka-connector-0.8.0.jar
- Wait for the status of the build container to be Complete: kubectl get pod -o yaml <build_pod>
- Build gets "stuck", eventually times out, and the operator restarts the build
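For reference, here is a minimal sketch of the full KafkaConnect resource for the "create custom resource" step above. Only the build section comes from this report; the apiVersion, resource name, replica count, and bootstrap address are assumptions (kafka.strimzi.io/v1beta1 matches the 0.21.x release reported below, while newer Strimzi releases use v1beta2).

apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaConnect
metadata:
  name: connect-cluster
spec:
  replicas: 1
  bootstrapServers: my-cluster-kafka-bootstrap:9092   # placeholder address
  build:
    output:
      type: docker
      image: <image>
    plugins:
      - name: camel-salesforce
        artifacts:
          - type: jar
            url: https://repo.maven.apache.org/maven2/org/apache/camel/kafkaconnector/camel-salesforce-kafka-connector/0.8.0/camel-salesforce-kafka-connector-0.8.0.jar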
Expected behavior
The operator continues the Kafka Connect deployment upon successful completion of the build container.
Environment (please complete the following information):
- Strimzi version: 0.21.1
- Installation method: Helm chart
- Kubernetes cluster: Kubernetes 1.18
- Infrastructure: Amazon EKS, Istio
YAML files and logs
Operator logs showing eventual timeout:
2021-03-18 20:16:08 DEBUG PodOperator:107 - Pods events-cluster/connect-cluster-connect-build does not exist, creating it
2021-03-18 20:16:09 DEBUG PodOperator:218 - Pods connect-cluster-connect-build in namespace events-cluster has been created
2021-03-18 20:16:09 DEBUG Util:94 - Waiting for Pods resource connect-cluster-connect-build in namespace events-cluster to get complete
2021-03-18 20:17:55 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace events-cluster...
2021-03-18 20:17:55 DEBUG AbstractOperator:356 - Reconciliation #504(timer) KafkaConnect(events-cluster/connect-cluster): Try to acquire lock lock::events-cluster::KafkaConnect::connect-cluster
2021-03-18 20:18:05 WARN AbstractOperator:379 - Reconciliation #504(timer) KafkaConnect(events-cluster/connect-cluster): Failed to acquire lock lock::events-cluster::KafkaConnect::connect-cluster within 10000ms.
2021-03-18 20:19:55 INFO ClusterOperator:125 - Triggering periodic reconciliation for namespace events-cluster...
2021-03-18 20:19:55 DEBUG AbstractOperator:356 - Reconciliation #505(timer) KafkaConnect(events-cluster/connect-cluster): Try to acquire lock lock::events-cluster::KafkaConnect::connect-cluster
2021-03-18 20:20:05 WARN AbstractOperator:379 - Reconciliation #505(timer) KafkaConnect(events-cluster/connect-cluster): Failed to acquire lock lock::events-cluster::KafkaConnect::connect-cluster within 10000ms.
2021-03-18 20:21:09 ERROR Util:125 - Exceeded timeout of 300000ms while waiting for Pods resource connect-cluster-connect-build in namespace events-cluster to be complete
2021-03-18 20:21:09 ERROR AbstractOperator:240 - Reconciliation #503(timer) KafkaConnect(events-cluster/connect-cluster): createOrUpdate failed
io.strimzi.operator.common.operator.resource.TimeoutException: Exceeded timeout of 300000ms while waiting for Pods resource connect-cluster-connect-build in namespace events-cluster to be complete
Status of build pod showing completion of build container:
status:
  phase: Running
  conditions:
    - type: Initialized
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-03-18T22:09:55Z'
    - type: Ready
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2021-03-18T22:10:40Z'
      reason: ContainersNotReady
      message: >-
        containers with unready status:
        [connect-cluster-connect-build]
    - type: ContainersReady
      status: 'False'
      lastProbeTime: null
      lastTransitionTime: '2021-03-18T22:10:40Z'
      reason: ContainersNotReady
      message: >-
        containers with unready status:
        [connect-cluster-connect-build]
    - type: PodScheduled
      status: 'True'
      lastProbeTime: null
      lastTransitionTime: '2021-03-18T22:09:53Z'
  containerStatuses:
    - name: istio-proxy
      state:
        running:
          startedAt: '2021-03-18T22:09:55Z'
      lastState: {}
      ready: true
      restartCount: 0
      containerID: >-
        docker://991316bf212b7a71dbe95e5fd735069bb23af57f7abba90d754b4589cb7370ee
      started: true
    - name: connect-cluster-connect-build
      state:
        terminated:
          exitCode: 0
          reason: Completed
          message: >
            *********.dkr.ecr.***.amazonaws.com/****/******@sha256:4813e6e629c879d235ce4ee30b8cd62ed99f14a12ecab17ad6d96d8695963a2f
          startedAt: '2021-03-18T22:09:58Z'
          finishedAt: '2021-03-18T22:10:40Z'
          containerID: >-
            docker://998ac6252b6195d519b941a77056aa4a6150bbde357a345bcc578422ac1006e7
      lastState: {}
      ready: false
      restartCount: 0
      image: 'gcr.io/kaniko-project/executor:v1.3.0'
      imageID: >-
        docker-pullable://gcr.io/kaniko-project/executor@sha256:99eec410fa32cd77cdb7685c70f86a96debb8b087e77e63d7fe37eaadb178709
      containerID: >-
        docker://998ac6252b6195d519b941a77056aa4a6150bbde357a345bcc578422ac1006e7
      started: false
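As a quick way to confirm what the operator should be looking at, the terminated state of just the build container can be pulled out with a jsonpath filter. This is only a sketch, reusing the pod and namespace names from the logs above:

kubectl get pod connect-cluster-connect-build -n events-cluster \
  -o jsonpath='{.status.containerStatuses[?(@.name=="connect-cluster-connect-build")].state.terminated}'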
Additional context
As a workaround, I was able to use the build pod template to add an annotation that disables Istio sidecar injection for the pod, but it seems like Strimzi could be more resilient here by explicitly checking for what it cares about (the build container, identified by name/id, being complete).
template:
  buildPod:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
We do not support / expect any injected Istio or other sidecars. And I do not really see the value they would bring. So I'm not sure I see this as a bug.
Ah, I didn't realize that was explicitly not supported. Makes sense, but I expect a lot of production Kubernetes clusters will have some form of sidecar injection for one reason or another. It would be nice if Strimzi were able to function as long as the sidecars don't directly interfere with anything the operator or the pods it creates are trying to do, mostly so users don't have to individually debug scenarios like this (it took me a while to figure out what was going on).
I think this is pretty low priority regardless, given that you can expect there to be some mechanism for disabling sidecar injection in a namespace/pod. Feel free to close; hopefully anyone else with this problem will find their way here.
TBH, the Kafka Connect Build might be the only part where it does not interfere that much. Also, you are the first to report this.
@scholzj the value comes when users are already using a mesh: observability, traffic rules, authentication, TLS, certificate management, etc.
I'm using Strimzi for the first time in a namespace with Istio, and I guess this forces me to deploy Kafka clusters in their own namespaces with their own rules. I lose all the security and observability already in my mesh, as well as the benefits of using CRDs.
Banzai Cloud has a good article about this: https://banzaicloud.com/blog/kafka-on-istio-benefits/
I'm not sure if this is the right issue to discuss it since it talks specifically about the Connect Build pod.
But in general - as I explained to someone else on Slack a few minutes ago - many people do not use Istio and do not want to run Istio just because of Strimzi. I'm sure you have good reasons to use Istio, but I'm sure they have good reasons not to use it as well. That makes supporting Istio fairly complicated, because you need to support and maintain everything outside of Istio plus support and maintain everything inside Istio. So in some areas you basically have two parallel implementations to develop and maintain.
Banzai Cloud made their bet on Istio and AFAIK their Kafka operator runs only on Istio. Strimzi, on the other hand, started outside of Istio and right now runs only outside of Istio. We are not necessarily opposed to supporting Istio as well, but it would mean a lot of effort, time, and commitment which we have not had so far. We simply did/do not have the time for it. So if you are interested in contributing and maintaining the Istio support in the long term, we are more than happy to work with you.
That's fair enough. Maybe this issue is not a bug and more a feature request? But in any case, what you say makes total sense
Triaged on 7th July 2022: Should be fixed.