
[Bug] Incorrectly configured `STRIMZI_KUBERNETES_VERSION` makes the operator stuck

Open scholzj opened this issue 7 months ago • 4 comments

If you configure the STRIMZI_KUBERNETES_VERSION environment variable in the CO to something that is not a map (e.g. STRIMZI_KUBERNETES_VERSION=1.33.0 instead of a valid map such as STRIMZI_KUBERNETES_VERSION="major=1, minor=33"), the operator fails to parse the map and throws an exception that is not handled. But because the futures are not failed properly, it remains stuck in a running state and never terminates.
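For illustration, here is a rough sketch (not Strimzi's actual Util.parseMap implementation, just the general idea) of why a bare version string cannot be parsed as a map:

import java.util.HashMap;
import java.util.Map;

class MapParseSketch {
    // Illustration only: parse "key=value, key=value" pairs the way the env var expects.
    // A bare value like "1.33.0" contains no '=', so it cannot be split into a pair and
    // ends up in the "Failed to parse Map from String" error path seen in the log below.
    static Map<String, String> parseMapLike(String value) {
        Map<String, String> map = new HashMap<>();
        for (String pair : value.split(",")) {
            String[] kv = pair.split("=", 2);
            if (kv.length != 2) {
                throw new RuntimeException("Failed to parse Map from String");
            }
            map.put(kv[0].trim(), kv[1].trim());
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(parseMapLike("major=1, minor=33")); // {major=1, minor=33}
        System.out.println(parseMapLike("1.33.0"));            // throws RuntimeException
    }
}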

The full log from the CO is here:

Auto-detected KUBERNETES_SERVICE_DNS_DOMAIN: cluster.local
+ exec /usr/bin/tini -w -e 143 -- java -Dlog4j2.configurationFile=file:/opt/strimzi/custom-config/log4j2.properties -Dvertx.cacheDirBase=/tmp/vertx-cache -Djava.security.egd=file:/dev/./urandom -XX:+ExitOnOutOfMemoryError -XX:MinRAMPercentage=15 -XX:MaxRAMPercentage=20 -classpath :lib/io.strimzi.cluster-operator-0.46.0.jar:lib/io.strimzi.api-0.46.0.jar:lib/io.fabric8.kubernetes-model-apiextensions-7.1.0.jar:lib/io.strimzi.config-model-0.46.0.jar:lib/io.strimzi.operator-common-0.46.0.jar:lib/io.strimzi.crd-annotations-0.46.0.jar:lib/org.quartz-scheduler.quartz-2.3.2.jar:lib/org.eclipse.jetty.jetty-server-12.0.16.jar:lib/org.eclipse.jetty.jetty-io-12.0.16.jar:lib/org.eclipse.jetty.jetty-util-12.0.16.jar:lib/org.eclipse.jetty.jetty-http-12.0.16.jar:lib/javax.servlet.javax.servlet-api-3.1.0.jar:lib/io.fabric8.openshift-client-7.1.0.jar:lib/io.fabric8.openshift-client-api-7.1.0.jar:lib/io.fabric8.openshift-model-autoscaling-7.1.0.jar:lib/io.fabric8.openshift-model-operator-7.1.0.jar:lib/io.fabric8.openshift-model-operatorhub-7.1.0.jar:lib/io.fabric8.openshift-model-machine-7.1.0.jar:lib/io.fabric8.openshift-model-whereabouts-7.1.0.jar:lib/io.fabric8.openshift-model-monitoring-7.1.0.jar:lib/io.fabric8.openshift-model-storageversionmigrator-7.1.0.jar:lib/io.fabric8.openshift-model-tuned-7.1.0.jar:lib/io.fabric8.openshift-model-console-7.1.0.jar:lib/io.fabric8.openshift-model-config-7.1.0.jar:lib/io.fabric8.openshift-model-machineconfiguration-7.1.0.jar:lib/io.fabric8.openshift-model-miscellaneous-7.1.0.jar:lib/io.fabric8.openshift-model-hive-7.1.0.jar:lib/io.fabric8.openshift-model-installer-7.1.0.jar:lib/io.fabric8.kubernetes-client-7.1.0.jar:lib/io.fabric8.kubernetes-httpclient-jdk-7.1.0.jar:lib/io.fabric8.kubernetes-client-api-7.1.0.jar:lib/io.fabric8.kubernetes-model-gatewayapi-7.1.0.jar:lib/io.fabric8.kubernetes-model-resource-7.1.0.jar:lib/io.fabric8.kubernetes-model-admissionregistration-7.1.0.jar:lib/io.fabric8.kubernetes-model-autoscaling-7.1.0.jar:lib/io.fabric8.kubernetes-model-batch-7.1.0.jar:lib/io.fabric8.kubernetes-model-certificates-7.1.0.jar:lib/io.fabric8.kubernetes-model-discovery-7.1.0.jar:lib/io.fabric8.kubernetes-model-extensions-7.1.0.jar:lib/io.fabric8.kubernetes-model-flowcontrol-7.1.0.jar:lib/io.fabric8.kubernetes-model-metrics-7.1.0.jar:lib/io.fabric8.kubernetes-model-scheduling-7.1.0.jar:lib/io.fabric8.kubernetes-model-node-7.1.0.jar:lib/org.slf4j.slf4j-api-1.7.36.jar:lib/org.snakeyaml.snakeyaml-engine-2.9.jar:lib/com.fasterxml.jackson.datatype.jackson-datatype-jsr310-2.18.3.jar:lib/io.fabric8.kubernetes-model-core-7.1.0.jar:lib/javax.annotation.javax.annotation-api-1.3.2.jar:lib/io.fabric8.kubernetes-model-common-7.1.0.jar:lib/io.fabric8.kubernetes-model-events-7.1.0.jar:lib/io.fabric8.kubernetes-model-policy-7.1.0.jar:lib/io.fabric8.kubernetes-model-rbac-7.1.0.jar:lib/io.fabric8.kubernetes-model-apps-7.1.0.jar:lib/io.fabric8.kubernetes-model-storageclass-7.1.0.jar:lib/io.fabric8.kubernetes-model-networking-7.1.0.jar:lib/io.fabric8.kubernetes-model-coordination-7.1.0.jar:lib/io.fabric8.openshift-model-7.1.0.jar:lib/io.fabric8.zjsonpatch-0.3.0.jar:lib/com.fasterxml.jackson.core.jackson-core-2.18.3.jar:lib/com.fasterxml.jackson.core.jackson-databind-2.18.3.jar:lib/com.fasterxml.jackson.core.jackson-annotations-2.18.3.jar:lib/com.fasterxml.jackson.dataformat.jackson-dataformat-yaml-2.18.3.jar:lib/org.yaml.snakeyaml-2.3.jar:lib/io.vertx.vertx-core-4.5.14.jar:lib/io.netty.netty-common-4.1.118.Final.jar:lib/io.netty.netty-buffer-4.1.118.Final.jar:lib/io.netty.netty-handler-4
.1.118.Final.jar:lib/io.netty.netty-handler-proxy-4.1.118.Final.jar:lib/io.netty.netty-codec-http-4.1.118.Final.jar:lib/io.netty.netty-codec-http2-4.1.118.Final.jar:lib/io.netty.netty-resolver-4.1.118.Final.jar:lib/io.netty.netty-resolver-dns-4.1.118.Final.jar:lib/io.netty.netty-codec-dns-4.1.118.Final.jar:lib/io.strimzi.certificate-manager-0.46.0.jar:lib/org.apache.logging.log4j.log4j-core-2.17.2.jar:lib/org.apache.logging.log4j.log4j-api-2.17.2.jar:lib/org.apache.logging.log4j.log4j-slf4j-impl-2.17.2.jar:lib/io.strimzi.kafka-oauth-server-0.16.2.jar:lib/io.strimzi.kafka-oauth-server-plain-0.16.2.jar:lib/io.strimzi.kafka-oauth-client-0.16.2.jar:lib/io.strimzi.kafka-oauth-common-0.16.2.jar:lib/com.nimbusds.nimbus-jose-jwt-9.37.2.jar:lib/com.github.stephenc.jcip.jcip-annotations-1.0-1.jar:lib/com.jayway.jsonpath.json-path-2.9.0.jar:lib/net.minidev.json-smart-2.5.2.jar:lib/net.minidev.accessors-smart-2.5.2.jar:lib/org.ow2.asm.asm-9.7.1.jar:lib/org.apache.kafka.kafka-clients-4.0.0.jar:lib/com.github.luben.zstd-jni-1.5.6-6.jar:lib/org.lz4.lz4-java-1.8.0.jar:lib/org.xerial.snappy.snappy-java-1.1.10.5.jar:lib/org.apache.kafka.kafka-server-common-4.0.0.jar:lib/com.yammer.metrics.metrics-core-2.2.0.jar:lib/net.sf.jopt-simple.jopt-simple-5.0.4.jar:lib/org.pcollections.pcollections-4.0.1.jar:lib/io.netty.netty-transport-4.1.118.Final.jar:lib/io.netty.netty-transport-native-epoll-4.1.118.Final-linux-x86_64.jar:lib/io.netty.netty-transport-native-unix-common-4.1.118.Final.jar:lib/io.netty.netty-transport-classes-epoll-4.1.118.Final.jar:lib/io.netty.netty-transport-native-epoll-4.1.118.Final-linux-aarch_64.jar:lib/io.micrometer.micrometer-core-1.12.13.jar:lib/io.micrometer.micrometer-commons-1.12.13.jar:lib/io.micrometer.micrometer-observation-1.12.13.jar:lib/org.hdrhistogram.HdrHistogram-2.1.12.jar:lib/org.latencyutils.LatencyUtils-2.0.3.jar:lib/io.micrometer.micrometer-registry-prometheus-1.12.13.jar:lib/io.prometheus.simpleclient_common-0.16.0.jar:lib/io.prometheus.simpleclient-0.16.0.jar:lib/io.prometheus.simpleclient_tracer_otel-0.16.0.jar:lib/io.prometheus.simpleclient_tracer_common-0.16.0.jar:lib/io.prometheus.simpleclient_tracer_otel_agent-0.16.0.jar:lib/io.vertx.vertx-micrometer-metrics-4.5.14.jar:lib/io.netty.netty-codec-4.1.118.Final.jar:lib/io.netty.netty-codec-socks-4.1.118.Final.jar io.strimzi.operator.cluster.Main
2025-04-26 20:19:39 INFO  Main:59 - ClusterOperator 0.46.0 is starting
2025-04-26 20:19:39 INFO  Util:196 - Using config:
	PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
	container: oci
	STRIMZI_DEFAULT_TOPIC_OPERATOR_IMAGE: quay.io/strimzi/operator:0.46.0-rc1
	JAVA_OPTS:  -Dlog4j2.configurationFile=file:/opt/strimzi/custom-config/log4j2.properties -Dvertx.cacheDirBase=/tmp/vertx-cache -Djava.security.egd=file:/dev/./urandom  -XX:+ExitOnOutOfMemoryError -XX:MinRAMPercentage=15 -XX:MaxRAMPercentage=20
	STRIMZI_HOME: /opt/strimzi
	PWD: /opt/strimzi
	STRIMZI_LEADER_ELECTION_IDENTITY: strimzi-cluster-operator-657bcf899-kmqpx
	STRIMZI_KUBERNETES_VERSION: 1.33.0
	JAVA_MAIN: io.strimzi.operator.cluster.Main
	KUBERNETES_PORT_443_TCP: tcp://10.96.0.1:443
	STRIMZI_LEADER_ELECTION_ENABLED: true
	STRIMZI_VERSION: 0.46.0
	STRIMZI_DEFAULT_KANIKO_EXECUTOR_IMAGE: quay.io/strimzi/kaniko-executor:0.46.0-rc1
	STRIMZI_NAMESPACE: myproject
	TINI_SHA256_S390X: 931b70a182af879ca249ae9de87ef68423121b38d235c78997fafc680ceab32d
	STRIMZI_DEFAULT_MAVEN_BUILDER: quay.io/strimzi/maven-builder:0.46.0-rc1
	STRIMZI_FEATURE_GATES:
	TINI_SHA256_AMD64: 93dcc18adc78c65a028a84799ecf8ad40c936fdfc5f2a57b1acda5a8117fa82c
	KUBERNETES_SERVICE_DNS_DOMAIN: cluster.local
	STRIMZI_LEADER_ELECTION_LEASE_NAME: strimzi-cluster-operator
	KUBERNETES_SERVICE_PORT_HTTPS: 443
	SHLVL: 0
	STRIMZI_DEFAULT_KAFKA_BRIDGE_IMAGE: quay.io/strimzi/kafka-bridge:0.32.0
	KUBERNETES_PORT: tcp://10.96.0.1:443
	JAVA_HOME: /usr/lib/jvm/jre-17
	TERM: xterm
	STRIMZI_FULL_RECONCILIATION_INTERVAL_MS: 120000
	KUBERNETES_SERVICE_HOST: 10.96.0.1
	JAVA_CLASSPATH: :lib/io.strimzi.cluster-operator-0.46.0.jar:lib/io.strimzi.api-0.46.0.jar:lib/io.fabric8.kubernetes-model-apiextensions-7.1.0.jar:lib/io.strimzi.config-model-0.46.0.jar:lib/io.strimzi.operator-common-0.46.0.jar:lib/io.strimzi.crd-annotations-0.46.0.jar:lib/org.quartz-scheduler.quartz-2.3.2.jar:lib/org.eclipse.jetty.jetty-server-12.0.16.jar:lib/org.eclipse.jetty.jetty-io-12.0.16.jar:lib/org.eclipse.jetty.jetty-util-12.0.16.jar:lib/org.eclipse.jetty.jetty-http-12.0.16.jar:lib/javax.servlet.javax.servlet-api-3.1.0.jar:lib/io.fabric8.openshift-client-7.1.0.jar:lib/io.fabric8.openshift-client-api-7.1.0.jar:lib/io.fabric8.openshift-model-autoscaling-7.1.0.jar:lib/io.fabric8.openshift-model-operator-7.1.0.jar:lib/io.fabric8.openshift-model-operatorhub-7.1.0.jar:lib/io.fabric8.openshift-model-machine-7.1.0.jar:lib/io.fabric8.openshift-model-whereabouts-7.1.0.jar:lib/io.fabric8.openshift-model-monitoring-7.1.0.jar:lib/io.fabric8.openshift-model-storageversionmigrator-7.1.0.jar:lib/io.fabric8.openshift-model-tuned-7.1.0.jar:lib/io.fabric8.openshift-model-console-7.1.0.jar:lib/io.fabric8.openshift-model-config-7.1.0.jar:lib/io.fabric8.openshift-model-machineconfiguration-7.1.0.jar:lib/io.fabric8.openshift-model-miscellaneous-7.1.0.jar:lib/io.fabric8.openshift-model-hive-7.1.0.jar:lib/io.fabric8.openshift-model-installer-7.1.0.jar:lib/io.fabric8.kubernetes-client-7.1.0.jar:lib/io.fabric8.kubernetes-httpclient-jdk-7.1.0.jar:lib/io.fabric8.kubernetes-client-api-7.1.0.jar:lib/io.fabric8.kubernetes-model-gatewayapi-7.1.0.jar:lib/io.fabric8.kubernetes-model-resource-7.1.0.jar:lib/io.fabric8.kubernetes-model-admissionregistration-7.1.0.jar:lib/io.fabric8.kubernetes-model-autoscaling-7.1.0.jar:lib/io.fabric8.kubernetes-model-batch-7.1.0.jar:lib/io.fabric8.kubernetes-model-certificates-7.1.0.jar:lib/io.fabric8.kubernetes-model-discovery-7.1.0.jar:lib/io.fabric8.kubernetes-model-extensions-7.1.0.jar:lib/io.fabric8.kubernetes-model-flowcontrol-7.1.0.jar:lib/io.fabric8.kubernetes-model-metrics-7.1.0.jar:lib/io.fabric8.kubernetes-model-scheduling-7.1.0.jar:lib/io.fabric8.kubernetes-model-node-7.1.0.jar:lib/org.slf4j.slf4j-api-1.7.36.jar:lib/org.snakeyaml.snakeyaml-engine-2.9.jar:lib/com.fasterxml.jackson.datatype.jackson-datatype-jsr310-2.18.3.jar:lib/io.fabric8.kubernetes-model-core-7.1.0.jar:lib/javax.annotation.javax.annotation-api-1.3.2.jar:lib/io.fabric8.kubernetes-model-common-7.1.0.jar:lib/io.fabric8.kubernetes-model-events-7.1.0.jar:lib/io.fabric8.kubernetes-model-policy-7.1.0.jar:lib/io.fabric8.kubernetes-model-rbac-7.1.0.jar:lib/io.fabric8.kubernetes-model-apps-7.1.0.jar:lib/io.fabric8.kubernetes-model-storageclass-7.1.0.jar:lib/io.fabric8.kubernetes-model-networking-7.1.0.jar:lib/io.fabric8.kubernetes-model-coordination-7.1.0.jar:lib/io.fabric8.openshift-model-7.1.0.jar:lib/io.fabric8.zjsonpatch-0.3.0.jar:lib/com.fasterxml.jackson.core.jackson-core-2.18.3.jar:lib/com.fasterxml.jackson.core.jackson-databind-2.18.3.jar:lib/com.fasterxml.jackson.core.jackson-annotations-2.18.3.jar:lib/com.fasterxml.jackson.dataformat.jackson-dataformat-yaml-2.18.3.jar:lib/org.yaml.snakeyaml-2.3.jar:lib/io.vertx.vertx-core-4.5.14.jar:lib/io.netty.netty-common-4.1.118.Final.jar:lib/io.netty.netty-buffer-4.1.118.Final.jar:lib/io.netty.netty-handler-4.1.118.Final.jar:lib/io.netty.netty-handler-proxy-4.1.118.Final.jar:lib/io.netty.netty-codec-http-4.1.118.Final.jar:lib/io.netty.netty-codec-http2-4.1.118.Final.jar:lib/io.netty.netty-resolver-4.1.118.Final.jar:lib/io.netty.netty-resolver-dns-4.1.118.Final.jar:lib
/io.netty.netty-codec-dns-4.1.118.Final.jar:lib/io.strimzi.certificate-manager-0.46.0.jar:lib/org.apache.logging.log4j.log4j-core-2.17.2.jar:lib/org.apache.logging.log4j.log4j-api-2.17.2.jar:lib/org.apache.logging.log4j.log4j-slf4j-impl-2.17.2.jar:lib/io.strimzi.kafka-oauth-server-0.16.2.jar:lib/io.strimzi.kafka-oauth-server-plain-0.16.2.jar:lib/io.strimzi.kafka-oauth-client-0.16.2.jar:lib/io.strimzi.kafka-oauth-common-0.16.2.jar:lib/com.nimbusds.nimbus-jose-jwt-9.37.2.jar:lib/com.github.stephenc.jcip.jcip-annotations-1.0-1.jar:lib/com.jayway.jsonpath.json-path-2.9.0.jar:lib/net.minidev.json-smart-2.5.2.jar:lib/net.minidev.accessors-smart-2.5.2.jar:lib/org.ow2.asm.asm-9.7.1.jar:lib/org.apache.kafka.kafka-clients-4.0.0.jar:lib/com.github.luben.zstd-jni-1.5.6-6.jar:lib/org.lz4.lz4-java-1.8.0.jar:lib/org.xerial.snappy.snappy-java-1.1.10.5.jar:lib/org.apache.kafka.kafka-server-common-4.0.0.jar:lib/com.yammer.metrics.metrics-core-2.2.0.jar:lib/net.sf.jopt-simple.jopt-simple-5.0.4.jar:lib/org.pcollections.pcollections-4.0.1.jar:lib/io.netty.netty-transport-4.1.118.Final.jar:lib/io.netty.netty-transport-native-epoll-4.1.118.Final-linux-x86_64.jar:lib/io.netty.netty-transport-native-unix-common-4.1.118.Final.jar:lib/io.netty.netty-transport-classes-epoll-4.1.118.Final.jar:lib/io.netty.netty-transport-native-epoll-4.1.118.Final-linux-aarch_64.jar:lib/io.micrometer.micrometer-core-1.12.13.jar:lib/io.micrometer.micrometer-commons-1.12.13.jar:lib/io.micrometer.micrometer-observation-1.12.13.jar:lib/org.hdrhistogram.HdrHistogram-2.1.12.jar:lib/org.latencyutils.LatencyUtils-2.0.3.jar:lib/io.micrometer.micrometer-registry-prometheus-1.12.13.jar:lib/io.prometheus.simpleclient_common-0.16.0.jar:lib/io.prometheus.simpleclient-0.16.0.jar:lib/io.prometheus.simpleclient_tracer_otel-0.16.0.jar:lib/io.prometheus.simpleclient_tracer_common-0.16.0.jar:lib/io.prometheus.simpleclient_tracer_otel_agent-0.16.0.jar:lib/io.vertx.vertx-micrometer-metrics-4.5.14.jar:lib/io.netty.netty-codec-4.1.118.Final.jar:lib/io.netty.netty-codec-socks-4.1.118.Final.jar
	STRIMZI_DEFAULT_CRUISE_CONTROL_IMAGE: quay.io/strimzi/kafka:0.46.0-rc1-kafka-4.0.0
	TINI_VERSION: v0.19.0
	STRIMZI_OPERATION_TIMEOUT_MS: 300000
	KUBERNETES_PORT_443_TCP_ADDR: 10.96.0.1
	STRIMZI_LEADER_ELECTION_LEASE_NAMESPACE: myproject
	STRIMZI_KAFKA_MIRROR_MAKER_2_IMAGES: 3.9.0=quay.io/strimzi/kafka:0.46.0-rc1-kafka-3.9.0
4.0.0=quay.io/strimzi/kafka:0.46.0-rc1-kafka-4.0.0

	KUBERNETES_PORT_443_TCP_PROTO: tcp
	STRIMZI_DEFAULT_USER_OPERATOR_IMAGE: quay.io/strimzi/operator:0.46.0-rc1
	KUBERNETES_SERVICE_PORT: 443
	STRIMZI_DEFAULT_KAFKA_EXPORTER_IMAGE: quay.io/strimzi/kafka:0.46.0-rc1-kafka-4.0.0
	TINI_SHA256_ARM64: 07952557df20bfd2a95f9bef198b445e006171969499a1d361bd9e6f8e5e0e81
	STRIMZI_DEFAULT_KAFKA_INIT_IMAGE: quay.io/strimzi/operator:0.46.0-rc1
	TINI_SHA256_PPC64LE: 3f658420974768e40810001a038c29d003728c5fe86da211cff5059e48cfdfde
	HOSTNAME: strimzi-cluster-operator-657bcf899-kmqpx
	STRIMZI_KAFKA_CONNECT_IMAGES: 3.9.0=quay.io/strimzi/kafka:0.46.0-rc1-kafka-3.9.0
4.0.0=quay.io/strimzi/kafka:0.46.0-rc1-kafka-4.0.0

	STRIMZI_KAFKA_IMAGES: 3.9.0=quay.io/strimzi/kafka:0.46.0-rc1-kafka-3.9.0
4.0.0=quay.io/strimzi/kafka:0.46.0-rc1-kafka-4.0.0

	KUBERNETES_PORT_443_TCP_PORT: 443
	STRIMZI_OPERATOR_NAME: strimzi-cluster-operator-657bcf899-kmqpx
	HOME: /home/strimzi
	STRIMZI_OPERATOR_NAMESPACE: myproject
	MALLOC_ARENA_MAX: 2

2025-04-26 20:19:40 INFO  Main:62 - Cluster Operator configuration is ClusterOperatorConfig{
	namespaces='[myproject]'
	reconciliationIntervalMs=120000
	operationTimeoutMs=300000
	connectBuildTimeoutMs=300000
	networkPolicyGeneration=true
	versions='versions{3.9.0={metadata: 3.9 kafka-image: quay.io/strimzi/kafka:0.46.0-rc1-kafka-3.9.0 connect-image: quay.io/strimzi/kafka:0.46.0-rc1-kafka-3.9.0 mirrormaker2-image: quay.io/strimzi/kafka:0.46.0-rc1-kafka-3.9.0}, 4.0.0={metadata: 4.0 kafka-image: quay.io/strimzi/kafka:0.46.0-rc1-kafka-4.0.0 connect-image: quay.io/strimzi/kafka:0.46.0-rc1-kafka-4.0.0 mirrormaker2-image: quay.io/strimzi/kafka:0.46.0-rc1-kafka-4.0.0}}'
	imagePullPolicy='null'
	imagePullSecrets='null'
	operatorNamespace='myproject'
	operatorNamespaceLabels='null'
	customResourceSelector='null'
	featureGates='FeatureGates(DummyFeatureGate=false)'
	dnsCacheTtlSec=30
	podSetReconciliationOnly=false
	podSetControllerWorkQueueSize=1024
	operatorName='strimzi-cluster-operator-657bcf899-kmqpx'
	podSecurityProviderClass='io.strimzi.plugin.security.profiles.impl.BaselinePodSecurityProvider'
	leaderElectionConfig='LeaderElectionConfig{leaseName='strimzi-cluster-operator', namespace='myproject', identity='strimzi-cluster-operator-657bcf899-kmqpx', leaseDuration=PT15S, renewDeadline=PT10S, retryPeriod=PT2S}'
	podDisruptionBudgetGeneration=true}
2025-04-26 20:19:42 INFO  Main:274 - Health and metrics server is ready on port 8080)
2025-04-26 20:19:42 INFO  Main:236 - Waiting to become a leader
2025-04-26 20:19:42 INFO  LeaderElectionManager:69 - Starting the Leader Elector
2025-04-26 20:19:44 INFO  LeaderElectionManager:121 - The new leader is strimzi-cluster-operator-657bcf899-nbpx6
2025-04-26 20:19:44 INFO  LeaderElectionManager:121 - The new leader is strimzi-cluster-operator-657bcf899-kmqpx
2025-04-26 20:19:44 INFO  LeaderElectionManager:103 - Started being a leader
2025-04-26 20:19:44 INFO  Main:218 - I'm the new leader
2025-04-26 20:19:44 ERROR Main:90 - Unable to start operator for 1 or more namespace
java.lang.RuntimeException: Failed to parse Map from String
	at io.strimzi.operator.common.Util.parseMap(Util.java:104) ~[io.strimzi.operator-common-0.46.0.jar:0.46.0]
	at io.strimzi.operator.cluster.PlatformFeaturesAvailability.parseVersionInfo(PlatformFeaturesAvailability.java:112) ~[io.strimzi.cluster-operator-0.46.0.jar:0.46.0]
	at io.strimzi.operator.cluster.PlatformFeaturesAvailability.getVersionInfo(PlatformFeaturesAvailability.java:100) ~[io.strimzi.cluster-operator-0.46.0.jar:0.46.0]
	at io.strimzi.operator.cluster.PlatformFeaturesAvailability.create(PlatformFeaturesAvailability.java:46) ~[io.strimzi.cluster-operator-0.46.0.jar:0.46.0]
	at io.strimzi.operator.cluster.Main.createPlatformFeaturesAvailability(Main.java:108) ~[io.strimzi.cluster-operator-0.46.0.jar:0.46.0]
	at io.strimzi.operator.cluster.Main.lambda$main$2(Main.java:86) ~[io.strimzi.cluster-operator-0.46.0.jar:0.46.0]
	at io.vertx.core.impl.future.Composition.onSuccess(Composition.java:38) ~[io.vertx.vertx-core-4.5.14.jar:4.5.14]
	at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:66) ~[io.vertx.vertx-core-4.5.14.jar:4.5.14]
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:259) ~[io.vertx.vertx-core-4.5.14.jar:4.5.14]
	at io.vertx.core.impl.future.Composition$1.onSuccess(Composition.java:62) ~[io.vertx.vertx-core-4.5.14.jar:4.5.14]
	at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:66) ~[io.vertx.vertx-core-4.5.14.jar:4.5.14]
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:259) ~[io.vertx.vertx-core-4.5.14.jar:4.5.14]
	at io.vertx.core.Promise.tryComplete(Promise.java:121) ~[io.vertx.vertx-core-4.5.14.jar:4.5.14]
	at io.vertx.core.Promise.complete(Promise.java:77) ~[io.vertx.vertx-core-4.5.14.jar:4.5.14]
	at io.strimzi.operator.cluster.Main.lambda$leaderElection$8(Main.java:219) ~[io.strimzi.cluster-operator-0.46.0.jar:0.46.0]
	at io.strimzi.operator.cluster.leaderelection.LeaderElectionManager.startLeadershipHandler(LeaderElectionManager.java:104) ~[io.strimzi.cluster-operator-0.46.0.jar:0.46.0]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks.onStartLeading(LeaderCallbacks.java:34) ~[io.fabric8.kubernetes-client-api-7.1.0.jar:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.updateObserved(LeaderElector.java:266) ~[io.fabric8.kubernetes-client-api-7.1.0.jar:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.tryAcquireOrRenew(LeaderElector.java:248) ~[io.fabric8.kubernetes-client-api-7.1.0.jar:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$acquire$4(LeaderElector.java:177) ~[io.fabric8.kubernetes-client-api-7.1.0.jar:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$loop$8(LeaderElector.java:292) ~[io.fabric8.kubernetes-client-api-7.1.0.jar:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
2025-04-26 20:19:44 INFO  ShutdownHook:35 - Shutdown hook started
2025-04-26 20:19:44 INFO  LeaderElectionManager:82 - Stopping the Leader Elector

The same seems to happen when we catch the ParseException in the code. We do not fail the future when we catch it either, so if it fails to parse the date (which is what throws it - e.g. STRIMZI_KUBERNETES_VERSION="major=1, minor=33, buildDate=not-a-build-date"), it also remains stuck and never terminates.

But it does start the shutdown hook regardless of the future not being failed. So I'm not sure it is as simple as failing the future. The problem might be that the shutdown hook gets stuck when things fail too early?

scholzj avatar Apr 26 '25 21:04 scholzj

As the last log is about the leader elector, could it be something related to it not being able to stop properly for some reason? Just guessing.

ppatierno avatar Apr 30 '25 11:04 ppatierno

As the last log is about the leader elector, could it be something related to it not being able to stop properly for some reason? Just guessing.

I don't really know. It seems to get stuck in the shutdown, so it is seemingly unrelated to the version decoding. That said, I did not observe it when the Kube client failed to decode the version, only in the branch where we fail to decode the version from the environment variable :-/

scholzj avatar Apr 30 '25 14:04 scholzj

Triaged on 29.5.2025: We should investigate where the operator gets stuck and we should fix it.

im-konge avatar May 29 '25 08:05 im-konge

I just got this on k3s. After setting extraEnv, the operator was able to start: STRIMZI_KUBERNETES_VERSION="major=1, minor=33"

XBeg9 avatar May 30 '25 21:05 XBeg9

Could you assign it to me, @scholzj / @im-konge? Let me try to investigate it deeply and likely open a PR with the fix.

rlanhellas avatar Jul 01 '25 23:07 rlanhellas

Assigned to you @rlanhellas, thanks. If you want some help, please let us know :)

im-konge avatar Jul 02 '25 08:07 im-konge

@scholzj the problem happens because the LeaderElection can't be stopped with cancel(). Looking at the LeaderElection logic inside the library, it keeps retrying to renew the leader election forever, and there is no check whether the CompletableFuture is cancelled, so cancel() has no effect at all and we get stuck waiting for cancel() to return, which never happens.

The question is: do we really need to gracefully shut down the leader election? IMO we don't; after 15s (the lease time) the other pod will automatically get the leadership, so we don't need to release, cancel, or stop it manually.

My suggestion is to remove the stop() method from the ShutdownHook; I tested it and it worked. An important thing to consider here: any ShutdownHook trigger will hit the same stuck problem, not just this specific version parse problem.

rlanhellas avatar Jul 03 '25 01:07 rlanhellas

We do prefer a clean shutdown, among other things because a clean shutdown should allow for faster handover to the backup replicas.

scholzj avatar Jul 03 '25 08:07 scholzj

@scholzj I see your point. It seems like it's a problem with the current kubernetes-client, but I will investigate a bit more whether there is a way to keep the graceful shutdown for the leader election.

rlanhellas avatar Jul 03 '25 14:07 rlanhellas

@scholzj I see your point. It seems like it's a problem with the current kubernetes-client, but I will investigate a bit more whether there is a way to keep the graceful shutdown for the leader election.

It could be an issue in the client, in which case we would need to report (and possibly fix) it there. It might also be an issue in how we use it, in which case we would need to change that on our side. Ultimately, that needs to be investigated as part of this issue.

scholzj avatar Jul 04 '25 06:07 scholzj

@scholzj after some more investigation I was able to identify the real problem: a deadlock. Let me go through it step by step.

1. When you call leaderElection.start(), a new thread is created; let's call it LE-Thread. It holds a lock on the LeaderElector object, since it has synchronized methods.
2. This LE-Thread emits an event which triggers our callback.
3. From now on we are on LE-Thread, and we call the version parsing, which fails and forces us into onComplete().
4. Inside onComplete() we call System.exit(1) when it fails (our case). The way System.exit works internally is that it creates a new thread to execute each shutdown hook; let's call it Exit-LE-Thread.

We have 2 threads now: LE-Thread and Exit-LE-Thread. Exit-LE-Thread will run the hook that calls cancel() on the leaderElection, but this operation requires the lock on the LeaderElector object, and remember that LE-Thread holds that lock. Here is the trace where Exit-LE-Thread is waiting for LE-Thread to release the lock:

-1563528849-pool-3-thread-1@6284 is the LE-Thread

 waiting for -1563528849-pool-3-thread-1@6284 to release lock on instance of io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector(id=6246)

Then we have a deadlock here:

LE Thread --> waits for --> Exit-LE Thread --> which waits for --> LE Thread to release the lock.
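For illustration, a minimal standalone sketch (class, lock, and thread names are made up for this example) that reproduces the same hang outside of Strimzi: a shutdown hook needs a lock that is still held by the thread calling System.exit(), while the JVM's shutdown sequence joins the hook thread:

public class ShutdownDeadlockDemo {
    // Stands in for the LeaderElector instance whose synchronized methods guard the lock
    private static final Object elector = new Object();

    public static void main(String[] args) throws InterruptedException {
        // Plays the role of Exit-LE-Thread: the hook needs the elector lock (like cancel())
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            synchronized (elector) {
                System.out.println("Shutdown hook acquired the elector lock");
            }
        }, "Exit-LE-Thread"));

        // Plays the role of LE-Thread: it holds the elector lock (synchronized callback path)
        // and calls System.exit(1); the JVM then runs and joins the hooks, but the hook
        // waits for the lock held by this thread, so the process hangs forever
        Thread leThread = new Thread(() -> {
            synchronized (elector) {
                System.exit(1);
            }
        }, "LE-Thread");
        leThread.start();
        leThread.join(); // never returns: this demo deliberately reproduces the deadlock
    }
}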

How to fix this deadlock? Execute the System.exit() outside LE-Thread, so the lock is released automatically since the flow continues and we don't get stuck on it. Here is the suggestion:

LE Thread --> creates a new Thread (without waiting for it) --> new Thread waits for --> Exit-LE Thread --> which waits for --> LE Thread to release the lock.

.onComplete(res -> {
    if (res.failed()) {
        LOGGER.error("Unable to start operator for 1 or more namespace", res.cause());
        new Thread(() -> System.exit(1), "SystemExit").start();
    }
});

rlanhellas avatar Jul 04 '25 17:07 rlanhellas

@scholzj let me know if you have concerns regarding this solution; if not, I will open a small PR to fix it.

rlanhellas avatar Jul 07 '25 14:07 rlanhellas

Why do we run in the Leader Election thread when we call leaderElection.start()? Should we instead fix that (so that the operator and LE threads are not mixed)?

scholzj avatar Jul 08 '25 06:07 scholzj

CC @ppatierno @katheris

scholzj avatar Jul 08 '25 06:07 scholzj

@scholzj, when you call leaderElection.start(), the kubernetes-client library internally spins up a new thread; this thread is responsible for renewing the leader election periodically.

When we become the new leader, this thread triggers a callback, and that is where the Leader Election Thread comes in.

This LE-Thread is waiting for us to return from the callback method, but we never do it when we call System.exit(1) because we call Thread.join() on LE-Thread, so it causes a deadlock.

rlanhellas avatar Jul 08 '25 13:07 rlanhellas

Here is a diagram showing the problem...

flowchart LR
      A[Operator]-- (1) starts -->B[leaderElectionStart]
      B[leaderElectionStart]-- (2) spins up -->C[LE Thread]
      C[LE Thread]-- (3) Waits Callback Return -->A[Operator]
      A[Operator]-- (4) -->D[SystemExit]
      D[SystemExit]-- (5) Thread Join -->C[LE Thread]

My suggestion...

flowchart LR
      A[Operator]-- (1) starts -->B[leaderElectionStart]
      B[leaderElectionStart]-- (2) spins up -->C[LE Thread]
      C[LE Thread]-- (3) Waits Callback Return -->A[Operator]
      A[Operator]-- (4) spins up -->D[Exit Thread]
      A[Operator]-- (4) returns -->C[LE Thread]
      D[Exit Thread]-- (5) -->E[SystemExit]

rlanhellas avatar Jul 08 '25 13:07 rlanhellas

@rlanhellas I've had a look through, but I'm not following this part:

This LE-Thread is waiting for us to return from the callback method, but we never do it when we call System.exit(1) because we call Thread.join() on LE-Thread, so it causes a deadlock.

We create a vert.x future for the leader election, but we don't move on to the call to createPlatformFeaturesAvailability until that callback completes and we succeed the future. So I don't follow why that makes the LE thread hang. In terms of the System.exit, you're referring to the one on line 91 of Main.java right? Not the one on line 227?

katheris avatar Jul 10 '25 15:07 katheris

@rlanhellas Do I get it right that you are suggesting that this line is causing the issue and we should call it from a separate thread? https://github.com/strimzi/strimzi-kafka-operator/blob/a53e3faa50075fff997e936676b084c866b14a47/cluster-operator/src/main/java/io/strimzi/operator/cluster/leaderelection/LeaderElectionManager.java#L89

scholzj avatar Jul 10 '25 20:07 scholzj

@rlanhellas I've had a look through, but I'm not following this part:

This LE-Thread is waiting for us to return from the callback method, but we never do it when we call System.exit(1) because we call Thread.join() on LE-Thread, so it causes a deadlock.

We create a vert.x future for the leader election, but we don't move on to the call to createPlatformFeaturesAvailability until that callback completes and we succeed the future. So I don't follow why that makes the LE thread hang. In terms of the System.exit, you're referring to the one on line 91 of Main.java right? Not the one on line 227?

Yes @katheris, I'm referring to line 91. About your point ... when you receive the callback and call promise.complete you're already on the LE thread; it was called by the async thread created inside the Kubernetes Client lib. You're right, we don't move forward to createPlatformFeaturesAvailability until the future completes, but that future is completed from within the LE thread. Here is what happens:

1. LE thread starts in the Kubernetes Client lib.
2. LE thread calls the leader election callback (our operator).
3. Strimzi Operator calls promise.complete.
4. It goes on to createPlatformFeaturesAvailability, which fails.
5. It calls System.exit(1) and hangs here because it calls LEThread.join().

rlanhellas avatar Jul 11 '25 01:07 rlanhellas

@rlanhellas Do I get it right that you are suggesting that this line is causing the issue and we should call it from a separate thread?

strimzi-kafka-operator/cluster-operator/src/main/java/io/strimzi/operator/cluster/leaderelection/LeaderElectionManager.java

Line 89 in a53e3fa

leaderElectorFuture.join();

@scholzj No, it happens inside the JDK, exactly at this point: https://github.com/openjdk/jdk/blob/jdk-17%2B33/src/java.base/share/classes/java/lang/ApplicationShutdownHooks.java#L107.

When you call System.exit(1), internally it wraps each hook in a Thread and calls hook.join(), and that causes the deadlock in our case, because remember that System.exit is called by LEThread and our hook is waiting for LEThread to release the lock, exactly at this line.

My suggestion is to call System.exit(1) in a different thread than LEThread, i.e. wrap this call in a separate thread.

rlanhellas avatar Jul 11 '25 01:07 rlanhellas

So ... I am trying to understand ... We are in the main operator thread (the Vert.x main event loop actually, right?) and it calls leaderElection as part of a chain of composed Vert.x futures which run on their own Vert.x threads. I got that the callback within the leaderElection is called on the LE-Thread and it completes the leader future, allowing the Vert.x composition to move forward with the next step ... calling createPlatformFeaturesAvailability (which is the method failing). The callback is done on the LE-Thread and not on the Vert.x event loop thread, so even onComplete is called on that thread and System.exit is called there as well, while they should be called outside of the LE-Thread.

@rlanhellas would you mind trying to add some logging to trace Thread.currentThread() so we have evidence of which thread things are called from? Because if the callback is called from the LE-Thread we could try to call leader.complete(); within the callback on the Vert.x event loop with a runOnContext. But I would need the evidence of where things are called via the above additional tracing.
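For example, something along these lines (class, logger and message text are placeholders, not the actual Main.java code):

import io.vertx.core.Vertx;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

class ThreadTraceSketch {
    private static final Logger LOGGER = LogManager.getLogger(ThreadTraceSketch.class);

    // Hypothetical helper: log which thread (and which Vert.x context, if any) a step runs on,
    // e.g. called right before leader.complete() in the leadership callback.
    static void traceThread(String step) {
        LOGGER.info("{} - Thread: {}, Vert.x context: {}",
                step, Thread.currentThread().getName(), Vertx.currentContext());
    }
}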

ppatierno avatar Jul 11 '25 08:07 ppatierno

@ppatierno, your understanding of the problem is correct.

I added some logs with the prefix [RONALDO] showing the current thread for each step, here they are:

2025-07-11 11:38:53 INFO  Main:86 - [RONALDO] Before calling startHealthServer. Thread: main
2025-07-11 11:38:53 INFO  Main:283 - Health and metrics server is ready on port 8080
2025-07-11 11:38:53 INFO  Main:244 - Waiting to become a leader
2025-07-11 11:38:53 INFO  Main:245 - [RONALDO] Before calling leaderElection.start(). Thread: vert.x-eventloop-thread-0
2025-07-11 11:38:53 INFO  LeaderElectionManager:69 - Starting the Leader Elector
2025-07-11 11:38:53 INFO  LeaderElectionManager:121 - The new leader is le2
2025-07-11 11:38:53 INFO  LeaderElectionManager:103 - Started being a leader
2025-07-11 11:38:53 INFO  Main:225 - I'm the new leader
2025-07-11 11:38:53 INFO  Main:111 - [RONALDO] Before Promise.promise() on createPlatformFeaturesAvailability. Thread: -1292784864-pool-3-thread-1
2025-07-11 11:38:53 INFO  Main:93 - [RONALDO] Before calling System.exit(1). Thread: -1292784864-pool-3-thread-1
2025-07-11 11:38:53 ERROR Main:94 - Unable to start operator for 1 or more namespace
java.lang.RuntimeException: Failed to parse Map from String
	at io.strimzi.operator.common.Util.parseMap(Util.java:104) ~[classes/:?]
	at io.strimzi.operator.cluster.PlatformFeaturesAvailability.parseVersionInfo(PlatformFeaturesAvailability.java:112) ~[classes/:?]
	at io.strimzi.operator.cluster.PlatformFeaturesAvailability.getVersionInfo(PlatformFeaturesAvailability.java:100) ~[classes/:?]
	at io.strimzi.operator.cluster.PlatformFeaturesAvailability.create(PlatformFeaturesAvailability.java:46) ~[classes/:?]
	at io.strimzi.operator.cluster.Main.createPlatformFeaturesAvailability(Main.java:114) ~[classes/:?]
	at io.strimzi.operator.cluster.Main.lambda$main$2(Main.java:89) ~[classes/:?]
	at io.vertx.core.impl.future.Composition.complete(Composition.java:40) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:169) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.Promise.complete(Promise.java:87) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.strimzi.operator.cluster.Main.lambda$leaderElection$8(Main.java:226) ~[classes/:?]
	at io.strimzi.operator.cluster.leaderelection.LeaderElectionManager.startLeadershipHandler(LeaderElectionManager.java:104) ~[classes/:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks.onStartLeading(LeaderCallbacks.java:34) ~[kubernetes-client-api-7.3.1.jar:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.updateObserved(LeaderElector.java:266) ~[kubernetes-client-api-7.3.1.jar:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.tryAcquireOrRenew(LeaderElector.java:235) ~[kubernetes-client-api-7.3.1.jar:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$acquire$4(LeaderElector.java:177) ~[kubernetes-client-api-7.3.1.jar:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$loop$8(LeaderElector.java:292) ~[kubernetes-client-api-7.3.1.jar:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
2025-07-11 11:38:53 INFO  ShutdownHook:35 - Shutdown hook started
2025-07-11 11:38:53 INFO  LeaderElectionManager:82 - Stopping the Leader Elector

Now, putting the System.exit(1) in a new thread, here are the logs:

2025-07-11 11:40:52 INFO  Main:86 - [RONALDO] Before calling startHealthServer. Thread: main
2025-07-11 11:40:52 INFO  Main:283 - Health and metrics server is ready on port 8080
2025-07-11 11:40:52 INFO  Main:244 - Waiting to become a leader
2025-07-11 11:40:52 INFO  Main:245 - [RONALDO] Before calling leaderElection.start(). Thread: vert.x-eventloop-thread-0
2025-07-11 11:40:52 INFO  LeaderElectionManager:69 - Starting the Leader Elector
2025-07-11 11:40:52 INFO  LeaderElectionManager:121 - The new leader is le2
2025-07-11 11:40:52 INFO  LeaderElectionManager:103 - Started being a leader
2025-07-11 11:40:52 INFO  Main:225 - I'm the new leader
2025-07-11 11:40:52 INFO  Main:111 - [RONALDO] Before Promise.promise() on createPlatformFeaturesAvailability. Thread: -2020602315-pool-3-thread-1
2025-07-11 11:40:52 INFO  Main:93 - [RONALDO] Before calling System.exit(1). Thread: -2020602315-pool-3-thread-1
2025-07-11 11:40:52 ERROR Main:94 - Unable to start operator for 1 or more namespace
java.lang.RuntimeException: Failed to parse Map from String
	at io.strimzi.operator.common.Util.parseMap(Util.java:104) ~[classes/:?]
	at io.strimzi.operator.cluster.PlatformFeaturesAvailability.parseVersionInfo(PlatformFeaturesAvailability.java:112) ~[classes/:?]
	at io.strimzi.operator.cluster.PlatformFeaturesAvailability.getVersionInfo(PlatformFeaturesAvailability.java:100) ~[classes/:?]
	at io.strimzi.operator.cluster.PlatformFeaturesAvailability.create(PlatformFeaturesAvailability.java:46) ~[classes/:?]
	at io.strimzi.operator.cluster.Main.createPlatformFeaturesAvailability(Main.java:114) ~[classes/:?]
	at io.strimzi.operator.cluster.Main.lambda$main$2(Main.java:89) ~[classes/:?]
	at io.vertx.core.impl.future.Composition.complete(Composition.java:40) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:169) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.Promise.complete(Promise.java:87) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.strimzi.operator.cluster.Main.lambda$leaderElection$9(Main.java:226) ~[classes/:?]
	at io.strimzi.operator.cluster.leaderelection.LeaderElectionManager.startLeadershipHandler(LeaderElectionManager.java:104) ~[classes/:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderCallbacks.onStartLeading(LeaderCallbacks.java:34) ~[kubernetes-client-api-7.3.1.jar:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.updateObserved(LeaderElector.java:266) ~[kubernetes-client-api-7.3.1.jar:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.tryAcquireOrRenew(LeaderElector.java:235) ~[kubernetes-client-api-7.3.1.jar:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$acquire$4(LeaderElector.java:177) ~[kubernetes-client-api-7.3.1.jar:?]
	at io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$loop$8(LeaderElector.java:292) ~[kubernetes-client-api-7.3.1.jar:?]
	at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
2025-07-11 11:40:52 INFO  Main:227 - [RONALDO] After leader.complete() in callback. Thread: -2020602315-pool-3-thread-1
2025-07-11 11:40:52 INFO  ShutdownHook:35 - Shutdown hook started
2025-07-11 11:40:52 INFO  LeaderElectionManager:82 - Stopping the Leader Elector
2025-07-11 11:40:52 INFO  LeaderElectionManager:121 - The new leader is 
2025-07-11 11:40:52 INFO  LeaderElectionManager:112 - Stopped being a leader
2025-07-11 11:40:52 INFO  Main:237 - Stopped being a leader during a shutdown
2025-07-11 11:40:52 INFO  LeaderElectionManager:86 - Waiting for Leader Elector to stop
2025-07-11 11:40:52 INFO  LeaderElectionManager:94 - Leader Elector stopped
2025-07-11 11:40:52 INFO  ShutdownHook:60 - Shutting down Vert.x
2025-07-11 11:40:52 INFO  ShutdownHook:79 - Shutdown of Vert.x is complete
2025-07-11 11:40:52 INFO  ShutdownHook:41 - Shutdown hook completed

Pay attention to the following point ... Adding the System.exit to a new thread forces the callback to return, and now we have the log after the leader.complete():

2025-07-11 11:40:52 INFO  Main:227 - [RONALDO] After leader.complete() in callback. Thread: -2020602315-pool-3-thread-1

rlanhellas avatar Jul 11 '25 14:07 rlanhellas

So what's wrong for me is that the compose chain is called on the LE-thread (in this case -2020602315-pool-3-thread-1). I would expect the next createPlatformFeaturesAvailability to be called from the Vert.x event loop. When you create a Promise in Vert.x, it should capture the right Vert.x context, and we should be on the Vert.x event loop at the leader promise creation, as I understand from here:

2025-07-11 11:40:52 INFO  Main:245 - [RONALDO] Before calling leaderElection.start(). Thread: vert.x-eventloop-thread-0

It is not clear why, on leader.complete() (even if called from the LE-thread, which is right), the next compose is not running on the Vert.x event loop via the Vert.x context captured on Promise creation. Can you try to add some more logging tracking Vertx.currentContext() together with the thread information and get the log? After that, can you force the complete to be called on the Vert.x context with:

vertx.runOnContext(v -> {
    leader.complete();
});

and get the same log to see the differences (and of course check if it works)? NOTE: no changes for System.exit(1), don't use a different thread for it.

ppatierno avatar Jul 11 '25 15:07 ppatierno

@rlanhellas sorry, the leaderElection method doesn't have the vertx instance, but I guess you can do something like this:

Promise<Void> leader = Promise.promise();
// capture the correct Vert.x context
Context context = Vertx.currentContext();
...
// wrap the promise completion to run in the right Vert.x context so that all subsequent compose should be on the event loop
context.runOnContext(v -> {
    leader.complete();
});

ppatierno avatar Jul 11 '25 15:07 ppatierno

Now the next compose() runs on the Vert.x event loop. It got stuck too, but since it has a timeout it closes after some time. Here are the logs:

2025-07-11 12:41:06 INFO  Main:87 - [RONALDO] Before calling startHealthServer. Thread: main
2025-07-11 12:41:06 INFO  Main:287 - Health and metrics server is ready on port 8080
2025-07-11 12:41:06 INFO  Main:248 - Waiting to become a leader
2025-07-11 12:41:06 INFO  Main:249 - [RONALDO] Before calling leaderElection.start(). Thread: vert.x-eventloop-thread-0
2025-07-11 12:41:06 INFO  LeaderElectionManager:69 - Starting the Leader Elector
2025-07-11 12:41:06 INFO  LeaderElectionManager:121 - The new leader is 
2025-07-11 12:41:06 INFO  LeaderElectionManager:121 - The new leader is le2
2025-07-11 12:41:06 INFO  LeaderElectionManager:103 - Started being a leader
2025-07-11 12:41:06 INFO  Main:228 - I'm the new leader
2025-07-11 12:41:06 INFO  Main:231 - [RONALDO] After leader.complete() in callback. Thread: -221145867-pool-3-thread-1
2025-07-11 12:41:06 INFO  Main:112 - [RONALDO] Before Promise.promise() on createPlatformFeaturesAvailability. Thread: vert.x-eventloop-thread-0
2025-07-11 12:41:06 INFO  Main:94 - [RONALDO] Before calling System.exit(1). Thread: vert.x-eventloop-thread-0
2025-07-11 12:41:06 ERROR Main:95 - Unable to start operator for 1 or more namespace
java.lang.RuntimeException: Failed to parse Map from String
	at io.strimzi.operator.common.Util.parseMap(Util.java:104) ~[classes/:?]
	at io.strimzi.operator.cluster.PlatformFeaturesAvailability.parseVersionInfo(PlatformFeaturesAvailability.java:112) ~[classes/:?]
	at io.strimzi.operator.cluster.PlatformFeaturesAvailability.getVersionInfo(PlatformFeaturesAvailability.java:100) ~[classes/:?]
	at io.strimzi.operator.cluster.PlatformFeaturesAvailability.create(PlatformFeaturesAvailability.java:46) ~[classes/:?]
	at io.strimzi.operator.cluster.Main.createPlatformFeaturesAvailability(Main.java:115) ~[classes/:?]
	at io.strimzi.operator.cluster.Main.lambda$main$2(Main.java:90) ~[classes/:?]
	at io.vertx.core.impl.future.Composition.complete(Composition.java:40) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:169) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.Promise.complete(Promise.java:87) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.strimzi.operator.cluster.Main.lambda$leaderElection$8(Main.java:229) ~[classes/:?]
	at io.vertx.core.internal.ContextInternal.dispatch(ContextInternal.java:249) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.internal.ContextInternal.dispatch(ContextInternal.java:231) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.internal.ContextInternal.lambda$runOnContext$0(ContextInternal.java:50) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:148) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:141) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:507) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.channel.SingleThreadIoEventLoop.run(SingleThreadIoEventLoop.java:182) ~[netty-transport-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:1073) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
2025-07-11 12:41:06 INFO  ShutdownHook:35 - Shutdown hook started
2025-07-11 12:41:06 INFO  LeaderElectionManager:82 - Stopping the Leader Elector
2025-07-11 12:41:06 INFO  LeaderElectionManager:121 - The new leader is 
2025-07-11 12:41:06 INFO  LeaderElectionManager:112 - Stopped being a leader
2025-07-11 12:41:06 INFO  Main:241 - Stopped being a leader during a shutdown
2025-07-11 12:41:06 INFO  LeaderElectionManager:86 - Waiting for Leader Elector to stop
2025-07-11 12:41:06 INFO  LeaderElectionManager:94 - Leader Elector stopped
2025-07-11 12:41:06 INFO  ShutdownHook:60 - Shutting down Vert.x
2025-07-11 12:41:08 WARN  BlockedThreadChecker:72 - Thread vert.x-eventloop-thread-0 has been blocked for 2366 ms, time limit is 2000 ms
2025-07-11 12:41:09 WARN  BlockedThreadChecker:72 - Thread vert.x-eventloop-thread-0 has been blocked for 3368 ms, time limit is 2000 ms
2025-07-11 12:41:10 WARN  BlockedThreadChecker:72 - Thread vert.x-eventloop-thread-0 has been blocked for 4373 ms, time limit is 2000 ms
2025-07-11 12:41:11 WARN  BlockedThreadChecker:77 - Thread vert.x-eventloop-thread-0 has been blocked for 5376 ms, time limit is 2000 ms
io.vertx.core.VertxException: Thread blocked
	at java.lang.Object.wait(Native Method) ~[?:?]
	at java.lang.Thread.join(Thread.java:1313) ~[?:?]
	at java.lang.Thread.join(Thread.java:1381) ~[?:?]
	at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:107) ~[?:?]
	at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46) ~[?:?]
	at java.lang.Shutdown.runHooks(Shutdown.java:130) ~[?:?]
	at java.lang.Shutdown.exit(Shutdown.java:173) ~[?:?]
	at java.lang.Runtime.exit(Runtime.java:115) ~[?:?]
	at java.lang.System.exit(System.java:1877) ~[?:?]
	at io.strimzi.operator.cluster.Main.lambda$main$4(Main.java:96) ~[classes/:?]
	at io.strimzi.operator.cluster.Main$$Lambda$480/0x00000070014ff638.handle(Unknown Source) ~[?:?]
	at io.vertx.core.impl.future.FutureImpl.lambda$onComplete$1(FutureImpl.java:103) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl$$Lambda$473/0x00000070014fb008.complete(Unknown Source) ~[?:?]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.Composition$$Lambda$501/0x0000007001503808.complete(Unknown Source) ~[?:?]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FailedFuture.addListener(FailedFuture.java:99) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.Composition.complete(Composition.java:48) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.Composition.complete(Composition.java:45) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.Composition$$Lambda$501/0x0000007001503808.complete(Unknown Source) ~[?:?]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:169) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.Promise.complete(Promise.java:87) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.strimzi.operator.cluster.Main.lambda$leaderElection$8(Main.java:229) ~[classes/:?]
	at io.strimzi.operator.cluster.Main$$Lambda$683/0x0000007001594f10.handle(Unknown Source) ~[?:?]
	at io.vertx.core.internal.ContextInternal.dispatch(ContextInternal.java:249) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.internal.ContextInternal.dispatch(ContextInternal.java:231) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.internal.ContextInternal.lambda$runOnContext$0(ContextInternal.java:50) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.internal.ContextInternal$$Lambda$684/0x0000007001595138.run(Unknown Source) ~[?:?]
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:148) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:141) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:507) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.channel.SingleThreadIoEventLoop.run(SingleThreadIoEventLoop.java:182) ~[netty-transport-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:1073) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
2025-07-11 12:41:12 WARN  BlockedThreadChecker:77 - Thread vert.x-eventloop-thread-0 has been blocked for 6381 ms, time limit is 2000 ms
io.vertx.core.VertxException: Thread blocked
	at java.lang.Object.wait(Native Method) ~[?:?]
	at java.lang.Thread.join(Thread.java:1313) ~[?:?]
	at java.lang.Thread.join(Thread.java:1381) ~[?:?]
	at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:107) ~[?:?]
	at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46) ~[?:?]
	at java.lang.Shutdown.runHooks(Shutdown.java:130) ~[?:?]
	at java.lang.Shutdown.exit(Shutdown.java:173) ~[?:?]
	at java.lang.Runtime.exit(Runtime.java:115) ~[?:?]
	at java.lang.System.exit(System.java:1877) ~[?:?]
	at io.strimzi.operator.cluster.Main.lambda$main$4(Main.java:96) ~[classes/:?]
	at io.strimzi.operator.cluster.Main$$Lambda$480/0x00000070014ff638.handle(Unknown Source) ~[?:?]
	at io.vertx.core.impl.future.FutureImpl.lambda$onComplete$1(FutureImpl.java:103) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl$$Lambda$473/0x00000070014fb008.complete(Unknown Source) ~[?:?]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.Composition$$Lambda$501/0x0000007001503808.complete(Unknown Source) ~[?:?]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FailedFuture.addListener(FailedFuture.java:99) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.Composition.complete(Composition.java:48) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.Composition.complete(Composition.java:45) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.Composition$$Lambda$501/0x0000007001503808.complete(Unknown Source) ~[?:?]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:169) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.Promise.complete(Promise.java:87) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.strimzi.operator.cluster.Main.lambda$leaderElection$8(Main.java:229) ~[classes/:?]
	at io.strimzi.operator.cluster.Main$$Lambda$683/0x0000007001594f10.handle(Unknown Source) ~[?:?]
	at io.vertx.core.internal.ContextInternal.dispatch(ContextInternal.java:249) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.internal.ContextInternal.dispatch(ContextInternal.java:231) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.internal.ContextInternal.lambda$runOnContext$0(ContextInternal.java:50) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.internal.ContextInternal$$Lambda$684/0x0000007001595138.run(Unknown Source) ~[?:?]
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:148) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:141) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:507) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.channel.SingleThreadIoEventLoop.run(SingleThreadIoEventLoop.java:182) ~[netty-transport-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:1073) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
2025-07-11 12:41:13 WARN  BlockedThreadChecker:77 - Thread vert.x-eventloop-thread-0 has been blocked for 7383 ms, time limit is 2000 ms
io.vertx.core.VertxException: Thread blocked
	at java.lang.Object.wait(Native Method) ~[?:?]
	at java.lang.Thread.join(Thread.java:1313) ~[?:?]
	at java.lang.Thread.join(Thread.java:1381) ~[?:?]
	at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:107) ~[?:?]
	at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46) ~[?:?]
	at java.lang.Shutdown.runHooks(Shutdown.java:130) ~[?:?]
	at java.lang.Shutdown.exit(Shutdown.java:173) ~[?:?]
	at java.lang.Runtime.exit(Runtime.java:115) ~[?:?]
	at java.lang.System.exit(System.java:1877) ~[?:?]
	at io.strimzi.operator.cluster.Main.lambda$main$4(Main.java:96) ~[classes/:?]
	at io.strimzi.operator.cluster.Main$$Lambda$480/0x00000070014ff638.handle(Unknown Source) ~[?:?]
	at io.vertx.core.impl.future.FutureImpl.lambda$onComplete$1(FutureImpl.java:103) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl$$Lambda$473/0x00000070014fb008.complete(Unknown Source) ~[?:?]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.Composition$$Lambda$501/0x0000007001503808.complete(Unknown Source) ~[?:?]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FailedFuture.addListener(FailedFuture.java:99) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.Composition.complete(Composition.java:48) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.Composition.complete(Composition.java:45) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.Composition$$Lambda$501/0x0000007001503808.complete(Unknown Source) ~[?:?]
	at io.vertx.core.impl.future.FutureBase.emitResult(FutureBase.java:68) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.completeInternal(FutureImpl.java:163) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:169) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.Promise.complete(Promise.java:87) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.strimzi.operator.cluster.Main.lambda$leaderElection$8(Main.java:229) ~[classes/:?]
	at io.strimzi.operator.cluster.Main$$Lambda$683/0x0000007001594f10.handle(Unknown Source) ~[?:?]
	at io.vertx.core.internal.ContextInternal.dispatch(ContextInternal.java:249) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.internal.ContextInternal.dispatch(ContextInternal.java:231) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.internal.ContextInternal.lambda$runOnContext$0(ContextInternal.java:50) ~[vertx-core-5.0.1.jar:5.0.1]
	at io.vertx.core.internal.ContextInternal$$Lambda$684/0x0000007001595138.run(Unknown Source) ~[?:?]
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:148) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:141) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:507) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.channel.SingleThreadIoEventLoop.run(SingleThreadIoEventLoop.java:182) ~[netty-transport-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:1073) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.2.2.Final.jar:4.2.2.Final]
	at java.lang.Thread.run(Thread.java:840) ~[?:?]
[... the same WARN BlockedThreadChecker "Thread blocked" message and identical stack trace repeat at 2025-07-11 12:41:14 (blocked for 8385 ms) and 2025-07-11 12:41:15 (blocked for 9387 ms) ...]
2025-07-11 12:41:16 ERROR ShutdownHook:73 - Timed out while waiting for Vert.x close
2025-07-11 12:41:16 INFO  ShutdownHook:79 - Shutdown of Vert.x is complete
2025-07-11 12:41:16 INFO  ShutdownHook:41 - Shutdown hook completed

rlanhellas avatar Jul 11 '25 15:07 rlanhellas

@rlanhellas did you delete a comment saying it worked when you set a longer timeout on the event loop?

I think we have re-established the right thread execution by having the compose chain executed on the event loop instead of the leader-election (LE) thread. The problem is the duration of System.exit(1), which takes a long time because it runs the shutdown hooks that stop the leader election machinery. Maybe it should be called on a Vert.x worker thread via an execute blocking call.

ppatierno avatar Jul 11 '25 16:07 ppatierno

@ppatierno, I added this:

        options.setMaxEventLoopExecuteTime(TimeUnit.SECONDS.toNanos(10));

We don't see Thread blocked anymore, but we still get Timed out while waiting for Vert.x close. It happens because of this:

private static final long SHUTDOWN_TIMEOUT

which is set to 10 seconds. But no matter how high I set it, it stays stuck on the event loop thread.
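
For context, a minimal sketch of where such an option would be applied when creating the Vert.x instance (illustrative only, not the actual Strimzi code; raising the limit just silences the BlockedThreadChecker warning, it does not unblock the shutdown):

    import io.vertx.core.Vertx;
    import io.vertx.core.VertxOptions;
    import java.util.concurrent.TimeUnit;

    public class VertxSetup {
        public static void main(String[] args) {
            // Raise the blocked-thread warning threshold from the 2000 ms default to 10 s.
            // The event loop is still blocked while the shutdown hooks run; only the
            // warning threshold changes.
            VertxOptions options = new VertxOptions()
                    .setMaxEventLoopExecuteTime(TimeUnit.SECONDS.toNanos(10));
            Vertx vertx = Vertx.vertx(options);
            vertx.close();
        }
    }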

rlanhellas avatar Jul 11 '25 16:07 rlanhellas

@ppatierno, we run into a deadlock too, because System.exit now runs on the Vert.x event loop: it triggers Thread.join() on the shutdownVertx thread, which waits for vertx.close() to finish, but the close can never complete because Thread.join() is blocking the event loop.

So we need to execute System.exit(1) on a Vert.x worker thread, as you mentioned. Here is my suggestion:

    vertx.executeBlocking(() -> {
        System.exit(1);
        return true;
    });
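
In context, a rough sketch of how this could be wired into the startup failure handler (names such as startFuture are illustrative assumptions, not the actual Main code):

    // Illustrative only: when start-up fails, exit from a Vert.x worker thread so the
    // event loop stays free and vertx.close() invoked from the shutdown hooks can finish.
    startFuture.onComplete(res -> {
        if (res.failed()) {
            vertx.executeBlocking(() -> {
                System.exit(1);
                return null;
            });
        }
    });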

rlanhellas avatar Jul 11 '25 18:07 rlanhellas

Of course, it works because it runs System.exit(1) on a different thread, just like your first suggestion, but using a thread from the Vert.x worker pool instead of creating a new one. IMHO it would work even without running the compose chain on the Vert.x event loop, but in any case I think that would be an error: we should always switch from a callback thread to the event loop when using Vert.x.
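
As a small self-contained sketch of that hand-off pattern (hypothetical code, not taken from the operator): a callback arriving on a non-Vert.x thread hops onto the event loop with runOnContext before completing a promise.

    import io.vertx.core.Promise;
    import io.vertx.core.Vertx;

    public class HopToEventLoop {
        public static void main(String[] args) {
            Vertx vertx = Vertx.vertx();
            Promise<Void> promise = Promise.promise();
            promise.future().onComplete(ar -> {
                // Runs on the thread that completed the promise, i.e. the event loop.
                System.out.println("completed on " + Thread.currentThread().getName());
                vertx.close();
            });

            // Simulate a callback firing on a foreign thread (like the leader election
            // thread) and switch to the Vert.x event loop before completing the promise.
            new Thread(() -> vertx.runOnContext(v -> promise.complete()), "callback-thread").start();
        }
    }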

ppatierno avatar Jul 12 '25 05:07 ppatierno

I agree @ppatierno, we should keep everything on the event loop. I will open a PR with these changes, thanks for your help.

rlanhellas avatar Jul 12 '25 11:07 rlanhellas