strimzi-kafka-operator
Kafka Exporter has **CrashLoopBackOff** and can't recover.
Please use this only for bug reports. For questions or when you need help, you can use GitHub Discussions, our #strimzi Slack channel or our user mailing list.
Describe the bug: Kafka Exporter has CrashLoopBackOff and can't recover.
To Reproduce: Sometimes, when I create a Kafka cluster, it works at first. But after a while, the Kafka Exporter goes into CrashLoopBackOff and stays in this status.
Readiness probe failed: Get "http://10.130.0.49:9404/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
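For reference, a typical way to confirm these probe timeouts and the restart loop on the exporter pod would be something like the following sketch (namespace and pod name are placeholders):

```sh
# Namespace and pod name are placeholders; adjust to your cluster.
kubectl -n kafka get pods | grep kafka-exporter                          # shows CrashLoopBackOff and restart count
kubectl -n kafka describe pod my-cluster-kafka-exporter-<pod-suffix>     # readiness probe failures appear under Events
kubectl -n kafka logs my-cluster-kafka-exporter-<pod-suffix> --previous  # logs of the previously crashed container
```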
Environment (please complete the following information):
- Strimzi version: main
- Installation method: OperatorHub
- Kubernetes cluster: Openshift 4.9
- Infrastructure: BareMetal
YAML files and logs
[kafka_exporter] [INFO] 2022/01/04 07:38:01 Starting kafka_exporter (version=1.3.1.redhat-00001, branch=master, revision=eb1f5c4229ce4ca51d64d2034926ce64c60e05e9)
[kafka_exporter] [INFO] 2022/01/04 07:38:01 Build context (go=go1.13, user=worker@pnc-ba-pod-4c2d6e, date=20210708-16:03:34)
[kafka_exporter] [INFO] 2022/01/04 07:38:01 Done Init Clients
[kafka_exporter] [INFO] 2022/01/04 07:38:01 Listening on :9404
[kafka_exporter] [INFO] 2022/01/04 07:38:02 Refreshing client metadata
[kafka_exporter] [INFO] 2022/01/04 07:38:05 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:38:17 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:38:32 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:38:35 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:38:45 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:38:47 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:39:05 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:39:15 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:39:15 concurrent calls detected, waiting for first to finish
@alesj you did some work on this, any idea? I see it's not upstream Strimzi as well, but an older Red Hat AMQ Streams version (1.3.1). Not sure if your improvements are included in that version and they fixed this problem.
I'm using Red Hat AMQ Streams version 1.8.4. Btw, I see that this bug usually happens when I create a cluster with 1 broker and 1 ZooKeeper.
Yeah sorry, 1.3.1 is clearly the kafka-exporter version not the AMQ Streams one. I think that version should have some improvements from @alesj
Is everything up and running and working correctly from the Kafka clients' point of view when you see this message in the Kafka Exporter? I mean, are the ZK and Kafka pods up and running? Are you able to use Kafka clients to exchange messages without problems? The Kafka Exporter should run some admin calls against the Kafka cluster, so I am not sure if some problems on the cluster side are raising the issue in the exporter.
Yes, the Kafka broker and ZK pods are still running. I can use Kafka clients to produce and consume messages without any problems. Only the Kafka Exporter has crashed.
If you are not using Strimzi, you should probably not raise it here but raise this with Red Hat support instead.
I tried to send messages to the Kafka cluster, and after a while the Kafka Exporter worked again.
I found a reply from the author and he said that this is not a bug. https://issueexplorer.com/issue/danielqsj/kafka_exporter/259
I want to enable the concurrent.enable flag for the Kafka Exporter. Is there any way to do it?
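For what it's worth, a sketch of how to check which options Strimzi currently passes to the exporter container is to inspect the generated Deployment; the `<cluster>-kafka-exporter` name pattern and the namespace are assumptions here:

```sh
# Inspect the exporter Deployment generated by Strimzi to see its container
# command, arguments and environment; names here are placeholders.
kubectl -n kafka get deployment my-cluster-kafka-exporter -o yaml
```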
If you are not using Strimzi, you should probably not raise it here but raise this with Red Hat support instead.
I changed to Strimzi 0.27.0 and I still hit this issue. This is my cluster template:
```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  entityOperator:
    userOperator: {}
  kafka:
    authorization:
      type: simple
    config:
      inter.broker.protocol.version: '2.8'
      delete.topic.enable: true
      socket.request.max.bytes: 10485760
      client.quota.callback.class: io.strimzi.kafka.quotas.StaticQuotaCallback
      transaction.state.log.replication.factor: 2
      queued.max.requests: 100
      client.quota.callback.static.fetch: 104857600
      client.quota.callback.static.produce: 104857600
      log.message.format.version: '2.8'
      transaction.state.log.min.isr: 1
      replica.fetch.max.bytes: 10485760
      max.message.bytes: 5242880
      offsets.topic.replication.factor: 2
    listeners:
      - authentication:
          type: scram-sha-512
        name: plain
        port: 9092
        tls: false
        type: internal
      - authentication:
          type: scram-sha-512
        name: tls
        port: 9093
        tls: true
        type: internal
      - authentication:
          type: scram-sha-512
        name: external
        port: 9094
        tls: true
        type: route
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          key: kafka-metrics-config.yml
          name: kafka-metrics
    replicas: 2
    storage:
      type: ephemeral
    version: 2.8.0
  kafkaExporter:
    groupRegex: .*
    topicRegex: .*
  zookeeper:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          key: zookeeper-metrics-config.yml
          name: kafka-metrics
    replicas: 2
    storage:
      type: ephemeral
```
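As a usage note, after applying such a template the exporter pod can be watched with something along these lines (file name and namespace are placeholders):

```sh
# Apply the Kafka custom resource and watch the exporter pod come up.
kubectl -n kafka apply -f kafka-cluster.yaml
kubectl -n kafka get pods -w | grep kafka-exporter
```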
So, can you please share the log from Strimzi 0.27.0?
This is the log from kafka-exporter
+ exec /usr/bin/tini -w -e 143 -- /tmp/run.sh
I0110 10:09:55.957724 12 kafka_exporter.go:769] Starting kafka_exporter (version=1.4.2, branch=HEAD, revision=0d5d4ac4ba63948748cc2c53b35ed95c310cd6f2)
I0110 10:09:56.070116 12 kafka_exporter.go:929] Listening on HTTP :9404
It always gets this error:
Readiness probe failed: Get "http://10.131.1.52:9404/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
I don't know where to get the Strimzi log. Can you tell me how?
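For reference, the log usually asked for here is the Cluster Operator log; a minimal sketch of how to pull it, assuming the default deployment name strimzi-cluster-operator and a placeholder namespace (an OperatorHub install may name it differently):

```sh
# Cluster Operator log; the deployment name and namespace are assumptions.
kubectl -n kafka logs deployment/strimzi-cluster-operator --tail=200

# If the name differs (e.g. with an OperatorHub install), find it first:
kubectl get deployments --all-namespaces | grep -i strimzi
```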
Did you give it enough resources (CPU and memory)?
I don't set resource limits for the Kafka Exporter, so it can allocate unlimited resources.
Well, there can be some resources set automatically through a LimitRange etc. It is also unlimited if not set, but that doesn't answer what is available. That is why it is always a good idea to set the resources. I normally give it this in my cluster:
```yaml
resources:
  requests:
    memory: 256Mi
    cpu: "0.5"
  limits:
    memory: 256Mi
    cpu: "0.5"
```
But it has just a few topics. What is appropriate might depend on your cluster size.
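For completeness, a minimal sketch of where such resources go in the Kafka custom resource, i.e. under spec.kafkaExporter.resources (the values are just the ones quoted above):

```yaml
# Sketch: resource requests/limits for the Kafka Exporter in the Kafka CR.
spec:
  kafkaExporter:
    groupRegex: .*
    topicRegex: .*
    resources:
      requests:
        memory: 256Mi
        cpu: "0.5"
      limits:
        memory: 256Mi
        cpu: "0.5"
```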
I saw that the kafka-exporter used ~10M memory and 100m CPU before it died. I will try to set the resources for the kafka-exporter and report the result later.
I still get the error even when I set resource limits for the Kafka Exporter.
I don't know then, I'm afraid.
I am also facing the same issue. Any updates on this issue?
I'm afraid not. I have to skip kafka-exporter to avoid it.
This does not give enough information to troubleshoot. I am using Strimzi 0.27 and cannot get the exporter to be stable; it seems to be crash-looping always.
Liveness probe failed: Get "http://10.225.18.21:9404/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers) Back-off restarting failed container
@scholzj , could you please help
This may have something to do with Prometheus. I am running the exporter without any resources, so it's unlimited. I will verify on the Prometheus side as well.
No resources does not mean unlimited. There could be a LimitRange setting some defaults. And even if there isn't, it means the pod might get only the little resources which are left, because without setting the resources Kubernetes will not know properly what it needs and where to schedule it.
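As a quick check for the LimitRange defaults mentioned here (namespace is a placeholder):

```sh
# Check whether a LimitRange sets default requests/limits in the namespace.
kubectl -n kafka get limitrange
kubectl -n kafka describe limitrange
```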
@scholzj As in my comment above, I set the resource limits but still got this error. The kafka-exporter consumes very few resources, so I don't think resources are the reason.
Yeah, I do not know what the problem is, I'm afraid.
I have hit the same problem and did some investigation. I found sort of the culprit of the issue, even though I am not sure what is causing it.
For reference I am running Strimzi 0.27.0 and a Kafka cluster 2.8.1.
In my case the timeouts on the kafka-exporter (which seem to be about metadata refreshing) seem related to having the Topic Operator running or not:
- When the cluster is just created and no topics exist, kafka-exporter works.
- As soon as I create a topic without the Topic Operator running, kafka-exporter starts to time out.
- If the Topic Operator is running, and so the Strimzi topics are created, the kafka-exporter works and never times out, even when extra topics are created by hand and not through the KafkaTopic CRD.
Steps to reproduce in my case
- create kafka without Entity Operator
- No topics exist in the cluster
$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --list
#
- kafka-exporter works fine
- create topic manually
$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --create --topic test
Created topic test.
$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --list
test
- kafka-exporter starts to time out on refreshing metadata
test-kafka-exporter-54d7cb8558-4nrkv test-kafka-exporter I0610 13:19:12.711584 11 kafka_exporter.go:366] Refreshing client metadata
test-kafka-exporter-54d7cb8558-4nrkv telegraf 2022-06-10T13:20:03Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://127.0.0.1:9404/metrics: Get "http://127.0.0.1:9404/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
- delete topic manually
$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --delete --topic test
$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --list
- kafka-exporter works fine
- add topic operator to kafka resource
- strimzi topics are created
$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --list
__consumer_offsets
__strimzi-topic-operator-kstreams-topic-store-changelog
__strimzi_store_topic
- kafka-exporter works fine
- create topic manually as before
$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --create --topic test
Created topic test.
$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --list
__consumer_offsets
__strimzi-topic-operator-kstreams-topic-store-changelog
__strimzi_store_topic
test
- kafka-exporter works fine
edit:
I think that in my case this is caused by the lack of the __consumer_offsets topic, and that must make kafka-exporter upset for some reason. This topic is created by Kafka when the first client using the consumer management API connects.
I checked on another cluster where I don't have the Entity Operator and where I knew I did not have the issue ... but there I am running a test MirrorMaker, which triggers the creation of the __consumer_offsets topic.
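Based on that observation, one possible workaround is to force the creation of __consumer_offsets by connecting any consumer with a group id, for example (the group name and timeout are placeholders; the listener mirrors the commands above):

```sh
# Connecting a consumer with a group id makes the broker create the
# __consumer_offsets topic; group name and timeout are placeholders.
kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9093 --topic test --group probe-group --timeout-ms 10000
kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh \
  --bootstrap-server localhost:9093 --list
```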
@primeroz The Kafka Exporter will need to collect the committed offsets from there to collect the lag. So in some way it makes sense that it needs this topic, and without any consumers the lag monitoring makes little sense. But I can see how it could behave in some nicer way as well. Maybe this is something you can raise with the Kafka Exporter project?
Yes, I was planning to. In the meantime I came here because I remembered this issue existed when I noticed it on one of our clusters, so I thought it would be good to update here.
I don't understand why the kafka-exporter does not trigger the creation of the topic, but that's above my skillset.
Now I really just wanted to make sure there are no issues with Strimzi, since I had deployed quite a few of them! :)
I don't think it really uses the consumer groups. It just reads and decodes the content of the topic, so it does not trigger its creation.
Triaged on 2.8.2022: This has to be fixed in the Kafka Exporter. Once it has a new release, we can update Strimzi to use it. We should also add a warning to the docs about this problem (=> e.g. something like If you don't use consumer groups, it will not work ... just with more fancy wording :-o). It can be added for example somewhere here: https://strimzi.io/docs/operators/latest/deploying.html#con-metrics-kafka-exporter-lag-str
CC @PaulRMellor ^^^
We need to also re-open the discussion about the Kafka Exporter's future, since it has been a long time since a new release or any fixed issues.
Hi @scholzj -- I'll add something to the docs.