
Kafka Exporter has **CrashLoopBackOff** and can't recover.

Open · nautiam opened this issue 2 years ago · 28 comments

Please use this only for bug reports. For questions or when you need help, you can use GitHub Discussions, our #strimzi Slack channel, or our user mailing list.

Describe the bug
Kafka Exporter has CrashLoopBackOff and can't recover.

To Reproduce
Sometimes, when I create a Kafka cluster, it works at first. But after a while, the Kafka Exporter goes into CrashLoopBackOff and stays in that status.

Readiness probe failed: Get "http://10.130.0.49:9404/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
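This error means the kubelet's HTTP GET against `/metrics` did not receive response headers within the probe's timeout. The failure mode can be reproduced locally with a plain Python sketch (a stand-in handler, not the real exporter; the 2 s delay and 0.5 s timeout are illustrative):

```python
# Sketch of the probe failure: an HTTP endpoint that responds slower
# than the client's deadline produces exactly this class of error.
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request

class SlowMetrics(http.server.BaseHTTPRequestHandler):
    """Stand-in for a /metrics endpoint stuck on a long scrape."""
    def do_GET(self):
        time.sleep(2)  # e.g. blocked refreshing client metadata
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"kafka_up 1\n")
    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.HTTPServer(("127.0.0.1", 0), SlowMetrics)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/metrics" % server.server_address[1]

try:
    # A readiness probe is essentially a GET with a deadline
    # (timeoutSeconds defaults to 1 in Kubernetes).
    urllib.request.urlopen(url, timeout=0.5)
    result = "ready"
except (TimeoutError, socket.timeout, urllib.error.URLError):
    result = "probe failed: timeout"

print(result)
```

In the exporter's case the scrape handler appears to be blocked on the metadata refresh visible in the "concurrent calls detected" log entries, so the probe times out repeatedly and the kubelet restarts the container.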

Expected behavior
A clear and concise description of what you expected to happen.

Environment (please complete the following information):

  • Strimzi version: main
  • Installation method: OperatorHub
  • Kubernetes cluster: OpenShift 4.9
  • Infrastructure: BareMetal

YAML files and logs

[kafka_exporter] [INFO] 2022/01/04 07:38:01 Starting kafka_exporter (version=1.3.1.redhat-00001, branch=master, revision=eb1f5c4229ce4ca51d64d2034926ce64c60e05e9)
[kafka_exporter] [INFO] 2022/01/04 07:38:01 Build context (go=go1.13, user=worker@pnc-ba-pod-4c2d6e, date=20210708-16:03:34)
[kafka_exporter] [INFO] 2022/01/04 07:38:01 Done Init Clients
[kafka_exporter] [INFO] 2022/01/04 07:38:01 Listening on :9404
[kafka_exporter] [INFO] 2022/01/04 07:38:02 Refreshing client metadata
[kafka_exporter] [INFO] 2022/01/04 07:38:05 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:38:17 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:38:32 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:38:35 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:38:45 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:38:47 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:39:05 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:39:15 concurrent calls detected, waiting for first to finish
[kafka_exporter] [INFO] 2022/01/04 07:39:15 concurrent calls detected, waiting for first to finish

nautiam avatar Jan 04 '22 07:01 nautiam

@alesj you did some work on this, any idea? I see it's not upstream Strimzi either, but an older Red Hat AMQ Streams version (1.3.1). I'm not sure whether your improvements are included in that version and would have fixed this problem.

ppatierno avatar Jan 04 '22 08:01 ppatierno

> @alesj you did some work on this, any idea? I see it's not upstream Strimzi either, but an older Red Hat AMQ Streams version (1.3.1). I'm not sure whether your improvements are included in that version and would have fixed this problem.

I'm using Red Hat AMQ Streams version 1.8.4. Btw, I see that this bug usually happens when I create a cluster with 1 broker and 1 ZooKeeper.

nautiam avatar Jan 04 '22 08:01 nautiam

> I'm using Red Hat AMQ Streams version 1.8.4.

Yeah sorry, 1.3.1 is clearly the kafka-exporter version, not the AMQ Streams one. I think that version should have some improvements from @alesj.

Is everything up and running and working correctly from the Kafka clients' point of view when you see this message in the Kafka Exporter? I mean, are the ZooKeeper and Kafka pods up and running? Are you able to use Kafka clients to exchange messages without problems? The Kafka Exporter runs some admin calls against the Kafka cluster; I'm not sure whether problems on the cluster side are raising the issue in the exporter.

ppatierno avatar Jan 04 '22 09:01 ppatierno

> Is everything up and running and working correctly from the Kafka clients' point of view when you see this message in the Kafka Exporter?

Yes, the Kafka broker and ZooKeeper pods are still running. I can use Kafka clients to produce and consume messages without any problems. Only the Kafka Exporter crashed.

nautiam avatar Jan 04 '22 09:01 nautiam

If you are not using Strimzi, you should probably not raise it here but raise this with Red Hat support instead.

scholzj avatar Jan 04 '22 09:01 scholzj

I tried to send messages to the Kafka cluster, and after a while the Kafka Exporter worked again.

nautiam avatar Jan 04 '22 09:01 nautiam

I found a reply from the author saying that this is not a bug: https://issueexplorer.com/issue/danielqsj/kafka_exporter/259

I want to enable the concurrent.enable flag for the Kafka Exporter. Is there any way to do it?

nautiam avatar Jan 04 '22 09:01 nautiam

> If you are not using Strimzi, you should probably not raise it here but raise this with Red Hat support instead.

I switched to Strimzi 0.27.0 and I still hit this issue. This is my cluster template.

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  entityOperator:
    userOperator: {}
  kafka:
    authorization:
      type: simple
    config:
      inter.broker.protocol.version: '2.8'
      delete.topic.enable: true
      socket.request.max.bytes: 10485760
      client.quota.callback.class: io.strimzi.kafka.quotas.StaticQuotaCallback
      transaction.state.log.replication.factor: 2
      queued.max.requests: 100
      client.quota.callback.static.fetch: 104857600
      client.quota.callback.static.produce: 104857600
      log.message.format.version: '2.8'
      transaction.state.log.min.isr: 1
      replica.fetch.max.bytes: 10485760
      max.message.bytes: 5242880
      offsets.topic.replication.factor: 2
    listeners:
      - authentication:
          type: scram-sha-512
        name: plain
        port: 9092
        tls: false
        type: internal
      - authentication:
          type: scram-sha-512
        name: tls
        port: 9093
        tls: true
        type: internal
      - authentication:
          type: scram-sha-512
        name: external
        port: 9094
        tls: true
        type: route
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          key: kafka-metrics-config.yml
          name: kafka-metrics
    replicas: 2
    storage:
      type: ephemeral
    version: 2.8.0
  kafkaExporter:
    groupRegex: .*
    topicRegex: .*
  zookeeper:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          key: zookeeper-metrics-config.yml
          name: kafka-metrics
    replicas: 2
    storage:
      type: ephemeral

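For reference, the kafkaExporter section of the Kafka CR should also accept resources and probe overrides (if I'm reading Strimzi's KafkaExporterSpec schema right), which can help rule out scheduling and probe-timing issues while debugging. A sketch; the values are illustrative, not recommendations:

```yaml
  kafkaExporter:
    groupRegex: .*
    topicRegex: .*
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 256Mi
    # Probe schema: initialDelaySeconds / timeoutSeconds, as elsewhere in Strimzi
    readinessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 15
    livenessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 15
```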
nautiam avatar Jan 10 '22 07:01 nautiam

So, can you please share the log from Strimzi 0.27.0?

scholzj avatar Jan 10 '22 10:01 scholzj

So, can you please share the log from Strimzi 0.27.0?

This is the log from the kafka-exporter:

+ exec /usr/bin/tini -w -e 143 -- /tmp/run.sh
I0110 10:09:55.957724      12 kafka_exporter.go:769] Starting kafka_exporter (version=1.4.2, branch=HEAD, revision=0d5d4ac4ba63948748cc2c53b35ed95c310cd6f2)
I0110 10:09:56.070116      12 kafka_exporter.go:929] Listening on HTTP :9404

It always gets this error:

Readiness probe failed: Get "http://10.131.1.52:9404/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

I don't know where to get the Strimzi log. Can you tell me how?

nautiam avatar Jan 10 '22 10:01 nautiam

Did you give it enough resources (CPU and memory)?

scholzj avatar Jan 10 '22 10:01 scholzj

Did you give it enough resources (CPU and memory)?

I don't set resource limits for the Kafka Exporter, so it can allocate unlimited resources.

nautiam avatar Jan 10 '22 10:01 nautiam

Well, there can be some resources set automatically through a LimitRange etc. It is also unlimited if not set, but that doesn't tell you what is actually available. That is why it is always a good idea to set the resources. I normally give it this in my cluster:

    resources:
      requests:
        memory: 256Mi
        cpu: "0.5"
      limits:
        memory: 256Mi
        cpu: "0.5"

But it has just a few topics. What is appropriate might depend on your cluster size.

scholzj avatar Jan 10 '22 10:01 scholzj

I saw that the kafka-exporter used ~10M of memory and 100m of CPU before it died. I will try to set the resources for the kafka-exporter and report the result later.

nautiam avatar Jan 10 '22 10:01 nautiam

> Well, there can be some resources set automatically through a LimitRange etc. It is also unlimited if not set, but that doesn't tell you what is actually available. That is why it is always a good idea to set the resources. I normally give it this in my cluster:
>
>     resources:
>       requests:
>         memory: 256Mi
>         cpu: "0.5"
>       limits:
>         memory: 256Mi
>         cpu: "0.5"
>
> But it has just a few topics. What is appropriate might depend on your cluster size.

I still get the error even when I set resource limits for the Kafka Exporter.

nautiam avatar Jan 12 '22 08:01 nautiam

I don't know then, I'm afraid.

scholzj avatar Jan 12 '22 08:01 scholzj

I am also facing the same issue. Any updates on this issue?

sdandamudi5c avatar Feb 28 '22 18:02 sdandamudi5c

I'm afraid not. I have to skip the kafka-exporter to avoid it.

nautiam avatar Mar 01 '22 02:03 nautiam

This does not give enough information to troubleshoot. I am using Strimzi 0.27 and cannot get the exporter to be stable; it seems to be crash-looping always.


Liveness probe failed: Get "http://10.225.18.21:9404/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Back-off restarting failed container

@scholzj, could you please help?

hari819 avatar Mar 15 '22 05:03 hari819

This may be something to do with Prometheus. I am running the exporter without any resources, so it's unlimited. I will verify on the Prometheus side as well.

hari819 avatar Mar 15 '22 05:03 hari819

No resources does not mean unlimited. There could be a LimitRange setting some defaults. And even if there weren't, the exporter might get only the little that is left, because without the resources set, Kubernetes will not properly know what the pod needs and where to schedule it.
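For context, such namespace defaults come from a core v1 LimitRange object; any container created without explicit resources inherits them. A minimal example with illustrative values (the namespace name is hypothetical):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: kafka            # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:         # becomes the container's requests when unset
        cpu: 100m
        memory: 64Mi
      default:                # becomes the container's limits when unset
        cpu: 250m
        memory: 128Mi
```

Running `kubectl get limitrange -n <namespace>` shows whether one applies to the exporter's namespace.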

scholzj avatar Mar 15 '22 09:03 scholzj

@scholzj As in my comment above, I set resource limits but still got this error. The kafka-exporter consumes very few resources, so I think resources are not the reason.

nautiam avatar Mar 15 '22 09:03 nautiam

Yeah, I do not know what the problem is I'm afraid.

scholzj avatar Mar 15 '22 09:03 scholzj

I have hit the same problem and did some investigation. I found sort of the culprit of the issue, even though I am not sure what is causing it.

For reference, I am running Strimzi 0.27.0 and a Kafka cluster on 2.8.1.

In my case the timeouts on the kafka-exporter (which seem to be about metadata refreshing) seem related to whether the Topic Operator is running or not.

  • When the cluster is just created and no topics exist, the kafka-exporter works
  • As soon as I create a topic without the Topic Operator running, the kafka-exporter starts to time out
  • If the Topic Operator is running, and so the Strimzi topics are created, the kafka-exporter works and never times out, even when extra topics are created by hand and not through the KafkaTopic CRD

Steps to reproduce in my case

  • create kafka without Entity Operator
  • No topics exist in the cluster
$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --list

#
  • kafka-exporter works fine
  • create topic manually
$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --create --topic test
Created topic test.

$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --list
test
  • kafka-exporter starts to time out on refreshing metadata
test-kafka-exporter-54d7cb8558-4nrkv test-kafka-exporter I0610 13:19:12.711584      11 kafka_exporter.go:366] Refreshing client metadata
test-kafka-exporter-54d7cb8558-4nrkv telegraf 2022-06-10T13:20:03Z E! [inputs.prometheus] Error in plugin: error making HTTP request to http://127.0.0.1:9404/metrics: Get "http://127.0.0.1:9404/metrics": context deadline exceeded (Client.Timeout exceeded while awaiting headers) 
  • delete topic manually
$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --delete --topic test

$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --list
  • kafka-exporter works fine
  • add topic operator to kafka resource
  • strimzi topics are created
$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --list
__consumer_offsets
__strimzi-topic-operator-kstreams-topic-store-changelog
__strimzi_store_topic
  • kafka-exporter works fine
  • create topic manually as before
$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --create --topic test
Created topic test.

$ kubectl exec -t -i test-kafka-0 -c kafka -- ./bin/kafka-topics.sh --bootstrap-server localhost:9093 --list
__consumer_offsets
__strimzi-topic-operator-kstreams-topic-store-changelog
__strimzi_store_topic
test
  • kafka-exporter works fine

edit:

I think that in my case this is caused by the lack of the __consumer_offsets topic, which must make the kafka-exporter upset for some reason.

This topic is created by Kafka when the first client using the consumer group management API connects.

I checked on another cluster where I don't have the Entity Operator and where I knew I did not have the issue... but there I am running a test MirrorMaker, which is triggering the creation of the __consumer_offsets topic.

primeroz avatar Jun 10 '22 13:06 primeroz

@primeroz The Kafka Exporter needs to collect the committed offsets from there to compute the lag. So in some way it makes sense that it needs this topic, and without any consumers lag monitoring makes little sense. But I can see how it could behave in some nicer way as well. Maybe this is something you can raise with the Kafka Exporter project?

scholzj avatar Jun 10 '22 13:06 scholzj

Yes, I was planning to. In the meantime I came here because I remembered this issue existed when I noticed it on one of our clusters, and I thought it would be good to update here.

I don't understand why the kafka-exporter does not trigger the creation of the topic, but that's above my skill set.

Now I really just wanted to make sure there are no issues with Strimzi, since I had deployed quite a few of them! :)

primeroz avatar Jun 10 '22 13:06 primeroz

I don't think it really uses the consumer groups. It just reads and decodes the content of that topic, so it does not trigger its creation.

scholzj avatar Jun 10 '22 14:06 scholzj

Triaged on 2.8.2022: This has to be fixed in the Kafka Exporter. Once it has a new release, we can update Strimzi to use it. We should also add a warning to the docs about this problem (e.g. something like "If you don't use consumer groups, it will not work"... just with fancier wording :-o). It can be added, for example, somewhere here: https://strimzi.io/docs/operators/latest/deploying.html#con-metrics-kafka-exporter-lag-str

CC @PaulRMellor ^^^

We also need to re-open the discussion about the Kafka Exporter's future, since it has been a long time since its last release or any fixed issues.

scholzj avatar Aug 02 '22 16:08 scholzj

Hi @scholzj -- I'll add something to the docs.

PaulRMellor avatar Aug 26 '22 15:08 PaulRMellor