prometheus-kafka-consumer-group-exporter

Very long query time with many topics and consumer groups

Open · panda87 opened this issue 6 years ago • 19 comments

Hi

I've been using this plugin for a while, and it worked pretty well while I had a small number of consumers. Today I added many more consumers and new topics, and two things started to happen:

  1. I started to get failures due to long consumer_ids (I used pznamensky's branch, which fixed that) - could you please merge it, btw?
  2. The HTTP query time increased from 2-3 seconds to 20 seconds, even after I changed max-concurrent-group-queries to 10 - it just loaded my CPU cores and pushed the load up to 500%

Do you know why this happens?

D.

panda87 avatar Oct 30 '17 13:10 panda87

I am seeing the same behavior. It's so bad that Prometheus is skipping the target because the scrape of the /metrics endpoint is taking too long. It seems to be related to the number of partitions.

zot42 avatar Nov 01 '17 14:11 zot42

pznamensky's branch, which fixed that

@panda87 Would you mind sharing a link to his fix/branch so I can look into bringing it into master here? If you could submit a PR, that would be even better! 🙏

JensRantil avatar Nov 14 '17 11:11 JensRantil

I am seeing the same behavior. It's so bad that Prometheus is skipping the target because the scrape of the /metrics endpoint is taking too long.

@zot42 FYI, you can increase that timeout in Prometheus, though.
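
For reference, a minimal sketch of the relevant Prometheus settings; the job name, target, and port below are placeholders, not taken from this thread:

```yaml
# prometheus.yml (sketch): raise the per-job scrape timeout
scrape_configs:
  - job_name: kafka-consumer-group-exporter   # hypothetical job name
    scrape_interval: 60s
    scrape_timeout: 50s                       # must not exceed scrape_interval
    static_configs:
      - targets: ['exporter-host:8080']       # placeholder host:port
```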

JensRantil avatar Nov 14 '17 11:11 JensRantil

Do you know why this happens?

@panda87 Would you mind measuring how long it takes to list your topics using kafka-consumer-groups.sh, as well as querying the lag? I've also created #47, which would help diagnose issues like yours.
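
For example, something along these lines gives a rough timing of the kafka-consumer-groups.sh calls (broker address and group name are placeholders):

```sh
# Time listing all consumer groups, then describing one of them
time kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
time kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group
```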

JensRantil avatar Nov 14 '17 12:11 JensRantil

Would you mind sharing a link to his fix/branch so I can look into bringing it into master here? If you could submit a PR, that would be even better!

@panda87 Never mind. Please ignore my comment, I just saw his PR. ;)

JensRantil avatar Nov 14 '17 12:11 JensRantil

@JensRantil I used his PR, but I still get errors like this:

goroutine 1290463 [running]:
panic(0x7f2880, 0xc420012080)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*regexpParser).parseLine(0xc4200acee0, 0xc4207680b1, 0xe1, 0xc4200ef400, 0x0, 0x10)
	/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/parsing.go:134 +0x4d2
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*regexpParser).Parse(0xc4200acee0, 0xc420768000, 0xed1, 0xc420126e80, 0x77, 0x0, 0x0, 0x0, 0xa51ce0, 0xc420552550)
	/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/parsing.go:74 +0x189
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*DelegatingParser).Parse(0xc4200fb120, 0xc420768000, 0xed1, 0xc420126e80, 0x77, 0x3, 0xc420768000, 0xed1, 0xc420126e80, 0x77)
	/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/parsing.go:211 +0x8f
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*ConsumerGroupsCommandClient).DescribeGroup(0xc4200e8c60, 0xa58460, 0xc42011d080, 0xc4200777a0, 0xc, 0x1, 0x13bb, 0x0, 0xc420502757, 0xc)
	/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/collector.go:79 +0x105
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync.(*FanInConsumerGroupInfoClient).describeLoop.func1(0xc420011780, 0xc4200777a0, 0xc, 0xc42011d0e0, 0xa58460, 0xc42011d080, 0xc42011d140)
	/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync/metrics.go:159 +0x71
created by github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync.(*FanInConsumerGroupInfoClient).describeLoop
	/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync/metrics.go:164 +0x178
time="2017-11-13T07:35:37Z" level=warning msg="unable to find current offset field. line: counter_raw_events             0          1               1               0          event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56                 event-stream-StreamThread-2-consumer" source="parsing.go:110"
time="2017-11-13T07:35:37Z" level=warning msg="unable to parse int for lag. line: %scounter_raw_events             0          1               1               0          event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56                 event-stream-StreamThread-2-consumer" source="parsing.go:118"
time="2017-11-13T07:35:37Z" level=warning msg="unable to find current offset field. Line: counter_raw_events             0          1               1               0          event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56                 event-stream-StreamThread-2-consumer" source="parsing.go:125"
time="2017-11-13T07:35:37Z" level=warning msg="unable to parse int for current offset. Line: %scounter_raw_events             0          1               1               0          event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56                 event-stream-StreamThread-2-consumer" source="parsing.go:130"
panic: runtime error: index out of range

panda87 avatar Nov 14 '17 12:11 panda87

@panda87 That looks like a different issue than this. Please open a new issue (and specify which version/commit you are running).

JensRantil avatar Nov 14 '17 12:11 JensRantil

Ok, I will create a new issue.

panda87 avatar Nov 14 '17 12:11 panda87

It seems that after pulling the latest changes from this repo I don't get the errors above anymore, so thanks! Now it's only the response time, which is still high.

panda87 avatar Nov 14 '17 12:11 panda87

@panda87 Good! I know I saw that error when I was recently revamping some of the parsing logic.

JensRantil avatar Nov 14 '17 12:11 JensRantil

I noticed the response time being high too... I wonder if it's actually Kafka that is taking a long time to respond, rather than Prometheus.

cl0udgeek avatar Nov 16 '17 05:11 cl0udgeek

I noticed the response time being high too... I wonder if it's actually Kafka that is taking a long time to respond, rather than Prometheus.

I'm pretty sure it is. #47 will help us tell whether that's the case.

JensRantil avatar Nov 16 '17 07:11 JensRantil

any update on this one?

cl0udgeek avatar Dec 13 '17 23:12 cl0udgeek

Unfortunately not. Pull requests to fix #47 would be welcome. I've been pretty busy lately and haven't had time to get back to this 😥

JensRantil avatar Dec 13 '17 23:12 JensRantil

Any update on this?

cl0udgeek avatar Feb 13 '18 12:02 cl0udgeek

Unfortunately not.

JensRantil avatar Feb 14 '18 15:02 JensRantil

Might be worth mentioning that a colleague of mine claimed lag is now exposed through JMX. A workaround might be to look at using jmx_exporter instead of this exporter.

JensRantil avatar Feb 14 '18 15:02 JensRantil

Wait, what? That'd be awesome if it is! Do you know which Kafka version? To clarify... it's always been there on the consumer side, but not on the server side as far as I know.

cl0udgeek avatar Feb 14 '18 15:02 cl0udgeek

@k1ng87,

If you're interested in consumer lag it's published via JMX by the consumer:

Old consumer: kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+)

New consumer: kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id} Attribute: records-lag-max

Replication lag is published by the broker:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)

See the official Kafka documentation for more details: https://kafka.apache.org/documentation/#monitoring. I checked only version 1.0, the latest one as of now. Hope this helps.
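
For reference, a minimal jmx_exporter rule sketch for the broker-side replication lag MBean above; the metric and label names are illustrative assumptions, not from this thread:

```yaml
rules:
  # Expose kafka.server FetcherLagMetrics ConsumerLag as a gauge,
  # keeping clientId, topic and partition as labels (names are illustrative).
  - pattern: kafka.server<type=FetcherLagMetrics, name=ConsumerLag, clientId=(.+), topic=(.+), partition=(.+)><>Value
    name: kafka_server_fetcher_consumer_lag
    type: GAUGE
    labels:
      clientId: "$1"
      topic: "$2"
      partition: "$3"
```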

raindev avatar Feb 14 '18 16:02 raindev