
Question: How to observe consumer lag, in a centralized way, in Confluent Cloud

robertpaker888 opened this issue 4 years ago • 2 comments

Hi,

We want to centralize the way we handle consumer lag. Currently we have a number of services deployed in an AWS ECS Fargate cluster. Each service is configured to receive a statistics payload, which we inspect for consumer lag; we then publish a custom metric into CloudWatch, which allows us to act accordingly. The downside to this approach is that we have to update every new service with this code. It's also reliant on the driver providing this feature, and not all our services are written in the same language.

Burrow would appear to be the tool for the job, although from my reading it seems it doesn't work in Confluent Cloud. Is this still the case?

If Burrow cannot work in Confluent Cloud, we see two other options, described below. Your thoughts would be greatly appreciated:

Option 1.

Create a new ECS service in our cluster that runs a script which, on an interval, runs the kafka-consumer-groups console admin tool and then publishes the CloudWatch metric. In your experience, does this seem like a sensible approach that could work?
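Something like this is what I have in mind for the check itself (a rough sketch; the bootstrap server, config path, and group name are placeholders, and the --command-config file would carry the Confluent Cloud credentials):

```bash
# Describe one consumer group against the Confluent Cloud cluster; the
# output includes a LAG column for each topic-partition of the group.
kafka-consumer-groups.sh --bootstrap-server pkc-xxxxx.us-east-1.aws.confluent.cloud:9092 \
  --command-config /etc/kafka/ccloud.properties \
  --describe --group my-consumer-group
```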

Option 2.

The other option I've seen, which may work in Confluent Cloud, is to set up an alert on consumer lag in Control Center, then have a Lambda in AWS issue a REST API request on a schedule to get the alert history and push the metric from there.

robertpaker888 avatar Jul 06 '20 07:07 robertpaker888

Hi,

I'm in a similar situation. I think option 2 won't work, as Confluent Cloud doesn't keep consumer lag alert history, per https://github.com/vdesabou/kafka-docker-playground/tree/master/ccloud/ccloud-demo#alerts .

Have you looked at the other options described at https://docs.confluent.io/cloud/current/monitoring/monitor-lag.html ? I'm wondering what your decision was.
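For reference, the docs above describe pulling lag from the Metrics API with a single HTTP query. A rough sketch (the cluster id, credentials, and interval are placeholders, and I'm assuming the io.confluent.kafka.server/consumer_lag_offsets metric is available for your cluster):

```bash
# Sketch: query the Confluent Cloud Metrics API for consumer lag, grouped
# by consumer group. lkc-xxxxx and the API key/secret are placeholders.
curl -s -u "$METRICS_API_KEY:$METRICS_API_SECRET" \
  -H 'Content-Type: application/json' \
  -d '{
        "aggregations": [{"metric": "io.confluent.kafka.server/consumer_lag_offsets"}],
        "filter": {"field": "resource.kafka.id", "op": "EQ", "value": "lkc-xxxxx"},
        "granularity": "PT1M",
        "group_by": ["metric.consumer_group_id"],
        "intervals": ["2021-02-10T00:00:00Z/2021-02-10T01:00:00Z"]
      }' \
  https://api.telemetry.confluent.cloud/v2/metrics/cloud/query
```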

Regards

vlasov01 avatar Feb 10 '21 22:02 vlasov01

We went with option 1; it's working well and doing what we need. We have a Fargate service running in AWS which continuously runs a bash script. When the Fargate service starts up, the script reads in a text file containing the consumer group names we want to monitor. For each group it then:

1. issues a kafka-consumer-groups command, which returns the lag;
2. parses the command's output to extract the lag value;
3. publishes a custom metric into AWS CloudWatch.

Finally, we use the custom metric to auto-scale the relevant services.
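A condensed sketch of that script (the paths, CloudWatch namespace, and poll interval are illustrative placeholders, not our exact values; kafka-consumer-groups.sh is the script's name in the Apache tarball's bin directory):

```bash
#!/usr/bin/env bash
# Polling loop described above: read the group names, sum each group's lag,
# and publish it as a custom CloudWatch metric.
GROUPS_FILE=/etc/lag-monitor/consumer-groups.txt
CCLOUD_CONFIG=/etc/kafka/ccloud.properties
BOOTSTRAP=pkc-xxxxx.us-east-1.aws.confluent.cloud:9092

while true; do
  while read -r group; do
    # kafka-consumer-groups prints one row per topic-partition with LAG in
    # column 6; keep only numeric values (skips headers and "-" rows) and sum.
    lag=$(kafka-consumer-groups.sh --bootstrap-server "$BOOTSTRAP" \
            --command-config "$CCLOUD_CONFIG" \
            --describe --group "$group" \
          | awk '$6 ~ /^[0-9]+$/ { sum += $6 } END { print sum + 0 }')
    # Publish the summed lag for this group; the auto-scaling policies
    # key off this metric.
    aws cloudwatch put-metric-data \
      --namespace "Kafka/ConsumerLag" \
      --metric-name ConsumerLag \
      --dimensions ConsumerGroup="$group" \
      --value "$lag"
  done < "$GROUPS_FILE"
  sleep 60
done
```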

Note that the Docker image used to run the service is packaged with the Kafka command-line tools, so we have access to the kafka-consumer-groups command. See below:

```dockerfile
RUN curl https://downloads.apache.org/kafka/2.5.0/kafka_2.12-2.5.0.tgz --output KafkaCommandLineTools.tgz \
    && tar -xzf KafkaCommandLineTools.tgz
```

The link below shows how to configure the command-line tools to work with the Confluent cluster: https://www.confluent.io/blog/using-apache-kafka-command-line-tools-confluent-cloud/
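For completeness, the configuration from that post boils down to a small properties file passed via --command-config; a sketch with placeholder credentials:

```properties
# ccloud.properties - standard Kafka client settings for Confluent Cloud
# (the API key and secret are placeholders)
bootstrap.servers=pkc-xxxxx.us-east-1.aws.confluent.cloud:9092
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="<API_KEY>" password="<API_SECRET>";
```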

robty123 avatar Feb 11 '21 10:02 robty123