Improve performance of consumer-group routes
I have a large cluster with ~45k consumer groups. Opening the Consumer Groups tab in the frontend times out at the 25-second default that fetch waits for a response from the backend. I can see in debug logs that the GET /api/consumer-groups request for my cluster takes about 5 minutes to complete.
Moving discussion with @weeco from discord to this thread:
From @weeco -
The only place where we use websockets would be the messages explore feature.
We have several options to handle this I guess. One option could be that we submit a second request from the frontend to describe the consumer groups we see on each page
So:
1. List all consumer groups => Send consumer group names to the frontend
2. Frontend describes all consumer groups that are visible on the currently selected page
Usually in webdev one would use cursors for pagination but Kafka doesn't have a cursor for us here
Right now we are pulling about 2 MB of data just from grabbing consumer group IDs in our test cluster. The data will get bigger in higher environments, but I am not sure by how much. If we go the recommended route of pulling all consumer group IDs and storing them locally on the frontend, I would imagine browsers might not like that. @weeco any worries with that approach?
Here is what a consumer group ID endpoint looks like:
```go
func (api *API) handleListConsumerGroupIDs() http.HandlerFunc {
	return func(rw http.ResponseWriter, req *http.Request) {
		cgResp, err := api.KafkaSvc.ListConsumerGroups(req.Context())
		if err != nil {
			rest.SendRESTError(rw, req, api.Logger, &rest.Error{
				Err:      fmt.Errorf("failed to list consumer group IDs: %w", err),
				Status:   http.StatusInternalServerError,
				Message:  "failed to list consumer group IDs",
				IsSilent: false,
			})
			return
		}
		res := cgResp.GetGroupIDs()
		sort.Strings(res) // sort in place; sort.StringSlice(res) alone only converts the type and does not sort
		response := ListConsumerGroupsResponse{
			ConsumerGroupNames: res,
		}
		rest.SendResponse(rw, req, api.Logger, http.StatusOK, response)
	}
}
```
and here is what some meta about hitting that endpoint looks like:
```
➜ git:(master) wrk -t24 -d60s -c400 "http://localhost:8080/api/list-consumer-group-ids"
Running 1m test @ http://localhost:8080/api/list-consumer-group-ids
  24 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.49s   220.07ms   1.64s   100.00%
    Req/Sec     0.29      1.19     10.00    96.70%
  91 requests in 1.00m, 232.22MB read
  Socket errors: connect 0, read 548, write 0, timeout 89
Requests/sec:      1.51
Transfer/sec:      3.87MB
```
@msaggar Thanks for the details. 2-4 MB is indeed a lot, but I don't think it's a problem for browser performance yet (we handle larger data in the explore messages tab, I think). However, you are right that it doesn't scale infinitely. May I ask why you have 50k+ consumer groups? (Just curious how one ends up with that many consumer groups, tbh.)
If the consumer group names alone are already too big, there's no way around pagination that is supported by the backend. Our current pagination is fake in the sense that the frontend receives all data and just renders up to 200 items per page. I'd need to think about how we could implement real pagination, as it would make our Kowl instances somewhat stateful - Kafka itself doesn't support the concept of pagination the way databases do. That said, introducing a database as a result cache that would allow us to do pagination is one option 😆.
The 50k+ consumer groups is likely us not using kafkajs correctly. We start a new consumer for each pod in our service, and as pods come and go we end up at this number. I am looking into whether we can reduce it.
And as far as storing some state with Kowl by introducing a DB or cache, I see there is a ticket here: https://github.com/cloudhut/kowl/issues/20 though I am not sure whether any investigation has been done.
I would be willing to keep exploring two separate routes on the backend and see how that goes for now (pretty much the original plan).
- Fetch all consumer group ids when someone opens the consumer groups page
- Store these somewhere in the frontend
- Fetch metadata about consumer groups of page size N
I feel like showing only the consumer group names on that page will be very disappointing. How about this:
- Return all consumer group names to the frontend
- The frontend can request all consumer group details in bulk for each page (may require a new backend endpoint).
Do you want to proceed with coding this? (You could also focus on just the backend or frontend if you are only comfortable with one part of the stack.)
@msaggar I had a look today and figured out (together with Travis) that the Kafka client stacks lots of requests instead of opening further connections to send the many admin requests concurrently. Thus it takes longer than necessary until all consumer group offsets are fetched. The Kafka client (franz-go) will be patched so that the number of requests that can be sent concurrently per broker becomes configurable.
That would probably work for 99% of all users, but for users with more than 5k consumer groups the response size becomes problematic. Therefore we still have to involve the frontend to address this. (I strongly recommend looking into your consumer group problem though; having that many consumer groups is not really healthy.)
@weeco yeah i agree, we are investigating the kafka library we are using to see if there is any way we can reduce the number of consumer groups.
Would you still like to continue with the change discussed above?
@weeco we lowered our offset retention and now have ~2k consumer groups, so I think this should be fine
- Requires converting to kadm, which automatically batches and falls back to per-group requests if the broker does not support batching
- For Redpanda: requires Redpanda to support batch requests
The former can be done in Console