Autoscaling the read path
Is your feature request related to a problem? Please describe.
The team at Grafana Labs, and customers alike, would like to know how best to autoscale the read path to provide the best possible throughput at an optimal TCO.
Describe the solution you'd like
First we will start with a technical implementation, which will be tested on the Grafana Cloud Logs hosted service. At the moment, our initial thought is to use KEDA. We added a new metric to the query-scheduler component in https://github.com/grafana/loki/pull/5658 which should provide the best possible scaling signal for KEDA to reference.
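To make this concrete, below is a minimal sketch of what a KEDA `ScaledObject` for the queriers could look like, assuming a Prometheus trigger fed by an inflight-requests style metric from the query-scheduler. The metric name, query, namespace, threshold and Prometheus address are illustrative assumptions, not a final or recommended configuration.

```yaml
# Hypothetical sketch only: scale the querier Deployment on query-scheduler load via KEDA.
# The metric name, query, threshold and addresses below are assumptions for illustration.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: querier
  namespace: loki
spec:
  scaleTargetRef:
    kind: Deployment
    name: querier
  minReplicaCount: 3
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        # Scale out when requests are piling up in flight at the scheduler.
        query: sum(cortex_query_scheduler_inflight_requests{namespace="loki", quantile="0.75"})
        threshold: "4"
```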
We will use k6 to load-test a Loki installation to validate the KEDA and Loki configurations. We may provide this benchmarking suite once we're done if it's in a shareable state.
Once we have a solution we are satisfied with, we will publish a guide under https://grafana.com/docs/loki/latest/operations/ on how to configure autoscaling of the read path, and any considerations/tradeoffs we have found.
Describe alternatives you've considered
We considered using the native Kubernetes HPA, but we rejected this option for the following reasons:
- It can only work on pod resources (CPU, memory) right now, which is not an ideal scaling indicator in our use-case
- Utilising custom metrics (via autoscaling/v2) will only be stable as of Kubernetes v1.23 (see the HPA sketch after this list)
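For comparison, the rejected native-HPA route would look roughly like the autoscaling/v2 manifest below, and would additionally require a metrics adapter (for example prometheus-adapter) serving the custom.metrics.k8s.io / external.metrics.k8s.io APIs. The metric name and target value here are hypothetical placeholders.

```yaml
# Hypothetical sketch of the rejected alternative: a native HPA driven by an external metric.
# Requires a metrics adapter (e.g. prometheus-adapter) to expose the metric; the name below is a placeholder.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: querier
  namespace: loki
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: querier
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: loki_query_scheduler_queue_length  # placeholder metric name
        target:
          type: AverageValue
          averageValue: "4"
```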
Additional context
We will start by implementing autoscaling of the querier components using a metric from the query-scheduler component. If you are running the query-frontend only, please leave a comment below and we will consider expanding our scope to also test the query-frontend.
The scope of this work will be restricted to Kubernetes deployments only.
cc @AndreZiviani @zswanson @arvindkonar after our discussion in Slack
@dannykopping I am putting a link here to a similar issue we are experimenting with in the SIG operator: #5339
TL;DR: We are currently spiking how we can use the Horizontal Pod Autoscaler with custom.metrics.k8s.io to scale the ingestion and query paths on k8s. If successful, this could make it in as a feature in loki-operator, too.
@periklis good to know :+1
AFAIK custom.metrics.k8s.io is only available with autoscaling/v2, which is only stable as of v1.23, correct?
It was available before in the betas, but yes, if you mean stable, that's correct.
Don't get me wrong: according to my ad-hoc research, KEDA is probably the way to move forward in future (FWIW, at Red Hat we dropped support for custom metrics after OpenShift 4.8), but KEDA can be a source for the HPA too. Our experiments look at running the horizontal autoscaler for a whole path (e.g. ingestion, query) and what this means for reconciling the HPA objects properly.
We'd love to see autoscaling for the querier as well as the query-frontend, although the querier would be the higher priority. Currently, the query-scheduler isn't part of the loki-distributed Helm chart, so it would be great if we could get that support added as well.
This automatic scaling feature is great! At present we run 256 querier replicas in order to cope with sudden peaks in query traffic. Our previous idea was to host the querier workload on a function-compute (serverless) platform, but we have not had time to test and verify it. Currently, Cassandra keeps crashing.
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a stale label, sorted by thumbs-up.
We may also:
- Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a keepalive label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.
keepalive
@dannykopping Any news regarding this issue? I see it has been marked as in progress; do you know which metrics you are planning to use? I was considering using the recent queries' "time waited in queue"; what is your approach?
@Raskyld Have a look here:
- grafana/loki/pull/6801
We completed this work some time ago but forgot to close this out.
For future readers: we produced this documentation page describing how to autoscale queriers.