Autoscaling the read path
Is your feature request related to a problem? Please describe.
The team at Grafana Labs, and customers alike, would like to know how best to autoscale the read path to provide the best possible throughput at an optimal TCO.
Describe the solution you'd like
First we will start with a technical implementation, which will be tested on the Grafana Cloud Logs hosted service. At the moment, our initial thought is to use KEDA. We added a new metric to the query-scheduler component in https://github.com/grafana/loki/pull/5658 which should provide the best possible scaling signal for KEDA to reference.
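To make this concrete, below is a minimal sketch of what a KEDA `ScaledObject` for the queriers could look like, assuming a Prometheus trigger fed by an inflight-requests style metric from the query-scheduler. The metric name, query, namespace, threshold and Prometheus address are illustrative assumptions, not a final or recommended configuration.

```yaml
# Hypothetical sketch only: scale the querier Deployment on query-scheduler load via KEDA.
# The metric name, query, threshold and addresses below are assumptions for illustration.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: querier
  namespace: loki
spec:
  scaleTargetRef:
    kind: Deployment
    name: querier
  minReplicaCount: 3
  maxReplicaCount: 50
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc.cluster.local:9090
        # Scale out when requests are piling up in flight at the scheduler.
        query: sum(cortex_query_scheduler_inflight_requests{namespace="loki", quantile="0.75"})
        threshold: "4"
```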
We will use k6 to load-test a Loki installation to validate the KEDA and Loki configurations. We may provide this benchmarking suite once we're done if it's in a shareable state.
Once we have a solution we are satisfied with, we will publish a guide under https://grafana.com/docs/loki/latest/operations/ on how to configure autoscaling of the read path, and any considerations/tradeoffs we have found.
Describe alternatives you've considered
We considered using the native Kubernetes HPA, but we rejected this option for the following reasons:
- It can only work on pod resources (CPU, memory) right now, which is not an ideal scaling indicator in our use-case
- Utilising custom metrics (via autoscaling/v2) will only be stable as of Kubernetes v1.23 (see the HPA sketch after this list)
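For comparison, the rejected native-HPA route would look roughly like the autoscaling/v2 manifest below, and would additionally require a metrics adapter (for example prometheus-adapter) serving the custom.metrics.k8s.io / external.metrics.k8s.io APIs. The metric name and target value here are hypothetical placeholders.

```yaml
# Hypothetical sketch of the rejected alternative: a native HPA driven by an external metric.
# Requires a metrics adapter (e.g. prometheus-adapter) to expose the metric; the name below is a placeholder.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: querier
  namespace: loki
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: querier
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: loki_query_scheduler_queue_length  # placeholder metric name
        target:
          type: AverageValue
          averageValue: "4"
```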
Additional context
We will start by implementing autoscaling of the querier components using a metric from the query-scheduler component. If you are running the query-frontend only, please leave a comment below and we will consider expanding our scope to also test the query-frontend.
The scope of this work will be restricted to Kubernetes deployments only.
cc @AndreZiviani @zswanson @arvindkonar after our discussion in Slack
@dannykopping I am putting a link here to a similar issue we are experimenting with in the SIG operator: #5339
TL;DR: We are currently spiking how we can use the Horizontal Pod Autoscaler with custom.metrics.k8s.io to scale the ingestion and query paths on k8s. If successful, this could make it in as a feature in loki-operator, too.
@periklis good to know :+1
AFAIK custom.metrics.k8s.io is only available with autoscaling/v2, which is only stable as of v1.23, correct?
It was available before in the betas, but yes, if you mean stable, that's correct.
Don't get me wrong: according to my ad-hoc research, KEDA is probably the way to move forward in future (FWIW, at Red Hat we dropped support for custom metrics after OpenShift 4.8), but KEDA can be a source for the HPA too. Our experiments look at running the horizontal autoscaler for a whole path (e.g. ingestion, query) and what this means for reconciling the HPA objects properly.
We'd love to see autoscaling for the querier as well as the query-frontend, although the querier would be the higher priority. Currently, the query-scheduler isn't part of the loki-distributed Helm chart, so it would be great if we could get that support added as well.
This automatic scaling feature is great! At present we run 256 querier replicas in order to cope with sudden peaks in query traffic. Our previous idea was to host the querier workload on a function-compute (serverless) platform, but we have not had time to test and verify it. Currently, Cassandra keeps crashing.
Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.
We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.
Stalebots are also emotionless and cruel and can close issues which are still very relevant.
If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.
We regularly sort for closed issues which have a stale label, sorted by thumbs-up.
We may also:
- Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
- Add a keepalive label to silence the stalebot if the issue is very common/popular/important.
We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.
keepalive
@dannykopping Any news regarding this issue? I see it has been marked as in progress; do you know which metrics you are planning to use? I was considering using the recent queries' "time waited in queue"; what is your approach?
@Raskyld Have a look here:
- grafana/loki/pull/6801
We completed this work some time ago but forgot to close this out.
For future readers: we produced this documentation page describing how to autoscale queriers.