
Availability zone local replication

Open thatsmydoing opened this issue 3 years ago • 10 comments

Is your feature request related to a problem? Please describe. I'm looking to run loki in a standard EKS cluster across a few AZs. All the logs will be coming from within the cluster and I'd like to avoid the inter-zone bandwidth cost when sending logs. I don't need HA across zones: if an AZ goes down there won't be any logs from it anyway, since all the clients in that AZ will be down too.

Describe the solution you'd like I understand that loki now supports zone-aware replication, which ensures that data exists in multiple AZs. I'd like the opposite: the distributor should only forward data to ingesters in the same AZ. Queriers should still be able to query from all available AZs.

Describe alternatives you've considered I've considered just running multiple loki clusters, one per AZ, but that's a bit unwieldy since querying multiple clusters is not well supported, as described in https://github.com/grafana/loki/issues/1866

I've also thought about blocking access between distributors and ingesters in different AZs but that gets quite spammy and I'm not sure if it's even safe to do.

Additional context When I was first looking into loki, my mental model was that each component has its own "ring", so I was trying to do something like giving the distributors an ingester ring containing only the ingesters in the same AZ, and giving the queriers a different ingester ring containing all of them. Unfortunately, that's not the model that loki uses.

thatsmydoing avatar Feb 04 '22 07:02 thatsmydoing

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly review closed issues that have a stale label, sorted by thumbs-up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task; our sincere apologies if you find yourself at the mercy of the stalebot.

stale[bot] avatar Apr 17 '22 07:04 stale[bot]

Not stale

thatsmydoing avatar Apr 17 '22 07:04 thatsmydoing

We are also facing this issue; it would be good to have a design guide on how to avoid these costs.

kovaxur avatar Jun 14 '22 19:06 kovaxur

I'm new to loki, so I may have misunderstood how storage and indexing work, but I'm wondering if something like this might work:

  • Use cloud-based storage for indexes and data (e.g. boltdb-shipper, with S3 as object storage)
  • Run N replicas of loki writer, where N is the number of availability zones, with each replica in a different zone.
  • Upon startup of the loki writer, run an initContainer that figures out its zone and sets a zone label on the writer pod (see the sketch after this list).
  • Define one Kubernetes Service per zone, targeting only the loki writer pods in its zone, by selecting on the above label.
  • In promtail, also run an initContainer that figures out the zone, then adjust the client URL to the per-zone service for the zone in question.
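
For illustration, here's a minimal sketch (Go, using client-go) of what such a zone-detecting initContainer could do: read the node's topology.kubernetes.io/zone label and copy it onto the pod as a label that a per-zone Service can select on. The "zone" pod label, the NODE_NAME/POD_NAME/POD_NAMESPACE env vars (injected via the downward API) and the required RBAC are assumptions for the sketch, not something Loki or promtail provides.

```go
// Sketch of an initContainer: look up the node's availability zone and
// label the pod with it, so a per-zone Service can select on that label.
// Requires RBAC permission to get nodes and patch pods (assumption).
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// NODE_NAME, POD_NAME and POD_NAMESPACE are assumed to be injected
	// via the Kubernetes downward API.
	nodeName := os.Getenv("NODE_NAME")
	podName := os.Getenv("POD_NAME")
	namespace := os.Getenv("POD_NAMESPACE")

	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// The zone is exposed as a well-known label on the node object.
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	zone := node.Labels["topology.kubernetes.io/zone"]

	// Copy it onto the pod as a plain "zone" label (an arbitrary name)
	// so a per-zone Service can use it in its selector.
	patch := []byte(fmt.Sprintf(`{"metadata":{"labels":{"zone":%q}}}`, zone))
	_, err = client.CoreV1().Pods(namespace).Patch(ctx, podName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("labeled pod %s/%s with zone=%s", namespace, podName, zone)
}
```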

Perhaps I'm missing something in the above idea?

Personally, I would ignore the traffic from/to loki reader replicas, as I write much more than I read, but that would depend on use case.

forsberg avatar Oct 20 '22 06:10 forsberg

Can this not largely be solved by Kubernetes itself via "topology aware hints": https://kubernetes.io/docs/concepts/services-networking/topology-aware-hints/? In other words, simply setting the service.kubernetes.io/topology-aware-hints annotation to auto on the services that are used to reach loki writer pods?

janvanbesien-ngdata avatar Feb 02 '23 09:02 janvanbesien-ngdata

It's been a while so I might be mistaken, but: services can handle partitioning promtail-to-distributor communication, which is good, yet they do not address distributor-to-ingester communication. While the diagrams show that a "writer" can have a distributor and ingester running together, that does not mean the distributor favors the "local" ingester.

A distributor may send the log data to any ingester, even one in a different zone. Service discovery for ingesters is also done internally via rings, so Kubernetes services have no impact here.

thatsmydoing avatar Feb 02 '23 10:02 thatsmydoing

Just chiming in here - as we have the exact same desire. On ingestion, data has the potential to cross-AZs many times in a Loki Microservice Distributed model:

  • Via Kubernetes Service: promtail -> loki-gateway
  • Via Kubernetes Service: loki-gateway -> loki-distributor
  • Via Loki GRPC Communication over Memberlist: loki-distributor -> loki-ingester-X, loki-ingester-Y, loki-ingester-Z

We can now take care of the first two hops by running a larger number of loki-gateway and loki-distributor pods. However, the third hop is tricky because it's actually multiplied by our replication factor:

  • ~Via Kubernetes Service: promtail -> loki-gateway~
  • ~Via Kubernetes Service: loki-gateway -> loki-distributor~
  • Via Loki GRPC Communication over Memberlist: loki-distributor -> loki-ingester-X (replica stream 1)
  • Via Loki GRPC Communication over Memberlist: loki-distributor -> loki-ingester-Y (replica stream 2)
  • Via Loki GRPC Communication over Memberlist: loki-distributor -> loki-ingester-Z (replica stream 3)

We want to tell the Loki system to prioritize sending the data to Loki Ingesters within the same zone whenever possible. If there are too few ingesters to make that happen, it should still try to select ingesters within the same zone to whatever degree it can. For example: If there are 2 ingesters in ZoneA and 2 in ZoneB... then a Distributor in ZoneA should send 2 streams to ZoneA and only one stream to ZoneB.
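
To make that concrete, here is a small Go sketch of the selection rule we're asking for (fill the replica set from the local zone first, then spill over to other zones). This is not how Loki's ring currently picks ingesters; it's only an illustration of the requested policy, and the ingestersByZone map and the names in it are made up for the example.

```go
// Illustration only: pick `rf` ingesters for a push, preferring the
// distributor's local zone and spilling over to other zones only when
// the local zone has too few ingesters.
package main

import "fmt"

func pickIngesters(localZone string, ingestersByZone map[string][]string, rf int) []string {
	picked := make([]string, 0, rf)

	// 1. Take as many ingesters as possible from the local zone.
	for _, ing := range ingestersByZone[localZone] {
		if len(picked) == rf {
			return picked
		}
		picked = append(picked, ing)
	}

	// 2. Fill the remainder from the other zones.
	for zone, ingesters := range ingestersByZone {
		if zone == localZone {
			continue
		}
		for _, ing := range ingesters {
			if len(picked) == rf {
				return picked
			}
			picked = append(picked, ing)
		}
	}
	return picked
}

func main() {
	// The example above: 2 ingesters in ZoneA, 2 in ZoneB, replication factor 3.
	ingesters := map[string][]string{
		"ZoneA": {"ingester-a-0", "ingester-a-1"},
		"ZoneB": {"ingester-b-0", "ingester-b-1"},
	}
	// A distributor in ZoneA sends 2 streams to ZoneA and only 1 to ZoneB.
	fmt.Println(pickIngesters("ZoneA", ingesters, 3))
}
```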

diranged avatar Nov 14 '23 00:11 diranged

I am also facing this issue.

carlopalacio avatar Nov 14 '23 14:11 carlopalacio

Just chiming in here - as we have the exact same desire. On ingestion, data has the potential to cross-AZs many times in a Loki Microservice Distributed model:

  • Via Kubernetes Service: promtail -> loki-gateway
  • Via Kubernetes Service: loki-gateway -> loki-distributor
  • Via Loki GRPC Communication over Memberlist: loki-distributor -> loki-ingester-X, loki-ingester-Y, loki-ingester-Z

We can now take care of the first two hops by running a larger number of loki-gateway and loki-distributor pods. However, the third hop is tricky because it's actually multiplied by our replication factor:

  • ~Via Kubernetes Service: promtail -> loki-gateway~
  • ~Via Kubernetes Service: loki-gateway -> loki-distributor~
  • Via Loki GRPC Communication over Memberlist: loki-distributor -> loki-ingester-X (replica stream 1)
  • Via Loki GRPC Communication over Memberlist: loki-distributor -> loki-ingester-Y (replica stream 2)
  • Via Loki GRPC Communication over Memberlist: loki-distributor -> loki-ingester-Z (replica stream 3)

We want to tell the Loki system to prioritize sending the data to Loki Ingesters within the same zone whenever possible. If there are too few ingesters to make that happen, it should still try to select ingesters within the same zone to whatever degree it can. For example: If there are 2 ingesters in ZoneA and 2 in ZoneB... then a Distributor in ZoneA should send 2 streams to ZoneA and only one stream to ZoneB.

As described above, we're stuck because of the distributor. In our case we skipped the gateway by writing directly to the distributors.

From there we don't really see how to prevent distributors from sending traffic cross AZ.

If someone has any clues that'd be neat.

DaazKu avatar Jun 09 '25 22:06 DaazKu

One step we are taking in our non-production environments is to reduce the replication factor from 3 to 2. We just finished doing this with mimir and, as expected, saw a 30% reduction in our network traffic.

There is also a PR in flight to introduce S2 compression that could be used to reduce payload sizes: https://github.com/grafana/loki/pull/17964. This won't prevent the cross-zone traffic, but it would reduce it by a small amount.
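
As a back-of-the-envelope illustration of where that reduction comes from (assuming zone-aware replication with one replica per zone and exactly one of those zones being the distributor's own - an assumption for the example, not a guarantee):

```go
// Rough illustration of why dropping the replication factor from 3 to 2
// cuts distributor->ingester traffic by about a third. Assumes one replica
// per zone and that exactly one replica lands in the distributor's own zone.
package main

import "fmt"

func main() {
	for _, rf := range []int{3, 2} {
		total := rf         // replica streams sent per push
		crossZone := rf - 1 // streams that leave the distributor's zone
		fmt.Printf("RF=%d: %d streams per push, %d of them cross-zone\n", rf, total, crossZone)
	}
	// RF=3: 3 streams per push, 2 of them cross-zone
	// RF=2: 2 streams per push, 1 of them cross-zone
	// -> roughly a third fewer replica streams overall, in line with the ~30% observed above.
}
```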

mveitas avatar Jun 12 '25 12:06 mveitas

We want to tell the Loki system to prioritize sending the data to Loki Ingesters within the same zone whenever possible. If there are too few ingesters to make that happen, it should still try to select ingesters within the same zone to whatever degree it can. For example: If there are 2 ingesters in ZoneA and 2 in ZoneB... then a Distributor in ZoneA should send 2 streams to ZoneA and only one stream to ZoneB.

We are also looking into this issue of costly cross-AZ traffic on e.g. AWS. Not replicating chunks to two distinct zones is certainly a trade-off if the zone is actually your failure domain: a sudden knockout of a single availability zone could lose data that was already accepted from clients but not yet written to durable storage. But since this is a trade-off some people are willing to make (total AZ failures are rare enough for them), it's worth looking into the options and potential configurations:

I dove into the ring code that handles zone awareness when sending chunks to different zones, which is likely shared across Loki, Mimir and Tempo. It seems to be this section, which explains the current mechanisms and capabilities: https://github.com/grafana/dskit/blob/f05d091ab3f5ae2688d1b163cdfe00c76dcd80f4/ring/replication_set.go#L157-L219

  • One idea would be to add yet another strategy for distributing chunks and selecting ring members, one which actively avoids other zones and only fails over if there are no members in the local zone - this would implement the idea behind this very issue.

  • But looking at the existing options already shows some potential for saving on traffic:

  1. Use MinimizeRequests (PR: https://github.com/grafana/dskit/pull/306/, description: https://github.com/grafana/dskit/blob/f05d091ab3f5ae2688d1b163cdfe00c76dcd80f4/ring/replication_set.go#L185-L191). This is about optimistically trying two replication targets first and only using a third if no quorum could be reached that way. It increases latency in the happy path, but saves 33% of the bandwidth - potentially cross-zone bandwidth. This was actually made available for Mimir via https://github.com/grafana/mimir/pull/5202, but it seems it's not exposed as an option for Loki. So this feature may only be a config switch away from being available?

@mveitas this would also achieve the traffic reduction you describe in https://github.com/grafana/loki/issues/5319#issuecomment-2966591996 for the happy path, but with the potential to involve the 3rd zone if needed. BTW, I am still wondering how a replication factor of 2 works for you with regard to quorum if one zone is down ... see https://github.com/grafana/dskit/blob/f699a5a29cdc8a9bcc7844e935b477b484d08db2/ring/replication_strategy.go#L34-L41 about "X/2+1".

  2. Use the ZoneSorter (https://github.com/grafana/dskit/pull/440). This is about ordering the available zones and giving them a priority. Apart from the number of replicas and the attempt to minimize the requests (see 1), it's also important that the local zone is part of the initial set, otherwise a distributor in zone A might send chunks to B and C, not reducing cross-AZ traffic at all.

@charleskorn, the contributor of these capabilities, already wrote in PR https://github.com/grafana/dskit/pull/440 that:

"This is useful when the caller has additional knowledge about the zones and wants to prioritise using some over others. For example, the caller might have observed that some zones are faster or under less load and want to preference these over other, slower or more heavily loaded zones."

Unfortunately there seems to be no additional sorter doing a "use my own zone first" kind of sorting yet?
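
Such a sorter looks simple to sketch. The following Go snippet only assumes the shape of dskit's ZoneSorter from PR 440 (roughly a function from a zone list to a reordered zone list); whether and where Loki would wire it into the ring config is an open question, so treat it as an illustration rather than a working integration:

```go
// Sketch of a "use my own zone first" zone sorter. The function shape is
// meant to match dskit's ZoneSorter; the integration into the ring/quorum
// config is an assumption, not something Loki exposes today.
package main

import "fmt"

// localZoneFirst returns a sorter that moves localZone to the front and
// keeps the relative order of the remaining zones unchanged.
func localZoneFirst(localZone string) func(zones []string) []string {
	return func(zones []string) []string {
		sorted := make([]string, 0, len(zones))
		rest := make([]string, 0, len(zones))
		for _, z := range zones {
			if z == localZone {
				sorted = append(sorted, z)
			} else {
				rest = append(rest, z)
			}
		}
		return append(sorted, rest...)
	}
}

func main() {
	sorter := localZoneFirst("eu-west-1b")
	fmt.Println(sorter([]string{"eu-west-1a", "eu-west-1b", "eu-west-1c"}))
	// Output: [eu-west-1b eu-west-1a eu-west-1c]
}
```

Combined with request minimisation (option 1), the initial request set would then always contain the local zone, so in the happy path only one of the two initial replica streams would cross zones.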

All this is apparently just about the write path. I did not yet dive into any options to also guide requests from the query-frontend to the queriers. Maybe(!) there is no ring or zoning involved there and it's actually a Kubernetes service that is used, in which case https://github.com/grafana/loki/pull/19558 would help?

To sum up:

  • 1+2 would give us a big improvement in terms of cross-AZ traffic without reducing robustness or losing the zone failure domain.
  • Staying strictly within the local zone, even though Loki (or similarly architected services like Mimir) is deployed across AZs, needs further changes and, as said initially, comes with the trade-off of no longer having "zone" as a failure domain. It seems the replicator would need to work with a zone-sorted list of instances (with my own zone on top) and then also prefer these instances initially in this section: https://github.com/grafana/dskit/blob/f699a5a29cdc8a9bcc7844e935b477b484d08db2/ring/replication_set.go#L56-L78

frittentheke avatar Oct 24 '25 10:10 frittentheke