helm-charts
Loki-distributed doesn't work out of the box
I'm running on EKS 1.18
If I take the loki-distributed helm chart and apply it with the values.yml as written, I end up with the distributor, ingester and querier in a CrashLoopBackOff state, complaining: failed to create memberlist: Failed to get final advertise address: no private IP address found, and explicit IP not provided.
This seems to be a reasonably common error related to the memberlist. I understand that I should provide a private IP address; however, it's unclear what address I should be adding.
If I add the following to the memberlist config, things seem to get a little further:

bind_addr:
  - 127.0.0.1

All the containers at least go ready, but they eventually fail, and the ring never gets any members added to it (shown by navigating to the /ring URL of the distributor service).
127.0.0.1 is a complete guess based on trial and error, as I can't find any documentation explaining what IP address I should be using here.
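For context, here is roughly what the memberlist block of the resulting Loki config looks like with that setting (a sketch only; the join_members entry is whatever gossip headless service the chart generates for the release, shown here as an assumption):

memberlist:
  join_members:
    - loki-test-loki-distributed-memberlist  # assumed chart-generated gossip service name
  bind_port: 7946
  bind_addr:
    - 127.0.0.1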
I have also tried 172.120.0.0/16, which is the CIDR range of IP addresses available to my pods. This time, I see the ingester being added to the ring. It is even temporarily healthy, before the state goes to 'unhealthy' and everything grinds to a halt again.
Here are some logs from the ingester that may or may not be useful? By this point, the state of the instance in the ring is 'unhealthy', even though it seems to be uploading the tables somewhere? Also during this time, both the querier and the distributor are reporting err="empty ring"
level=info ts=2021-01-04T15:35:16.752555087Z caller=lifecycler.go:547 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2021-01-04T15:35:16.752736192Z caller=lifecycler.go:394 msg="auto-joining cluster after timeout" ring=ingester
level=info ts=2021-01-04T15:35:16.756522831Z caller=memberlist_client.go:461 msg="joined memberlist cluster" reached_nodes=2
ts=2021-01-04T15:35:17.753163266Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'loki-test-loki-distributed-querier-0-104bc685' from=[::]:7946"
ts=2021-01-04T15:35:18.254113836Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node loki-test-loki-distributed-querier-0-104bc685 from=127.0.0.1:54602"
ts=2021-01-04T15:35:18.254175417Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'loki-test-loki-distributed-querier-0-104bc685' from=[::]:7946"
ts=2021-01-04T15:35:18.254210598Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: EOF"
ts=2021-01-04T15:35:18.752862962Z caller=memberlist_logger.go:74 level=info msg="Suspect loki-test-loki-distributed-querier-0-104bc685 has failed, no acks received"
ts=2021-01-04T15:35:18.753670742Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f' from=[::]:7946"
ts=2021-01-04T15:35:19.254242102Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f from=127.0.0.1:54638"
ts=2021-01-04T15:35:19.254377786Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: EOF"
ts=2021-01-04T15:35:20.75323704Z caller=memberlist_logger.go:74 level=info msg="Suspect loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f has failed, no acks received"
ts=2021-01-04T15:35:21.753527161Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'loki-test-loki-distributed-querier-0-104bc685' from=[::]:7946"
ts=2021-01-04T15:35:22.254288357Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node loki-test-loki-distributed-querier-0-104bc685 from=127.0.0.1:54706"
ts=2021-01-04T15:35:22.254566363Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: EOF"
ts=2021-01-04T15:35:22.753134246Z caller=memberlist_logger.go:74 level=info msg="Marking loki-test-loki-distributed-querier-0-104bc685 as failed, suspect timeout reached (0 peer confirmations)"
ts=2021-01-04T15:35:24.752835308Z caller=memberlist_logger.go:74 level=info msg="Suspect loki-test-loki-distributed-querier-0-104bc685 has failed, no acks received"
ts=2021-01-04T15:35:24.753416282Z caller=memberlist_logger.go:74 level=info msg="Marking loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f as failed, suspect timeout reached (0 peer confirmations)"
ts=2021-01-04T15:35:24.754445635Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node 'loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f' from=[::]:7946"
ts=2021-01-04T15:35:25.255044246Z caller=memberlist_logger.go:74 level=warn msg="Got ping for unexpected node loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f from=127.0.0.1:54800"
ts=2021-01-04T15:35:25.255200831Z caller=memberlist_logger.go:74 level=error msg="Failed fallback ping: EOF"
ts=2021-01-04T15:35:28.756425213Z caller=memberlist_logger.go:74 level=info msg="Suspect loki-test-loki-distributed-distributor-bd4dbdd64-mklf2-51c9d67f has failed, no acks received"
level=info ts=2021-01-04T15:36:16.751323445Z caller=table_manager.go:171 msg="uploading tables"
I'm seeing the same thing. I think the reason is that the ring service doesn't route traffic to non-ready pods. I will try it later.
Nope, it didn't help. Any ideas?
Loki-distributed is currently very raw; I hope it will become more user-friendly.
I can't get it to run stably and am getting the same err="empty ring".
We are facing the same issues. Are you guys using a custom (vendor) CNI or the AWS bundled one?
I configured loki-distributed using the example setup in the repository on EKS last week and it works nicely. It performs surprisingly well. The cluster is running 1.19.6-eks with mostly the default configuration you get when you click "create cluster" in the AWS Console.
We are facing the same issues. Are you guys using a custom (vendor) CNI or the AWS bundled one?
We tried on GKE and AKS with the Calico addons.
For us it works with bare AWS EKS and the bundled CNI. It fails with the symptoms described here when using Cilium in overlay mode.
Loki-distributed is currently very raw; I hope it will become more user-friendly.
@Zeka13 Can you elaborate? The chart does work very well and is pretty full-featured.
The issue here is probably not related to the chart. I'd suggest reaching out on Grafana's community forum or on Grafana Slack to get help with EKS-specific issues.
Loki distributed does not work on GKE private clusters entirely. The gossip network will fail every time.
Did you ever figure out how to get this to run?
Adding the -memberlist.bind_addr=127.0.0.1 CLI flag to all components allowed them to start up. Running on GKE.
Digging into this further, I've found that the following allows full ring communication on GKE.
distributor:
  replicas: 2
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)
You will want to set this on all loki containers.
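In case it helps, here is a sketch of applying the same pattern to the other components that join the gossip ring (assuming they expose the same extraEnv/extraArgs values as the distributor; adjust component names to your values file):

# Sketch only: repeat for every component that participates in the memberlist ring.
ingester:
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)
querier:
  extraEnv:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  extraArgs:
    - -memberlist.bind-addr=$(MY_POD_IP)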
Hi guys, I am also facing a similar issue in my Loki microservices setup, deployed on an AWS EKS cluster v1.20. Functionality-wise it is working fine, so I am not sure why we are getting this error in the Loki distributor logs.
Please suggest whether we can safely ignore it or need to look into it. I have even set resource limits and requests for the distributor container and still see the errors in the logs.
Please help!!
For anyone stumbling into this problem, I suggest you take a look at this issue on the Thanos project, which explains what was happening in my case. I was deploying into a private k8s cluster on Azure and had incorrectly configured my IP address range, which was then being filtered out because it was not in the list below. If you are deploying Loki (or even Tempo) in distributed mode in a private cluster using memberlist, make sure your subnets use valid private IP addresses from these ranges (for example, a pod CIDR such as 10.244.0.0/16 falls inside 10.0.0.0/8 and is detected automatically, while a range outside these blocks is ignored):
10.0.0.0/8
100.64.0.0/10
172.16.0.0/12
192.88.99.0/24
192.168.0.0/16
198.18.0.0/15
Hi, please share if anyone has a solution for this.
@unguiculus yes, I can elaborate. As you can see, many people have problems even getting started with this chart, and this very ticket is still open.
Personally, I don't have the resources to put into the Loki charts right now, so I will not follow your advice about contacting the Grafana community; I simply will not use these broken charts.
We use this chart in production. It does work; you just need to tell Loki which addresses to bind to. The solutions are in this issue.
Shouldn't this be the default for all the components of Tempo, Loki and Mimir, since they are very likely to suffer from the same issue? I was thinking it could be part of the template, maybe behind a flag so it can be disabled if required, with a reference to this issue or the Thanos issue and the list of valid CIDRs (a rough sketch of the idea follows the snippet below).
extraEnv:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
extraArgs:
  - -memberlist.bind-addr=$(MY_POD_IP)
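For what it's worth, a purely hypothetical sketch of how a chart template could wire this up behind a values flag (neither memberlist.bindPodIP nor this fragment exists in the charts today; it only illustrates the idea):

{{- if .Values.memberlist.bindPodIP }}
# Injected into each component's container spec when the flag is enabled.
env:
  - name: MY_POD_IP
    valueFrom:
      fieldRef:
        fieldPath: status.podIP
args:
  - -memberlist.bind-addr=$(MY_POD_IP)
{{- end }}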
Thank you, binding memberlist to the pod IP solved the problem for me.
Thanks! That helped me.
Helm chart loki-distributed version 0.74.6 doesn't seem to need this workaround; in fact, with it in place it fails to start, complaining that the address is already bound. I think we might be able to close this issue now.
I am using Azure CNI Overlay and the pod-IP bind-addr workaround above worked for me.
You may run into this issue if you try to deploy with this method using the latest charts (see https://github.com/grafana/loki/issues/10797). This needs to be updated in your values:

structuredConfig:
  memberlist:
    bind_addr: []
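For anyone unsure where that goes: in the loki-distributed values this block typically sits under the loki key (a sketch, assuming your chart version exposes loki.structuredConfig):

loki:
  structuredConfig:
    memberlist:
      bind_addr: []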
Had the same problem with an EKS cluster. It had to do with what @jmadureira said: the Service IPv4 range had to be changed.