
[loki-distributed] not ready: number of queriers connected to query-frontend is 0

kladiv opened this issue 2 years ago · 15 comments

Hello, concerning issue https://github.com/grafana/helm-charts/issues/2028, I still get the error below when queryFrontend.replicas is 2:

level=info ts=2023-03-27T09:12:12.813260784Z caller=module_service.go:82 msg=initialising module=cache-generation-loader
level=info ts=2023-03-27T09:12:12.813268147Z caller=module_service.go:82 msg=initialising module=server
level=info ts=2023-03-27T09:12:12.813276593Z caller=module_service.go:82 msg=initialising module=usage-report
level=info ts=2023-03-27T09:12:12.813280821Z caller=module_service.go:82 msg=initialising module=runtime-config
level=info ts=2023-03-27T09:12:12.813430544Z caller=module_service.go:82 msg=initialising module=query-frontend-tripperware
level=info ts=2023-03-27T09:12:12.813443759Z caller=module_service.go:82 msg=initialising module=query-frontend
level=info ts=2023-03-27T09:12:12.813500386Z caller=loki.go:461 msg="Loki started"
level=info ts=2023-03-27T09:12:42.279652042Z caller=frontend.go:342 msg="not ready: number of queriers connected to query-frontend is 0"
level=error ts=2023-03-27T09:12:46.708369563Z caller=reporter.go:203 msg="failed to delete corrupted cluster seed file, deleting it" err="BadRequest: Invalid token.\n\tstatus code: 400, request id: txbc43b8e6a0384bc79bcdb-0064215e0e, host id: txbc43b8e6a0384bc79bcdb-0064215e0e"
level=info ts=2023-03-27T09:12:52.279587151Z caller=frontend.go:342 msg="not ready: number of queriers connected to query-frontend is 0"
level=info ts=2023-03-27T09:13:02.280118627Z caller=frontend.go:342 msg="not ready: number of queriers connected to query-frontend is 0"
level=info ts=2023-03-27T09:13:12.280012774Z caller=frontend.go:342 msg="not ready: number of queriers connected to query-frontend is 0"
level=info ts=2023-03-27T09:13:22.280004704Z caller=frontend.go:342 msg="not ready: number of queriers connected to query-frontend is 0"

I checked, and the headless services appear to be present:

 $ kubectl -n logging-new get svc
NAME                                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
loki-new-loki-distributed-compactor                 ClusterIP   10.43.20.194    <none>        3100/TCP                     19m
loki-new-loki-distributed-distributor               ClusterIP   10.43.167.185   <none>        3100/TCP,9095/TCP            18m
loki-new-loki-distributed-gateway                   ClusterIP   10.43.123.201   <none>        80/TCP                       18m
loki-new-loki-distributed-ingester                  ClusterIP   10.43.53.24     <none>        3100/TCP,9095/TCP            19m
loki-new-loki-distributed-ingester-headless         ClusterIP   None            <none>        3100/TCP,9095/TCP            19m
loki-new-loki-distributed-memberlist                ClusterIP   None            <none>        7946/TCP                     19m
loki-new-loki-distributed-querier                   ClusterIP   10.43.61.180    <none>        3100/TCP,9095/TCP            18m
loki-new-loki-distributed-querier-headless          ClusterIP   None            <none>        3100/TCP,9095/TCP            19m
loki-new-loki-distributed-query-frontend            ClusterIP   10.43.175.74    <none>        3100/TCP,9095/TCP,9096/TCP   18m
loki-new-loki-distributed-query-frontend-headless   ClusterIP   None            <none>        3100/TCP,9095/TCP,9096/TCP   19m

Helm chart version is 0.69.9

Why am I still getting this?

Could it be caused by the setting below in values.yaml, which perhaps should point to the headless endpoint instead? https://github.com/grafana/helm-charts/blob/main/charts/loki-distributed/values.yaml#L181
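
For reference, a minimal sketch of what the rendered frontend_worker section would look like if that setting pointed at the headless Service instead (the service name, namespace and gRPC port are taken from my listing above; adjust them to your release):

  frontend_worker:
    frontend_address: loki-new-loki-distributed-query-frontend-headless.logging-new.svc.cluster.local:9095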

Thank you

kladiv avatar Mar 27 '23 09:03 kladiv

In my case, when the querier starts, it opens a connection to the query frontend.

However, since the querier reaches the query frontend through its Service, a 1:1 mapping may not be possible when there are multiple query frontends.

So if you increase the number of queriers or decrease the number of query frontends, the queriers and the query frontend do seem to get connected.

e.g. query-frontend: "1", querier: "3"
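
In chart values that ratio would be roughly the following (queryFrontend.replicas is the key from the original report; I am assuming querier.replicas is the matching key for the querier, and the numbers are only meant to show the ratio, not a recommendation):

  queryFrontend:
    replicas: 1
  querier:
    replicas: 3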

heesuk-ahn avatar Mar 28 '23 06:03 heesuk-ahn

I don't think it's related to the replica ratio. I suspect it's related to this model: https://grafana.com/docs/loki/latest/configuration/query-frontend/#grpc-mode-pull-model

kladiv avatar Mar 28 '23 21:03 kladiv

I'm also having trouble with this. It seems that publishNotReadyAddresses is missing from the querier headless Service. Could this matter?
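
To illustrate, a minimal sketch of the querier headless Service with that field set (name, namespace and ports come from the listing earlier in the thread; the selector is only a placeholder, the chart's real labels may differ):

  apiVersion: v1
  kind: Service
  metadata:
    name: loki-new-loki-distributed-querier-headless
    namespace: logging-new
  spec:
    clusterIP: None
    publishNotReadyAddresses: true  # publish endpoints even for pods that are not yet Ready
    ports:
      - name: http
        port: 3100
      - name: grpc
        port: 9095
    selector:
      app.kubernetes.io/component: querier  # placeholder; use the chart's actual labels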

kworkbee avatar Mar 29 '23 09:03 kworkbee

Same here, with 2 querier and 2 query-frontend pods. As @kladiv mentioned, I changed https://github.com/grafana/helm-charts/blob/main/charts/loki-distributed/values.yaml#L181 to the headless-service and then it worked. Not sure if that's the solution though, or if side-effects can be expected.

sjentzsch avatar Mar 29 '23 23:03 sjentzsch

In my case, switching to the headless service does not work properly either. The querier has four replicas and the query frontend has two, each with autoscaling enabled, and the result is a CrashLoopBackOff (distributor / ingester / querier / queryFrontend).

kworkbee avatar Mar 30 '23 01:03 kworkbee

I'm also facing this error.

If I disable the queryScheduler it works fine
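
For anyone wanting to try the same, I believe this is the relevant chart value (assuming a queryScheduler.enabled flag in loki-distributed; check your chart version):

  queryScheduler:
    enabled: false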

rotarur avatar Apr 18 '23 09:04 rotarur

We're seeing this as well..

diranged avatar Apr 19 '23 17:04 diranged

Please check the latest release. The frontend address was adjusted in loki-distributed-0.69.13: https://github.com/grafana/helm-charts/commit/3829417e0d113d24ea82ff9f0c6c631d20f95822. I no longer see this issue with the helm-loki-5.2.0 chart.

LukaszRacon avatar Apr 20 '23 14:04 LukaszRacon

We also deployed the 5.2.0 Helm chart to some of our environments today and the issue appears to be resolved. :+1:

9numbernine9 avatar Apr 21 '23 16:04 9numbernine9

I encountered the same issue in the mimir-distributed helm chart and resolved it by configuring the frontend_worker.scheduler_address parameter. More info here: https://grafana.com/docs/mimir/latest/references/configuration-parameters/#frontend_worker
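
Roughly the config fragment I mean (the address is a placeholder; point it at your query-scheduler service and its gRPC port):

  frontend_worker:
    scheduler_address: <query-scheduler-service>.<namespace>.svc:9095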

dorkamotorka avatar Jun 11 '23 17:06 dorkamotorka

Using S3 all the way solved this for me. Using the filesystem store with loki-distributed caused some weird problems.

sojjan1337 avatar Jan 17 '24 13:01 sojjan1337

I have the same error with the grafana/loki (simple scalable) deployment, Helm chart version 5.2.0. It deploys 3 loki-read pods and only one gives that error; the other two are happy.

Edit: After restarting the failing pod, it becomes healthy.

snk-actian avatar Jan 18 '24 20:01 snk-actian

I am having this same issue. My environment is running on istio with mutual tls enabled. If I disable mutual tls everything works fine.

❯ helm ls -n loki
NAME    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
loki    loki            2               2024-03-21 15:57:58.90046894 -0400 EDT  deployed        loki-distributed-0.78.3 2.9.4

ghost avatar Mar 22 '24 12:03 ghost

I am having the same issue with helm chart "[email protected]"

prasadrajesh avatar Oct 18 '24 08:10 prasadrajesh

I sometimes have this issue with loki-read Pods. They endlessly report something like:

not ready: number of schedulers this worker is connected to is 0

When I delete the pods and new ones get created, they get up and running in no time.

I think the Pods should either be made more self-healing or be restarted after some timeout.
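
As a rough illustration of the restart-after-timeout idea: a liveness probe against Loki's /ready endpoint would force a restart when a pod stays not-ready, assuming your deployment method lets you add one (this is only a sketch, not something the chart ships by default):

  livenessProbe:
    httpGet:
      path: /ready
      port: 3100
    initialDelaySeconds: 300  # give the worker time to connect to a scheduler first
    periodSeconds: 30
    failureThreshold: 10      # restart the pod if it keeps reporting not-ready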

Jeansen avatar Apr 10 '25 18:04 Jeansen

Same problem here.

I use loki-distributed version 0.80.5 and loki-query-frontend works without problems when it starts, but after a few days it starts restarting continuously with the error not ready: number of queriers connected to query-frontend is 0, even though no other config has changed.

If I delete the pod and a new one is created the error is solved until a few days pass and the scenario repeats again.

Has anyone found anything new? I also use Istio, could it be related?

Thanks!

amanda-fernandez avatar Jun 25 '25 07:06 amanda-fernandez

Also started encountering this in 0.80.5.

Scaled down to one query-frontend for now. Do frontends not have some form of memberlist configuration? How do the frontends load balance between each other?

SB-MFJ avatar Jul 17 '25 14:07 SB-MFJ