
[loki-simple-scalable] Helm Chart and Grafana break after updating to 1.4.3

Open joe-alford opened this issue 3 years ago • 7 comments

Hi all,

I had Loki deployed using the loki-simple-scalable 0.4.0 Helm chart, and it was working with Grafana just fine. I've recently tried to update my dev environment from 0.4.0 to 1.4.3 and am running into issues. I assume there is a breaking change somewhere in such a big version jump, but I can't see any changelogs or release notes suggesting what this might be (am I being blind?).

Since going to 1.4.3 I get either "502: bad gateway" or "Loki: Internal Server Error. 500. too many unhealthy instances in the ring" from Grafana.

I fixed the 502 by changing the Grafana datasource URL from loki-loki-simple-scalable-gateway.loki.svc.cluster.local to loki-gateway.loki.svc.cluster.local, and the memberlist address from loki-loki-simple-scalable-memberlist.loki.svc.cluster.local to loki-memberlist.loki.svc.cluster.local (as well as updating anything else that used loki-loki-simple-scalable-* to just loki-*).

Since doing that, I am now getting the 500 "too many unhealthy instances" message. Is there something obvious I should be changing? I'm guessing something changed between 0.4.0 and 1.0.0, but I don't see any meaningful updates in the README.
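
One way to hunt for the breaking change is to diff the chart's default values between the two versions (just a sketch; the grafana repo alias is an assumption, since our chart actually comes from an internal HelmRepository mirror):

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm show values grafana/loki-simple-scalable --version 0.4.0 > values-0.4.0.yaml
helm show values grafana/loki-simple-scalable --version 1.4.3 > values-1.4.3.yaml
diff -u values-0.4.0.yaml values-1.4.3.yaml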

Below is our relevant Helm Release config and log output:

Grafana (included for reference):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: grafana
  namespace: grafana
  labels: 
    kustomize.toolkit.fluxcd.io/substitute: disabled # to stop it expanding "$${__value.raw}" in the Loki config
spec:
  values:
    datasources:
      datasources.yaml:
          - name: Loki
            type: loki
            uid: Loki
            url: http://loki-gateway.loki.svc.cluster.local:3100 
            isDefault: false
            jsonData:
              maxLines: 1000
              derivedFields:
                - datasourceUid: Tempo
                  matcherRegex: "TraceId:(.+?),"
                  name: TraceID
                  url: "$${__value.raw}"

And here is Loki:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: loki
  namespace: loki
spec:
  chart:
    spec:
      chart: loki-simple-scalable
      version: 1.4.3
      sourceRef:
        kind: HelmRepository
        name: <redacted>-helm-repo
        namespace: flux-system
  interval: 1m
  values:
    serviceMonitor:
      enabled: true
    gateway:
      image:
        registry: docker.internal.<redacted>.net
        repository: nginxinc/nginx-unprivileged
      service:
        port: 3100
      nginxConfig:
        serverSnippet: |
          location ~ /loki/api/v1/alerts.* {
            proxy_pass       http://loki-read.loki.svc.cluster.local:3100$request_uri;
          }

          location ~ /prometheus/api/v1/rules.* {
            proxy_pass       http://loki-read.loki.svc.cluster.local:3100$request_uri;
          }
        httpSnippet: |
          client_max_body_size 0;
    write:
      repository: docker.internal.<redacted>.net/grafana/loki
      replicas: 1
      resources:
        limits:
          memory: "4Gi"
      persistence:
        size: 10Gi
        storageClass: gp3 # this is the default, but calling it out explicitly so it can be overridden for dev
    read:
      replicas: 3
      persistence:
        size: 10Gi
        storageClass: gp3
      extraVolumeMounts:
        - name: loki-rules
          mountPath: /rules/fake
        - name: loki-rules-tmp
          mountPath: /tmp/scratch
        - name: loki-tmp
          mountPath: /tmp/loki-tmp
      extraVolumes:
        - name: loki-rules
          configMap:
            name: loki-alerting-rules
        - name: loki-rules-tmp
          emptyDir: {}
        - name: loki-tmp
          emptyDir: {}    

    loki: 
      image:
        registry: docker.internal.<redacted>.net
        repository: grafana/loki
        tag: 2.5.0
      structuredConfig:
        memberlist:
          join_members:
            - loki-memberlist.loki.svc.cluster.local
        auth_enabled: false
        server:
          http_listen_port: 3100
          log_level: info
          grpc_server_max_recv_msg_size: 104857600
          grpc_server_max_send_msg_size: 104857600
        schema_config: 
          configs:
          - from: "2020-11-04"
            store: boltdb-shipper
            object_store: aws
            schema: v11
            index:
              prefix: index_
              period: 24h
        storage_config: 
          boltdb_shipper:
            active_index_directory: /var/loki/index
            cache_location: /var/loki/boltdb-cache
            shared_store: s3
        ruler:
          storage:
            type: local
            local:
              directory: /rules
          rule_path: /tmp/scratch
          enable_api: true
          alertmanager_url: kube-prometheues-stack-kub-alertmanager-0.kube-prometheus-stack.svc.cluster.local, kube-prometheues-stack-kub-alertmanager-1.kube-prometheus-stack.svc.cluster.local, kube-prometheues-stack-kub-alertmanager-2.kube-prometheus-stack.svc.cluster.local
        limits_config:
          enforce_metric_name: false
          reject_old_samples: true
          reject_old_samples_max_age: 168h
          ingestion_rate_mb: 30
          ingestion_burst_size_mb: 16
          retention_period: 336h
          max_query_lookback: 336h
          max_streams_per_user: 0
          max_global_streams_per_user: 0
        compactor:
          working_directory: /var/loki/boltdb-shipper-compactor
          shared_store: filesystem
          retention_enabled: true
        chunk_store_config:
          chunk_cache_config:
            enable_fifocache: true
            fifocache:
              max_size_bytes: 500MB
        query_range:
          results_cache:
            cache:
              enable_fifocache: true
              fifocache:
                max_size_bytes: 500MB
        analytics:
          reporting_enabled: false
        ingester:
          max_chunk_age: 1h
          chunk_encoding: snappy

When I spin these up together and hit "Test" on the datasource in Grafana, I get: Loki: Internal Server Error. 500. too many unhealthy instances in the ring.
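
A quick way to see which members Loki actually considers unhealthy is to query the ring status pages on one of the components directly (a sketch, assuming the default HTTP port 3100 from the config above; the exact endpoints can vary slightly between Loki versions):

# in one terminal
kubectl -n loki port-forward svc/loki-write 3100:3100

# in another terminal
curl -s http://localhost:3100/ring        # ingester ring members and their state
curl -s http://localhost:3100/memberlist  # memberlist's view of the cluster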

Here are the Loki pod logs:

kubectl get pods -n loki
NAME                            READY   STATUS    RESTARTS   AGE
loki-gateway-5d585556bc-tl2wb   1/1     Running   0          18m
loki-read-0                     1/1     Running   0          22m
loki-write-0                    1/1     Running   0          22

Gateway:

kubectl logs -n loki loki-gateway-5d585556bc-tl2wb
/docker-entrypoint.sh: No files found in /docker-entrypoint.d/, skipping configuration
10.244.0.1 - - [27/Jun/2022:14:45:14 +0000]  200 "GET / HTTP/1.1" 2 "-" "kube-probe/1.22" "-"

10.244.0.1 - - [27/Jun/2022:14:49:44 +0000]  200 "GET / HTTP/1.1" 2 "-" "kube-probe/1.22" "-"
2022/06/27 14:49:52 [error] 13#13: *29 open() "/etc/nginx/html/api/v1/status/buildinfo" failed (2: No such file or directory), client: 10.244.0.47, server: , request: "GET /api/v1/status/buildinfo HTTP/1.1", host: "loki-gateway.loki.svc.cluster.local:3100"
10.244.0.47 - - [27/Jun/2022:14:49:52 +0000]  404 "GET /api/v1/status/buildinfo HTTP/1.1" 154 "-" "Grafana/8.5.0" "10.244.0.14, 10.244.0.14"
10.244.0.47 - - [27/Jun/2022:14:49:52 +0000]  200 "GET /prometheus/api/v1/rules HTTP/1.1" 2882 "-" "Grafana/8.5.0" "-"
10.244.0.47 - - [27/Jun/2022:14:49:52 +0000]  400 "GET /api/prom/rules/test/test HTTP/1.1" 45 "-" "Grafana/8.5.0" "-"
10.244.0.47 - - [27/Jun/2022:14:49:52 +0000]  200 "GET /prometheus/api/v1/rules HTTP/1.1" 2882 "-" "Grafana/8.5.0" "-"
10.244.0.1 - - [27/Jun/2022:14:49:54 +0000]  200 "GET / HTTP/1.1" 2 "-" "kube-probe/1.22" "-"
10.244.0.47 - - [27/Jun/2022:14:49:58 +0000]  500 "GET /loki/api/v1/label?start=1656340798905000000 HTTP/1.1" 41 "-" "Grafana/8.5.0" "10.244.0.14, 10.244.0.14"

All of these lines are repeated several times.

Write:

 kubectl logs -n loki loki-write-0
level=info ts=2022-06-27T14:39:29.420766Z caller=main.go:106 msg="Starting Loki" version="(version=2.5.0, branch=HEAD, revision=2d9d0ee23)"
level=info ts=2022-06-27T14:39:29.4210416Z caller=server.go:260 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=warn ts=2022-06-27T14:39:29.4212857Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=warn ts=2022-06-27T14:39:29.421421Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=info ts=2022-06-27T14:39:29.421723Z caller=shipper_index_client.go:111 msg="starting boltdb shipper in 2 mode"
level=info ts=2022-06-27T14:39:29.4217657Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:39:29.4237756Z caller=memberlist_client.go:394 msg="Using memberlist cluster node name" name=loki-write-0-fe8583fa
level=info ts=2022-06-27T14:39:29.4272028Z caller=module_service.go:64 msg=initialising module=store
level=info ts=2022-06-27T14:39:29.4273979Z caller=module_service.go:64 msg=initialising module=server
level=info ts=2022-06-27T14:39:29.4274158Z caller=module_service.go:64 msg=initialising module=memberlist-kv
level=info ts=2022-06-27T14:39:29.427578Z caller=module_service.go:64 msg=initialising module=ring
level=info ts=2022-06-27T14:39:29.4276925Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:29.4277553Z caller=module_service.go:64 msg=initialising module=ingester
level=info ts=2022-06-27T14:39:29.4277866Z caller=ingester.go:398 msg="recovering from checkpoint"
level=info ts=2022-06-27T14:39:29.4278962Z caller=module_service.go:64 msg=initialising module=distributor
level=info ts=2022-06-27T14:39:29.4279556Z caller=ingester.go:414 msg="recovered WAL checkpoint recovery finished" elapsed=174.5µs errors=false
level=info ts=2022-06-27T14:39:29.4279841Z caller=ingester.go:420 msg="recovering from WAL"
level=info ts=2022-06-27T14:39:29.4280047Z caller=lifecycler.go:546 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2022-06-27T14:39:29.4281001Z caller=lifecycler.go:575 msg="instance not found in ring, adding with no tokens" ring=distributor
level=info ts=2022-06-27T14:39:29.4281252Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:29.4281094Z caller=ingester.go:436 msg="WAL segment recovery finished" elapsed=328.3µs errors=false
level=info ts=2022-06-27T14:39:29.4282506Z caller=ingester.go:384 msg="closing recoverer"
ts=2022-06-27T14:39:29.4282719Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:29.4283205Z caller=ingester.go:392 msg="WAL recovery finished" time=538.8µs
level=info ts=2022-06-27T14:39:29.428355Z caller=lifecycler.go:422 msg="auto-joining cluster after timeout" ring=distributor
level=info ts=2022-06-27T14:39:29.4283883Z caller=lifecycler.go:546 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2022-06-27T14:39:29.428426Z caller=loki.go:372 msg="Loki started"
level=info ts=2022-06-27T14:39:29.4284394Z caller=wal.go:156 msg=started component=wal
level=info ts=2022-06-27T14:39:29.4284446Z caller=lifecycler.go:575 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2022-06-27T14:39:29.4285516Z caller=lifecycler.go:422 msg="auto-joining cluster after timeout" ring=ingester
ts=2022-06-27T14:39:30.548897Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:39:33.4948198Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:39:40.1071028Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:39:50.9613016Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:40:12.0220944Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:40:29.4218763Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:41:08.5736488Z caller=memberlist_client.go:542 msg="joined memberlist cluster" reached_nodes=2
level=info ts=2022-06-27T14:41:29.4220935Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:42:29.4228876Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:43:29.4219004Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:44:29.422408Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:44:29.428556Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2022-06-27T14:44:29.4297238Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/var/loki/wal/checkpoint.000004
level=info ts=2022-06-27T14:44:29.4342957Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/var/loki/wal/checkpoint.000004.tmp new=/var/loki/wal/checkpoint.000004
level=info ts=2022-06-27T14:45:29.4218623Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:46:29.4225586Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:47:29.4227801Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:48:29.4219661Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:49:29.4224932Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:49:29.4291595Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2022-06-27T14:49:29.4292888Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/var/loki/wal/checkpoint.000005
level=info ts=2022-06-27T14:49:29.4346005Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/var/loki/wal/checkpoint.000005.tmp new=/var/loki/wal/checkpoint.000005
level=info ts=2022-06-27T14:50:29.4218727Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:51:29.4219922Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:52:29.4219176Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:53:29.4221026Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:54:29.4223405Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:54:29.4284884Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2022-06-27T14:54:29.428659Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/var/loki/wal/checkpoint.000006
level=info ts=2022-06-27T14:54:29.4328397Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/var/loki/wal/checkpoint.000006.tmp new=/var/loki/wal/checkpoint.000006
level=info ts=2022-06-27T14:55:29.4219931Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:56:29.422355Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:57:29.4219955Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:58:29.4226244Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:59:29.4218699Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:59:29.428517Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2022-06-27T14:59:29.4286635Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/var/loki/wal/checkpoint.000007
level=info ts=2022-06-27T14:59:29.4333984Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/var/loki/wal/checkpoint.000007.tmp new=/var/loki/wal/checkpoint.000007
level=info ts=2022-06-27T15:00:29.4225507Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T15:01:29.4224497Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T15:02:29.4222936Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T15:03:29.4219203Z caller=table_manager.go:169 msg="uploading tables"
Interestingly, the memberlist service does resolve once I exec into the write pod (and the logs above show it eventually joins the memberlist cluster), so the "no such host" warnings look like a startup race:

kubectl exec -it -n loki loki-write-0 -- sh
/ $ nslookup loki-memberlist.loki.svc.cluster.local
Server:         10.96.0.10
Address:        10.96.0.10:53


Name:   loki-memberlist.loki.svc.cluster.local
Address: 10.244.0.44
Name:   loki-memberlist.loki.svc.cluster.local
Address: 10.244.0.45

/ $

Read:

kubectl logs -n loki loki-read-0
level=info ts=2022-06-27T14:39:28.7625301Z caller=main.go:106 msg="Starting Loki" version="(version=2.5.0, branch=HEAD, revision=2d9d0ee23)"
level=info ts=2022-06-27T14:39:28.7628752Z caller=server.go:260 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2022-06-27T14:39:28.7639932Z caller=memberlist_client.go:394 msg="Using memberlist cluster node name" name=loki-read-0-8e2ba528
level=warn ts=2022-06-27T14:39:28.7649674Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=info ts=2022-06-27T14:39:28.7652472Z caller=shipper_index_client.go:111 msg="starting boltdb shipper in 1 mode"
level=info ts=2022-06-27T14:39:28.7662616Z caller=worker.go:112 msg="Starting querier worker using query-scheduler and scheduler ring for addresses"
level=info ts=2022-06-27T14:39:28.7674493Z caller=mapper.go:46 msg="cleaning up mapped rules directory" path=/tmp/scratch
ts=2022-06-27T14:39:28.7689852Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:28.7699166Z caller=module_service.go:64 msg=initialising module=memberlist-kv
level=info ts=2022-06-27T14:39:28.769929Z caller=module_service.go:64 msg=initialising module=server
level=info ts=2022-06-27T14:39:28.7701931Z caller=module_service.go:64 msg=initialising module=compactor
level=info ts=2022-06-27T14:39:28.7702131Z caller=module_service.go:64 msg=initialising module=query-frontend-tripperware
level=info ts=2022-06-27T14:39:28.770214Z caller=module_service.go:64 msg=initialising module=ring
level=info ts=2022-06-27T14:39:28.7702001Z caller=module_service.go:64 msg=initialising module=query-scheduler
level=info ts=2022-06-27T14:39:28.7702876Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:28.7703295Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:28.7703453Z caller=basic_lifecycler.go:260 msg="instance not found in the ring" instance=loki-read-0 ring=compactor
level=info ts=2022-06-27T14:39:28.7703546Z caller=basic_lifecycler.go:260 msg="instance not found in the ring" instance=loki-read-0 ring=scheduler
level=info ts=2022-06-27T14:39:28.7703634Z caller=basic_lifecycler_delegates.go:63 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2022-06-27T14:39:28.7703672Z caller=basic_lifecycler_delegates.go:63 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2022-06-27T14:39:28.7706609Z caller=compactor.go:264 msg="waiting until compactor is JOINING in the ring"
level=info ts=2022-06-27T14:39:28.7706828Z caller=compactor.go:268 msg="compactor is JOINING in the ring"
level=info ts=2022-06-27T14:39:28.7706944Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:28.7707242Z caller=scheduler.go:610 msg="waiting until scheduler is JOINING in the ring"
level=info ts=2022-06-27T14:39:28.7707408Z caller=module_service.go:64 msg=initialising module=ingester-querier
level=info ts=2022-06-27T14:39:28.7707431Z caller=scheduler.go:614 msg="scheduler is JOINING in the ring"
level=info ts=2022-06-27T14:39:28.770811Z caller=module_service.go:64 msg=initialising module=store
level=info ts=2022-06-27T14:39:28.7708208Z caller=module_service.go:64 msg=initialising module=ruler
level=info ts=2022-06-27T14:39:28.7708341Z caller=ruler.go:450 msg="ruler up and running"
level=info ts=2022-06-27T14:39:28.7720462Z caller=mapper.go:154 msg="updating rule file" file=/tmp/scratch/fake/loki.yaml
level=info ts=2022-06-27T14:39:28.772156Z caller=mapper.go:154 msg="updating rule file" file=/tmp/scratch/fake/srs.yaml
level=info ts=2022-06-27T14:39:29.7714738Z caller=scheduler.go:624 msg="waiting until scheduler is ACTIVE in the ring"
level=info ts=2022-06-27T14:39:29.7715699Z caller=compactor.go:278 msg="waiting until compactor is ACTIVE in the ring"
level=info ts=2022-06-27T14:39:29.7716054Z caller=compactor.go:282 msg="compactor is ACTIVE in the ring"
level=info ts=2022-06-27T14:39:29.8847963Z caller=scheduler.go:628 msg="scheduler is ACTIVE in the ring"
level=info ts=2022-06-27T14:39:29.8848902Z caller=module_service.go:64 msg=initialising module=querier
level=info ts=2022-06-27T14:39:29.8849187Z caller=module_service.go:64 msg=initialising module=query-frontend
level=info ts=2022-06-27T14:39:29.8850374Z caller=loki.go:372 msg="Loki started"
ts=2022-06-27T14:39:30.1704537Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:39:32.1798807Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:32.8853412Z caller=scheduler.go:661 msg="this scheduler is in the ReplicationSet, will now accept requests."
level=info ts=2022-06-27T14:39:32.8853813Z caller=worker.go:209 msg="adding connection" addr=10.244.0.44:9095
level=info ts=2022-06-27T14:39:34.7725692Z caller=compactor.go:324 msg="this instance has been chosen to run the compactor, starting compactor"
level=info ts=2022-06-27T14:39:34.7726541Z caller=compactor.go:351 msg="waiting 10m0s for ring to stay stable and previous compactions to finish before starting compactor"
ts=2022-06-27T14:39:38.8869391Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:39.8852869Z caller=frontend_scheduler_worker.go:101 msg="adding connection to scheduler" addr=10.244.0.44:9095
ts=2022-06-27T14:39:47.2264647Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:49.3050763Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))" query_type=metric range_type=instant length=0s step=0s duration=152.4µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:39:49.3051379Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceMosScoreLessThan4\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))\n" err="empty ring"
level=info ts=2022-06-27T14:39:49.3052792Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_PacketLossPercentage>0[10s]))" query_type=metric range_type=instant length=0s step=0s duration=49.5µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:39:49.3053241Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServicePacketLossPercentageGreaterThan0\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_PacketLossPercentage>0[10s]))\n" err="empty ring"
level=info ts=2022-06-27T14:39:49.3055688Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(count_over_time({app=\"acp-service-srs\"} |= \"Error\" != \"Warning\"[1m]) > 1)" query_type=metric range_type=instant length=0s step=0s duration=133.1µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:39:49.3056162Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceErrorsAndWarningsCount\nexpr: (count_over_time({app=\"acp-service-srs\"} |= \"Error\" != \"Warning\"[1m]) > 1)\nfor: 10s\n" err="empty ring"
ts=2022-06-27T14:40:03.3543264Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:40:21.7312888Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(count_over_time({app=~\"prometheus\"}[1m]) > 0)" query_type=metric range_type=instant length=0s step=0s duration=125.499µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:21.7313421Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: LokiAlwaysFailingTest\nexpr: (count_over_time({app=~\"prometheus\"}[1m]) > 0)\nlabels:\n  severity: critical\nannotations:\n  message: |\n    This alert should always be firing in dev\n" err="empty ring"
level=info ts=2022-06-27T14:40:21.7315089Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(absent_over_time({cluster=~\".+\"}[1m]) == 1)" query_type=metric range_type=instant length=0s step=0s duration=53.9µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:21.7315527Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: LokiNoLogsFoundForAnyCluster\nexpr: (absent_over_time({cluster=~\".+\"}[1m]) == 1)\nfor: 5m\nlabels:\n  severity: critical\nannotations:\n  message: |\n    Loki is reporting no logs received from any cluster in the last 5 minutes\n" err="empty ring"
level=info ts=2022-06-27T14:40:21.7317404Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(rate({cluster=~\".+\", job=\"kube-prometheus-stack/thanos\"} |= \"update of node failed\"[1m]) > 0.01)" query_type=metric range_type=instant length=0s step=0s duration=91.2µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:21.7317749Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: ThanosQueryNoData\nexpr: (rate({cluster=~\".+\", job=\"kube-prometheus-stack/thanos\"} |= \"update of node\n  failed\"[1m]) > 0.01)\nlabels:\n  severity: critical\nannotations:\n  message: |\n    Thanos has failed to update nodes/collect logs within the last minute\n" err="empty ring"
level=info ts=2022-06-27T14:40:49.3050531Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))" query_type=metric range_type=instant length=0s step=0s duration=126.699µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:49.3051283Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceMosScoreLessThan4\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))\n" err="empty ring"
level=info ts=2022-06-27T14:40:49.3053076Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_PacketLossPercentage>0[10s]))" query_type=metric range_type=instant length=0s step=0s duration=64.6µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:49.3053434Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServicePacketLossPercentageGreaterThan0\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_PacketLossPercentage>0[10s]))\n" err="empty ring"
level=info ts=2022-06-27T14:40:49.3054569Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(count_over_time({app=\"acp-service-srs\"} |= \"Error\" != \"Warning\"[1m]) > 1)" query_type=metric range_type=instant length=0s step=0s duration=44.9µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:49.3054907Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceErrorsAndWarningsCount\nexpr: (count_over_time({app=\"acp-service-srs\"} |= \"Error\" != \"Warning\"[1m]) > 1)\nfor: 10s\n" err="empty ring"
level=info ts=2022-06-27T14:40:56.6814446Z caller=memberlist_client.go:542 msg="joined memberlist cluster" reached_nodes=2
level=info ts=2022-06-27T14:41:21.7320922Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(count_over_time({app=~\"prometheus\"}[1m]) > 0)" query_type=metric range_type=instant length=0s step=0s duration=250.999µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:41:21.732152Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: LokiAlwaysFailingTest\nexpr: (count_over_time({app=~\"prometheus\"}[1m]) > 0)\nlabels:\n  severity: critical\nannotations:\n  message: |\n    This alert should always be firing in dev\n" err="too many unhealthy instances in the ring"
level=info ts=2022-06-27T14:41:21.7324097Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(absent_over_time({cluster=~\".+\"}[1m]) == 1)" query_type=metric range_type=instant length=0s step=0s duration=72.5µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:41:21.7324569Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: LokiNoLogsFoundForAnyCluster\nexpr: (absent_over_time({cluster=~\".+\"}[1m]) == 1)\nfor: 5m\nlabels:\n  severity: critical\nannotations:\n  message: |\n    Loki is reporting no logs received from any cluster in the last 5 minutes\n" err="too many unhealthy instances in the ring"
level=info ts=2022-06-27T14:41:21.7326536Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(rate({cluster=~\".+\", job=\"kube-prometheus-stack/thanos\"} |= \"update of node failed\"[1m]) > 0.01)" query_type=metric range_type=instant length=0s step=0s duration=84.6µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:41:21.732693Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: ThanosQueryNoData\nexpr: (rate({cluster=~\".+\", job=\"kube-prometheus-stack/thanos\"} |= \"update of node\n  failed\"[1m]) > 0.01)\nlabels:\n  severity: critical\nannotations:\n  message: |\n    Thanos has failed to update nodes/collect logs within the last minute\n" err="too many unhealthy instances in the ring"
level=info ts=2022-06-27T14:41:49.3056573Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))" query_type=metric range_type=instant length=0s step=0s duration=182.8µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:41:49.3057163Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceMosScoreLessThan4\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))\n" err="too many unhealthy instances in the ring"

Here is the list of services Loki has:

kubectl get services -n loki
NAME                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
loki-gateway          ClusterIP   10.96.78.215   <none>        3100/TCP            35m
loki-memberlist       ClusterIP   None           <none>        7946/TCP            35m
loki-read             ClusterIP   10.96.20.165   <none>        3100/TCP,9095/TCP   35m
loki-read-headless    ClusterIP   None           <none>        3100/TCP,9095/TCP   35m
loki-write            ClusterIP   10.96.52.84    <none>        3100/TCP,9095/TCP   35m
loki-write-headless   ClusterIP   None           <none>        3100/TCP,9095/TCP   35m

joe-alford avatar Jun 27 '22 15:06 joe-alford

It doesn't look like the ruler ring is configured correctly. Could you include the actual rendered config from the ConfigMap deployed by the Helm chart?
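
(For reference, one way to pull the rendered config is sketched below; the ConfigMap name and key are assumptions based on the service names above, and some chart versions render the config into a Secret instead.)

# everything the chart rendered for this release
helm -n loki get manifest loki

# or just the Loki config itself
kubectl -n loki get configmap loki -o jsonpath='{.data.config\.yaml}'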

trevorwhitney avatar Jun 29 '22 18:06 trevorwhitney

Hi, any updates?

LinTechSo avatar Aug 03 '22 07:08 LinTechSo

I am getting the same error with loki-simple-scalable 1.4.1 and object storage (not AWS, but similar) as the backend.

I am not using the loki-gateway; instead, Grafana connects directly to the loki-read service. What is strange in my case is that Grafana can reach Loki at first, but after some time it no longer can, so the error does not occur immediately.

As I only have a small environment, I am running 3 replicas of loki-write and 1 replica of loki-read.

Grafana is logging (when testing the datasource):

logger=context traceID=00000000000000000000000000000000 userId=1 orgId=1 uname=admin t=2022-09-29T13:28:21.723434589Z level=error msg="Failed to call resource" error="too many unhealthy instances in the ring\n" traceID=00000000000000000000000000000000
logger=context traceID=00000000000000000000000000000000 userId=1 orgId=1 uname=admin t=2022-09-29T13:28:21.723527353Z level=error msg="Request Completed" method=GET path=/api/datasources/6/resources/labels status=500 remote_addr=83.135.39.98 time_ms=22 duration=22.424318ms size=83 referer=https://grafana.xxxxxx/datasources/edit/loki-k09 traceID=00000000000000000000000000000000 handler=/api/datasources/:id/resources/*

loki-read is logging:

level=warn ts=2022-09-28T20:43:09.587511414Z caller=pool.go:184 msg="removing ingester failing healthcheck" addr=172.25.1.69:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

loki-write is logging:

level=warn ts=2022-09-29T13:41:22.082927033Z caller=logging.go:72 traceID=7b27e088009f9574 orgID=fake msg="POST /loki/api/v1/push (500) 6.437119ms Response: \"at least 2 live replicas required, could only find 1 - unhealthy instances: 172.25.1.69:9095,172.25.2.146:9095\\n\" ws: false; Content-Length: 145358; Content-Type: application/x-protobuf; User-Agent: promtail/2.5.0;"

I am going to delete the Loki instance and deploy a new one based on chart version 1.8.11 to see if that is any better.


EDIT: that is actually working. I will keep watching it and give further feedback.

rdxmb avatar Sep 30 '22 09:09 rdxmb

I tried running one replica each for read and write on the latest chart version and got the same error. I'm new to Loki, so I don't understand it very well yet.

mateuszdrab avatar Oct 25 '22 21:10 mateuszdrab

Got similar error messages and had to adjust

commonConfig:
  replication_factor: 1

to match my number of instances.

The value went from the original 2 (I only have two nodes, so the default replication factor of 3 was too much) down to 1, because after changing the storage type from s3 (the default) to filesystem, everything changed (the read, write and gateway components were gone).

sigi-tw avatar Jan 29 '23 16:01 sigi-tw

Similar problem; this is what worked for me, copied from https://github.com/grafana/loki/issues/10537#issuecomment-1759899640:

In the values.yaml of the Helm chart I added:

  loki:
    commonConfig:
      # set to 1, otherwise more replicas are needed to connect to grafana
      replication_factor: 1

And I was able to set the rest to 1:

  write:
    replicas: 1
    persistence:
      storageClass: gp2
  read:
    replicas: 1
    persistence:
      storageClass: gp2
  backend:
    replicas: 1
    persistence:
      storageClass: gp2

Ca-moes avatar Oct 12 '23 15:10 Ca-moes

@Ca-moes According to the documentation, setting replication_factor to 1 is for monolithic mode. I'm currently using the simple scalable mode, so the replication factor must be greater than 1.
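
If you do stay on the simple scalable mode, the replication factor then needs to be backed by at least that many write replicas. A sketch using the same values paths as the comments above (assuming the newer grafana/loki chart layout with a backend component):

helm upgrade --install loki grafana/loki -n loki \
  --set loki.commonConfig.replication_factor=3 \
  --set write.replicas=3 \
  --set read.replicas=3 \
  --set backend.replicas=3

Either way, the point running through this thread is that the replication factor and the number of write instances have to be consistent with each other.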

ngochieu642 avatar Jan 30 '24 14:01 ngochieu642