
[loki-simple-scalable] Helm Chart and Grafana break after updating to 1.4.3

Open joe-alford opened this issue 3 years ago • 7 comments

Hi all,

I had Loki deployed using the loki-simple-scalable 0.4.0 Helm chart, and it was working with Grafana just fine. I've recently tried to update my dev environment from 0.4.0 to 1.4.3 and am running into issues. I assume there is a breaking change somewhere in such a big version jump, but I can't see any changelogs or release notes suggesting what this might be (am I being blind?).

Since going to 1.4.3 I get either "502: bad gateway" or "Loki: Internal Server Error. 500. too many unhealthy instances in the ring" from Grafana.

I fixed the 502 by changing the Grafana datasource URL from loki-loki-simple-scalable-gateway.loki.svc.cluster.local to loki-gateway.loki.svc.cluster.local, and the memberlist address from loki-loki-simple-scalable-memberlist.loki.svc.cluster.local to loki-memberlist.loki.svc.cluster.local (as well as updating anything else that used loki-loki-simple-scalable-* to just loki-*).

Since doing that, I am now getting the 500 "too many unhealthy instances" message. Is there something obvious I should be changing? I'm guessing something changed between 0.4.0 and 1.0.0, but I don't see any meaningful updates in the README.
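
One way to hunt for the breaking change is to diff the chart's default values between the two versions (just a sketch; the grafana repo alias is an assumption, since our chart actually comes from an internal HelmRepository mirror):

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm show values grafana/loki-simple-scalable --version 0.4.0 > values-0.4.0.yaml
helm show values grafana/loki-simple-scalable --version 1.4.3 > values-1.4.3.yaml
diff -u values-0.4.0.yaml values-1.4.3.yaml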

Below is our relevant Helm Release config and log output:

Grafana (included for reference):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: grafana
  namespace: grafana
  labels: 
    kustomize.toolkit.fluxcd.io/substitute: disabled # to stop it expanding "$${__value.raw}" in the Loki config
spec:
  values:
    datasources:
      datasources.yaml:
          - name: Loki
            type: loki
            uid: Loki
            url: http://loki-gateway.loki.svc.cluster.local:3100 
            isDefault: false
            jsonData:
              maxLines: 1000
              derivedFields:
                - datasourceUid: Tempo
                  matcherRegex: "TraceId:(.+?),"
                  name: TraceID
                  url: "$${__value.raw}"

And here is Loki:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: loki
  namespace: loki
spec:
  chart:
    spec:
      chart: loki-simple-scalable
      version: 1.4.3
      sourceRef:
        kind: HelmRepository
        name: <redacted>-helm-repo
        namespace: flux-system
  interval: 1m
  values:
    serviceMonitor:
      enabled: true
    gateway:
      image:
        registry: docker.internal.<redacted>.net
        repository: nginxinc/nginx-unprivileged
      service:
        port: 3100
      nginxConfig:
        serverSnippet: |
          location ~ /loki/api/v1/alerts.* {
            proxy_pass       http://loki-read.loki.svc.cluster.local:3100$request_uri;
          }

          location ~ /prometheus/api/v1/rules.* {
            proxy_pass       http://loki-read.loki.svc.cluster.local:3100$request_uri;
          }
        httpSnippet: |
          client_max_body_size 0;
    write:
      repository: docker.internal.<redacted>.net/grafana/loki
      replicas: 1
      resources:
        limits:
          memory: "4Gi"
      persistence:
        size: 10Gi
        storageClass: gp3 # this is the default, but calling it out explicitly so it can be overridden for dev
    read:
      replicas: 3
      persistence:
        size: 10Gi
        storageClass: gp3
      extraVolumeMounts:
        - name: loki-rules
          mountPath: /rules/fake
        - name: loki-rules-tmp
          mountPath: /tmp/scratch
        - name: loki-tmp
          mountPath: /tmp/loki-tmp
      extraVolumes:
        - name: loki-rules
          configMap:
            name: loki-alerting-rules
        - name: loki-rules-tmp
          emptyDir: {}
        - name: loki-tmp
          emptyDir: {}    

    loki: 
      image:
        registry: docker.internal.<redacted>.net
        repository: grafana/loki
        tag: 2.5.0
      structuredConfig:
        memberlist:
          join_members:
            - loki-memberlist.loki.svc.cluster.local
        auth_enabled: false
        server:
          http_listen_port: 3100
          log_level: info
          grpc_server_max_recv_msg_size: 104857600
          grpc_server_max_send_msg_size: 104857600
        schema_config: 
          configs:
          - from: "2020-11-04"
            store: boltdb-shipper
            object_store: aws
            schema: v11
            index:
              prefix: index_
              period: 24h
        storage_config: 
          boltdb_shipper:
            active_index_directory: /var/loki/index
            cache_location: /var/loki/boltdb-cache
            shared_store: s3
        ruler:
          storage:
            type: local
            local:
              directory: /rules
          rule_path: /tmp/scratch
          enable_api: true
          alertmanager_url: kube-prometheues-stack-kub-alertmanager-0.kube-prometheus-stack.svc.cluster.local, kube-prometheues-stack-kub-alertmanager-1.kube-prometheus-stack.svc.cluster.local, kube-prometheues-stack-kub-alertmanager-2.kube-prometheus-stack.svc.cluster.local
        limits_config:
          enforce_metric_name: false
          reject_old_samples: true
          reject_old_samples_max_age: 168h
          ingestion_rate_mb: 30
          ingestion_burst_size_mb: 16
          retention_period: 336h
          max_query_lookback: 336h
          max_streams_per_user: 0
          max_global_streams_per_user: 0
        compactor:
          working_directory: /var/loki/boltdb-shipper-compactor
          shared_store: filesystem
          retention_enabled: true
        chunk_store_config:
          chunk_cache_config:
            enable_fifocache: true
            fifocache:
              max_size_bytes: 500MB
        query_range:
          results_cache:
            cache:
              enable_fifocache: true
              fifocache:
                max_size_bytes: 500MB
        analytics:
          reporting_enabled: false
        ingester:
          max_chunk_age: 1h
          chunk_encoding: snappy

When I spin these up together and hit "Test" on the datasource in Grafana, I get: Loki: Internal Server Error. 500. too many unhealthy instances in the ring.
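
A quick way to see which members Loki actually considers unhealthy is to query the ring status pages on one of the components directly (a sketch, assuming the default HTTP port 3100 from the config above; the exact endpoints can vary slightly between Loki versions):

# in one terminal
kubectl -n loki port-forward svc/loki-write 3100:3100

# in another terminal
curl -s http://localhost:3100/ring        # ingester ring members and their state
curl -s http://localhost:3100/memberlist  # memberlist's view of the cluster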

Here are the Loki pod logs:

kubectl get pods -n loki
NAME                            READY   STATUS    RESTARTS   AGE
loki-gateway-5d585556bc-tl2wb   1/1     Running   0          18m
loki-read-0                     1/1     Running   0          22m
loki-write-0                    1/1     Running   0          22

Gateway:

kubectl logs -n loki loki-gateway-5d585556bc-tl2wb
/docker-entrypoint.sh: No files found in /docker-entrypoint.d/, skipping configuration
10.244.0.1 - - [27/Jun/2022:14:45:14 +0000]  200 "GET / HTTP/1.1" 2 "-" "kube-probe/1.22" "-"

10.244.0.1 - - [27/Jun/2022:14:49:44 +0000]  200 "GET / HTTP/1.1" 2 "-" "kube-probe/1.22" "-"
2022/06/27 14:49:52 [error] 13#13: *29 open() "/etc/nginx/html/api/v1/status/buildinfo" failed (2: No such file or directory), client: 10.244.0.47, server: , request: "GET /api/v1/status/buildinfo HTTP/1.1", host: "loki-gateway.loki.svc.cluster.local:3100"
10.244.0.47 - - [27/Jun/2022:14:49:52 +0000]  404 "GET /api/v1/status/buildinfo HTTP/1.1" 154 "-" "Grafana/8.5.0" "10.244.0.14, 10.244.0.14"
10.244.0.47 - - [27/Jun/2022:14:49:52 +0000]  200 "GET /prometheus/api/v1/rules HTTP/1.1" 2882 "-" "Grafana/8.5.0" "-"
10.244.0.47 - - [27/Jun/2022:14:49:52 +0000]  400 "GET /api/prom/rules/test/test HTTP/1.1" 45 "-" "Grafana/8.5.0" "-"
10.244.0.47 - - [27/Jun/2022:14:49:52 +0000]  200 "GET /prometheus/api/v1/rules HTTP/1.1" 2882 "-" "Grafana/8.5.0" "-"
10.244.0.1 - - [27/Jun/2022:14:49:54 +0000]  200 "GET / HTTP/1.1" 2 "-" "kube-probe/1.22" "-"
10.244.0.47 - - [27/Jun/2022:14:49:58 +0000]  500 "GET /loki/api/v1/label?start=1656340798905000000 HTTP/1.1" 41 "-" "Grafana/8.5.0" "10.244.0.14, 10.244.0.14"

All of these lines are repeated several times.

Write:

 kubectl logs -n loki loki-write-0
level=info ts=2022-06-27T14:39:29.420766Z caller=main.go:106 msg="Starting Loki" version="(version=2.5.0, branch=HEAD, revision=2d9d0ee23)"
level=info ts=2022-06-27T14:39:29.4210416Z caller=server.go:260 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=warn ts=2022-06-27T14:39:29.4212857Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=warn ts=2022-06-27T14:39:29.421421Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=info ts=2022-06-27T14:39:29.421723Z caller=shipper_index_client.go:111 msg="starting boltdb shipper in 2 mode"
level=info ts=2022-06-27T14:39:29.4217657Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:39:29.4237756Z caller=memberlist_client.go:394 msg="Using memberlist cluster node name" name=loki-write-0-fe8583fa
level=info ts=2022-06-27T14:39:29.4272028Z caller=module_service.go:64 msg=initialising module=store
level=info ts=2022-06-27T14:39:29.4273979Z caller=module_service.go:64 msg=initialising module=server
level=info ts=2022-06-27T14:39:29.4274158Z caller=module_service.go:64 msg=initialising module=memberlist-kv
level=info ts=2022-06-27T14:39:29.427578Z caller=module_service.go:64 msg=initialising module=ring
level=info ts=2022-06-27T14:39:29.4276925Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:29.4277553Z caller=module_service.go:64 msg=initialising module=ingester
level=info ts=2022-06-27T14:39:29.4277866Z caller=ingester.go:398 msg="recovering from checkpoint"
level=info ts=2022-06-27T14:39:29.4278962Z caller=module_service.go:64 msg=initialising module=distributor
level=info ts=2022-06-27T14:39:29.4279556Z caller=ingester.go:414 msg="recovered WAL checkpoint recovery finished" elapsed=174.5µs errors=false
level=info ts=2022-06-27T14:39:29.4279841Z caller=ingester.go:420 msg="recovering from WAL"
level=info ts=2022-06-27T14:39:29.4280047Z caller=lifecycler.go:546 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2022-06-27T14:39:29.4281001Z caller=lifecycler.go:575 msg="instance not found in ring, adding with no tokens" ring=distributor
level=info ts=2022-06-27T14:39:29.4281252Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:29.4281094Z caller=ingester.go:436 msg="WAL segment recovery finished" elapsed=328.3µs errors=false
level=info ts=2022-06-27T14:39:29.4282506Z caller=ingester.go:384 msg="closing recoverer"
ts=2022-06-27T14:39:29.4282719Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:29.4283205Z caller=ingester.go:392 msg="WAL recovery finished" time=538.8µs
level=info ts=2022-06-27T14:39:29.428355Z caller=lifecycler.go:422 msg="auto-joining cluster after timeout" ring=distributor
level=info ts=2022-06-27T14:39:29.4283883Z caller=lifecycler.go:546 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2022-06-27T14:39:29.428426Z caller=loki.go:372 msg="Loki started"
level=info ts=2022-06-27T14:39:29.4284394Z caller=wal.go:156 msg=started component=wal
level=info ts=2022-06-27T14:39:29.4284446Z caller=lifecycler.go:575 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2022-06-27T14:39:29.4285516Z caller=lifecycler.go:422 msg="auto-joining cluster after timeout" ring=ingester
ts=2022-06-27T14:39:30.548897Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:39:33.4948198Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:39:40.1071028Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:39:50.9613016Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:40:12.0220944Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:40:29.4218763Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:41:08.5736488Z caller=memberlist_client.go:542 msg="joined memberlist cluster" reached_nodes=2
level=info ts=2022-06-27T14:41:29.4220935Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:42:29.4228876Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:43:29.4219004Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:44:29.422408Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:44:29.428556Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2022-06-27T14:44:29.4297238Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/var/loki/wal/checkpoint.000004
level=info ts=2022-06-27T14:44:29.4342957Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/var/loki/wal/checkpoint.000004.tmp new=/var/loki/wal/checkpoint.000004
level=info ts=2022-06-27T14:45:29.4218623Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:46:29.4225586Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:47:29.4227801Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:48:29.4219661Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:49:29.4224932Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:49:29.4291595Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2022-06-27T14:49:29.4292888Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/var/loki/wal/checkpoint.000005
level=info ts=2022-06-27T14:49:29.4346005Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/var/loki/wal/checkpoint.000005.tmp new=/var/loki/wal/checkpoint.000005
level=info ts=2022-06-27T14:50:29.4218727Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:51:29.4219922Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:52:29.4219176Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:53:29.4221026Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:54:29.4223405Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:54:29.4284884Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2022-06-27T14:54:29.428659Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/var/loki/wal/checkpoint.000006
level=info ts=2022-06-27T14:54:29.4328397Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/var/loki/wal/checkpoint.000006.tmp new=/var/loki/wal/checkpoint.000006
level=info ts=2022-06-27T14:55:29.4219931Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:56:29.422355Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:57:29.4219955Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:58:29.4226244Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:59:29.4218699Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:59:29.428517Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2022-06-27T14:59:29.4286635Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/var/loki/wal/checkpoint.000007
level=info ts=2022-06-27T14:59:29.4333984Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/var/loki/wal/checkpoint.000007.tmp new=/var/loki/wal/checkpoint.000007
level=info ts=2022-06-27T15:00:29.4225507Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T15:01:29.4224497Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T15:02:29.4222936Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T15:03:29.4219203Z caller=table_manager.go:169 msg="uploading tables"
Interestingly, the memberlist service does resolve once I exec into the write pod (and the logs above show it eventually joins the memberlist cluster), so the "no such host" warnings look like a startup race:

kubectl exec -it -n loki loki-write-0 -- sh
/ $ nslookup loki-memberlist.loki.svc.cluster.local
Server:         10.96.0.10
Address:        10.96.0.10:53


Name:   loki-memberlist.loki.svc.cluster.local
Address: 10.244.0.44
Name:   loki-memberlist.loki.svc.cluster.local
Address: 10.244.0.45

/ $

Read:

kubectl logs -n loki loki-read-0
level=info ts=2022-06-27T14:39:28.7625301Z caller=main.go:106 msg="Starting Loki" version="(version=2.5.0, branch=HEAD, revision=2d9d0ee23)"
level=info ts=2022-06-27T14:39:28.7628752Z caller=server.go:260 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2022-06-27T14:39:28.7639932Z caller=memberlist_client.go:394 msg="Using memberlist cluster node name" name=loki-read-0-8e2ba528
level=warn ts=2022-06-27T14:39:28.7649674Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=info ts=2022-06-27T14:39:28.7652472Z caller=shipper_index_client.go:111 msg="starting boltdb shipper in 1 mode"
level=info ts=2022-06-27T14:39:28.7662616Z caller=worker.go:112 msg="Starting querier worker using query-scheduler and scheduler ring for addresses"
level=info ts=2022-06-27T14:39:28.7674493Z caller=mapper.go:46 msg="cleaning up mapped rules directory" path=/tmp/scratch
ts=2022-06-27T14:39:28.7689852Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:28.7699166Z caller=module_service.go:64 msg=initialising module=memberlist-kv
level=info ts=2022-06-27T14:39:28.769929Z caller=module_service.go:64 msg=initialising module=server
level=info ts=2022-06-27T14:39:28.7701931Z caller=module_service.go:64 msg=initialising module=compactor
level=info ts=2022-06-27T14:39:28.7702131Z caller=module_service.go:64 msg=initialising module=query-frontend-tripperware
level=info ts=2022-06-27T14:39:28.770214Z caller=module_service.go:64 msg=initialising module=ring
level=info ts=2022-06-27T14:39:28.7702001Z caller=module_service.go:64 msg=initialising module=query-scheduler
level=info ts=2022-06-27T14:39:28.7702876Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:28.7703295Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:28.7703453Z caller=basic_lifecycler.go:260 msg="instance not found in the ring" instance=loki-read-0 ring=compactor
level=info ts=2022-06-27T14:39:28.7703546Z caller=basic_lifecycler.go:260 msg="instance not found in the ring" instance=loki-read-0 ring=scheduler
level=info ts=2022-06-27T14:39:28.7703634Z caller=basic_lifecycler_delegates.go:63 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2022-06-27T14:39:28.7703672Z caller=basic_lifecycler_delegates.go:63 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2022-06-27T14:39:28.7706609Z caller=compactor.go:264 msg="waiting until compactor is JOINING in the ring"
level=info ts=2022-06-27T14:39:28.7706828Z caller=compactor.go:268 msg="compactor is JOINING in the ring"
level=info ts=2022-06-27T14:39:28.7706944Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:28.7707242Z caller=scheduler.go:610 msg="waiting until scheduler is JOINING in the ring"
level=info ts=2022-06-27T14:39:28.7707408Z caller=module_service.go:64 msg=initialising module=ingester-querier
level=info ts=2022-06-27T14:39:28.7707431Z caller=scheduler.go:614 msg="scheduler is JOINING in the ring"
level=info ts=2022-06-27T14:39:28.770811Z caller=module_service.go:64 msg=initialising module=store
level=info ts=2022-06-27T14:39:28.7708208Z caller=module_service.go:64 msg=initialising module=ruler
level=info ts=2022-06-27T14:39:28.7708341Z caller=ruler.go:450 msg="ruler up and running"
level=info ts=2022-06-27T14:39:28.7720462Z caller=mapper.go:154 msg="updating rule file" file=/tmp/scratch/fake/loki.yaml
level=info ts=2022-06-27T14:39:28.772156Z caller=mapper.go:154 msg="updating rule file" file=/tmp/scratch/fake/srs.yaml
level=info ts=2022-06-27T14:39:29.7714738Z caller=scheduler.go:624 msg="waiting until scheduler is ACTIVE in the ring"
level=info ts=2022-06-27T14:39:29.7715699Z caller=compactor.go:278 msg="waiting until compactor is ACTIVE in the ring"
level=info ts=2022-06-27T14:39:29.7716054Z caller=compactor.go:282 msg="compactor is ACTIVE in the ring"
level=info ts=2022-06-27T14:39:29.8847963Z caller=scheduler.go:628 msg="scheduler is ACTIVE in the ring"
level=info ts=2022-06-27T14:39:29.8848902Z caller=module_service.go:64 msg=initialising module=querier
level=info ts=2022-06-27T14:39:29.8849187Z caller=module_service.go:64 msg=initialising module=query-frontend
level=info ts=2022-06-27T14:39:29.8850374Z caller=loki.go:372 msg="Loki started"
ts=2022-06-27T14:39:30.1704537Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:39:32.1798807Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:32.8853412Z caller=scheduler.go:661 msg="this scheduler is in the ReplicationSet, will now accept requests."
level=info ts=2022-06-27T14:39:32.8853813Z caller=worker.go:209 msg="adding connection" addr=10.244.0.44:9095
level=info ts=2022-06-27T14:39:34.7725692Z caller=compactor.go:324 msg="this instance has been chosen to run the compactor, starting compactor"
level=info ts=2022-06-27T14:39:34.7726541Z caller=compactor.go:351 msg="waiting 10m0s for ring to stay stable and previous compactions to finish before starting compactor"
ts=2022-06-27T14:39:38.8869391Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:39.8852869Z caller=frontend_scheduler_worker.go:101 msg="adding connection to scheduler" addr=10.244.0.44:9095
ts=2022-06-27T14:39:47.2264647Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:49.3050763Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))" query_type=metric range_type=instant length=0s step=0s duration=152.4µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:39:49.3051379Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceMosScoreLessThan4\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))\n" err="empty ring"
level=info ts=2022-06-27T14:39:49.3052792Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_PacketLossPercentage>0[10s]))" query_type=metric range_type=instant length=0s step=0s duration=49.5µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:39:49.3053241Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServicePacketLossPercentageGreaterThan0\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_PacketLossPercentage>0[10s]))\n" err="empty ring"
level=info ts=2022-06-27T14:39:49.3055688Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(count_over_time({app=\"acp-service-srs\"} |= \"Error\" != \"Warning\"[1m]) > 1)" query_type=metric range_type=instant length=0s step=0s duration=133.1µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:39:49.3056162Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceErrorsAndWarningsCount\nexpr: (count_over_time({app=\"acp-service-srs\"} |= \"Error\" != \"Warning\"[1m]) > 1)\nfor: 10s\n" err="empty ring"
ts=2022-06-27T14:40:03.3543264Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:40:21.7312888Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(count_over_time({app=~\"prometheus\"}[1m]) > 0)" query_type=metric range_type=instant length=0s step=0s duration=125.499µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:21.7313421Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: LokiAlwaysFailingTest\nexpr: (count_over_time({app=~\"prometheus\"}[1m]) > 0)\nlabels:\n  severity: critical\nannotations:\n  message: |\n    This alert should always be firing in dev\n" err="empty ring"
level=info ts=2022-06-27T14:40:21.7315089Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(absent_over_time({cluster=~\".+\"}[1m]) == 1)" query_type=metric range_type=instant length=0s step=0s duration=53.9µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:21.7315527Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: LokiNoLogsFoundForAnyCluster\nexpr: (absent_over_time({cluster=~\".+\"}[1m]) == 1)\nfor: 5m\nlabels:\n  severity: critical\nannotations:\n  message: |\n    Loki is reporting no logs received from any cluster in the last 5 minutes\n" err="empty ring"
level=info ts=2022-06-27T14:40:21.7317404Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(rate({cluster=~\".+\", job=\"kube-prometheus-stack/thanos\"} |= \"update of node failed\"[1m]) > 0.01)" query_type=metric range_type=instant length=0s step=0s duration=91.2µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:21.7317749Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: ThanosQueryNoData\nexpr: (rate({cluster=~\".+\", job=\"kube-prometheus-stack/thanos\"} |= \"update of node\n  failed\"[1m]) > 0.01)\nlabels:\n  severity: critical\nannotations:\n  message: |\n    Thanos has failed to update nodes/collect logs within the last minute\n" err="empty ring"
level=info ts=2022-06-27T14:40:49.3050531Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))" query_type=metric range_type=instant length=0s step=0s duration=126.699µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:49.3051283Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceMosScoreLessThan4\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))\n" err="empty ring"
level=info ts=2022-06-27T14:40:49.3053076Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_PacketLossPercentage>0[10s]))" query_type=metric range_type=instant length=0s step=0s duration=64.6µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:49.3053434Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServicePacketLossPercentageGreaterThan0\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_PacketLossPercentage>0[10s]))\n" err="empty ring"
level=info ts=2022-06-27T14:40:49.3054569Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(count_over_time({app=\"acp-service-srs\"} |= \"Error\" != \"Warning\"[1m]) > 1)" query_type=metric range_type=instant length=0s step=0s duration=44.9µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:49.3054907Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceErrorsAndWarningsCount\nexpr: (count_over_time({app=\"acp-service-srs\"} |= \"Error\" != \"Warning\"[1m]) > 1)\nfor: 10s\n" err="empty ring"
level=info ts=2022-06-27T14:40:56.6814446Z caller=memberlist_client.go:542 msg="joined memberlist cluster" reached_nodes=2
level=info ts=2022-06-27T14:41:21.7320922Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(count_over_time({app=~\"prometheus\"}[1m]) > 0)" query_type=metric range_type=instant length=0s step=0s duration=250.999µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:41:21.732152Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: LokiAlwaysFailingTest\nexpr: (count_over_time({app=~\"prometheus\"}[1m]) > 0)\nlabels:\n  severity: critical\nannotations:\n  message: |\n    This alert should always be firing in dev\n" err="too many unhealthy instances in the ring"
level=info ts=2022-06-27T14:41:21.7324097Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(absent_over_time({cluster=~\".+\"}[1m]) == 1)" query_type=metric range_type=instant length=0s step=0s duration=72.5µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:41:21.7324569Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: LokiNoLogsFoundForAnyCluster\nexpr: (absent_over_time({cluster=~\".+\"}[1m]) == 1)\nfor: 5m\nlabels:\n  severity: critical\nannotations:\n  message: |\n    Loki is reporting no logs received from any cluster in the last 5 minutes\n" err="too many unhealthy instances in the ring"
level=info ts=2022-06-27T14:41:21.7326536Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(rate({cluster=~\".+\", job=\"kube-prometheus-stack/thanos\"} |= \"update of node failed\"[1m]) > 0.01)" query_type=metric range_type=instant length=0s step=0s duration=84.6µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:41:21.732693Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: ThanosQueryNoData\nexpr: (rate({cluster=~\".+\", job=\"kube-prometheus-stack/thanos\"} |= \"update of node\n  failed\"[1m]) > 0.01)\nlabels:\n  severity: critical\nannotations:\n  message: |\n    Thanos has failed to update nodes/collect logs within the last minute\n" err="too many unhealthy instances in the ring"
level=info ts=2022-06-27T14:41:49.3056573Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))" query_type=metric range_type=instant length=0s step=0s duration=182.8µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:41:49.3057163Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceMosScoreLessThan4\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))\n" err="too many unhealthy instances in the ring"

Here is the list of services Loki has:

kubectl get services -n loki
NAME                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
loki-gateway          ClusterIP   10.96.78.215   <none>        3100/TCP            35m
loki-memberlist       ClusterIP   None           <none>        7946/TCP            35m
loki-read             ClusterIP   10.96.20.165   <none>        3100/TCP,9095/TCP   35m
loki-read-headless    ClusterIP   None           <none>        3100/TCP,9095/TCP   35m
loki-write            ClusterIP   10.96.52.84    <none>        3100/TCP,9095/TCP   35m
loki-write-headless   ClusterIP   None           <none>        3100/TCP,9095/TCP   35m

joe-alford avatar Jun 27 '22 15:06 joe-alford

It doesn't look like the ruler ring is configured correctly. Could you include the actual rendered config from the ConfigMap deployed by the Helm chart?
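
(For reference, one way to pull the rendered config is sketched below; the ConfigMap name and key are assumptions based on the service names above, and some chart versions render the config into a Secret instead.)

# everything the chart rendered for this release
helm -n loki get manifest loki

# or just the Loki config itself
kubectl -n loki get configmap loki -o jsonpath='{.data.config\.yaml}'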

trevorwhitney avatar Jun 29 '22 18:06 trevorwhitney

Hi, any updates?

LinTechSo avatar Aug 03 '22 07:08 LinTechSo

I am getting the same error with loki-simple-scalable 1.4.1 and object storage (not AWS, but similar) as the backend.

I am not using the loki-gateway; instead, Grafana connects directly to the loki-read service. What is strange in my case is that Grafana can reach Loki at first, but after some time it no longer can, so the error does not occur immediately.

As I only have a small environment, I am running 3 replicas of loki-write and 1 replica of loki-read.

Grafana is logging (when testing the datasource):

logger=context traceID=00000000000000000000000000000000 userId=1 orgId=1 uname=admin t=2022-09-29T13:28:21.723434589Z level=error msg="Failed to call resource" error="too many unhealthy instances in the ring\n" traceID=00000000000000000000000000000000
logger=context traceID=00000000000000000000000000000000 userId=1 orgId=1 uname=admin t=2022-09-29T13:28:21.723527353Z level=error msg="Request Completed" method=GET path=/api/datasources/6/resources/labels status=500 remote_addr=83.135.39.98 time_ms=22 duration=22.424318ms size=83 referer=https://grafana.xxxxxx/datasources/edit/loki-k09 traceID=00000000000000000000000000000000 handler=/api/datasources/:id/resources/*

loki-read is logging:

level=warn ts=2022-09-28T20:43:09.587511414Z caller=pool.go:184 msg="removing ingester failing healthcheck" addr=172.25.1.69:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"

loki-write is logging:

level=warn ts=2022-09-29T13:41:22.082927033Z caller=logging.go:72 traceID=7b27e088009f9574 orgID=fake msg="POST /loki/api/v1/push (500) 6.437119ms Response: \"at least 2 live replicas required, could only find 1 - unhealthy instances: 172.25.1.69:9095,172.25.2.146:9095\\n\" ws: false; Content-Length: 145358; Content-Type: application/x-protobuf; User-Agent: promtail/2.5.0;"

I am going to delete the Loki instance and deploy a new one based on chart version 1.8.11 to see if that is any better.


EDIT: that is actually working. I will keep watching it and give further feedback.

rdxmb avatar Sep 30 '22 09:09 rdxmb

I tried running one replica each for read and write on the latest chart version and got the same error. I'm new to Loki, so I don't understand it very well yet.

mateuszdrab avatar Oct 25 '22 21:10 mateuszdrab

Got similar error messages and had to adjust

commonConfig:
  replication_factor: 1

to match my number of instances.

The value went from the original 2 (I only have two nodes, so the default replication factor of 3 was too much) down to 1, because after changing the storage type from s3 (the default) to filesystem, everything changed (the read, write and gateway components were gone).

sigi-tw avatar Jan 29 '23 16:01 sigi-tw

Similar problem; this is what worked for me, copied from https://github.com/grafana/loki/issues/10537#issuecomment-1759899640:

In the values.yaml of the Helm chart I added:

  loki:
    commonConfig:
      # set to 1, otherwise more replicas are needed to connect to grafana
      replication_factor: 1

And I was able to set the rest to 1:

  write:
    replicas: 1
    persistence:
      storageClass: gp2
  read:
    replicas: 1
    persistence:
      storageClass: gp2
  backend:
    replicas: 1
    persistence:
      storageClass: gp2

Ca-moes avatar Oct 12 '23 15:10 Ca-moes

@Ca-moes According to the documentation, setting replication_factor to 1 is for monolithic mode. I'm currently using the simple scalable mode, so the replication factor must be greater than 1.
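
If you do stay on the simple scalable mode, the replication factor then needs to be backed by at least that many write replicas. A sketch using the same values paths as the comments above (assuming the newer grafana/loki chart layout with a backend component):

helm upgrade --install loki grafana/loki -n loki \
  --set loki.commonConfig.replication_factor=3 \
  --set write.replicas=3 \
  --set read.replicas=3 \
  --set backend.replicas=3

Either way, the point running through this thread is that the replication factor and the number of write instances have to be consistent with each other.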

ngochieu642 avatar Jan 30 '24 14:01 ngochieu642