[loki-simple-scalable] Helm Chart and Grafana break after updating to 1.4.3
Hi all,
I had Loki deployed using the loki-simple-scalable 0.4.0 Helm Chart, and it was working with Grafana just fine. I've recently tried to update my dev environment to 1.4.3 from 0.4.0, and am running into issues. I assume there is a breaking change somewhere in such a big version jump, but I can't see any change logs or release notes (am I being blind?) suggesting what this might be.
Since going to 1.4.3 I get either "502: bad gateway" or "Loki: Internal Server Error. 500. too many unhealthy instances in the ring" from Grafana.
I fixed the 502 by changing the Grafana datasource URL from loki-loki-simple-scalable-gateway.loki.svc.cluster.local to loki-gateway.loki.svc.cluster.local, and the memberlist address from loki-loki-simple-scalable-memberlist.loki.svc.cluster.local to loki-memberlist.loki.svc.cluster.local (as well as updating anything else that used loki-loki-simple-scalable-* to just loki-*).
Since doing that, I am now getting the 500 "too many unhealthy instances" message. Is there something obvious I should be changing? I'm guessing something changed between 0.4.0 and 1.0.0, but I don't see any meaningful update in the readme.
Below is our relevant Helm Release config and log output:
Grafana (included for reference):
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: grafana
  namespace: grafana
  labels:
    kustomize.toolkit.fluxcd.io/substitute: disabled # to stop it expanding out "$${__value.raw}" in the Loki config
spec:
  values:
    datasources:
      datasources.yaml:
        - name: Loki
          type: loki
          uid: Loki
          url: http://loki-gateway.loki.svc.cluster.local:3100
          isDefault: false
          jsonData:
            maxLines: 1000
            derivedFields:
              - datasourceUid: Tempo
                matcherRegex: "TraceId:(.+?),"
                name: TraceID
                url: "$${__value.raw}"
And here is Loki:
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: loki
  namespace: loki
spec:
  chart:
    spec:
      chart: loki-simple-scalable
      version: 1.4.3
      sourceRef:
        kind: HelmRepository
        name: <redacted>-helm-repo
        namespace: flux-system
  interval: 1m
  values:
    serviceMonitor:
      enabled: true
    gateway:
      image:
        registry: docker.internal.<redacted>.net
        repository: nginxinc/nginx-unprivileged
      service:
        port: 3100
      nginxConfig:
        serverSnippet: |
          location ~ /loki/api/v1/alerts.* {
            proxy_pass http://loki-read.loki.svc.cluster.local:3100$request_uri;
          }
          location ~ /prometheus/api/v1/rules.* {
            proxy_pass http://loki-read.loki.svc.cluster.local:3100$request_uri;
          }
        httpSnippet: |
          client_max_body_size 0;
    write:
      repository: docker.internal.<redacted>.net/grafana/loki
      replicas: 1
      resources:
        limits:
          memory: "4Gi"
      persistence:
        size: 10Gi
        storageClass: gp3 # this is the default, but calling it out explicitly so it can be overridden for dev
    read:
      replicas: 3
      persistence:
        size: 10Gi
        storageClass: gp3
      extraVolumeMounts:
        - name: loki-rules
          mountPath: /rules/fake
        - name: loki-rules-tmp
          mountPath: /tmp/scratch
        - name: loki-tmp
          mountPath: /tmp/loki-tmp
      extraVolumes:
        - name: loki-rules
          configMap:
            name: loki-alerting-rules
        - name: loki-rules-tmp
          emptyDir: {}
        - name: loki-tmp
          emptyDir: {}
    loki:
      image:
        registry: docker.internal.<redacted>.net
        repository: grafana/loki
        tag: 2.5.0
      structuredConfig:
        memberlist:
          join_members:
            - loki-memberlist.loki.svc.cluster.local
        auth_enabled: false
        server:
          http_listen_port: 3100
          log_level: info
          grpc_server_max_recv_msg_size: 104857600
          grpc_server_max_send_msg_size: 104857600
        schema_config:
          configs:
            - from: "2020-11-04"
              store: boltdb-shipper
              object_store: aws
              schema: v11
              index:
                prefix: index_
                period: 24h
        storage_config:
          boltdb_shipper:
            active_index_directory: /var/loki/index
            cache_location: /var/loki/boltdb-cache
            shared_store: s3
        ruler:
          storage:
            type: local
            local:
              directory: /rules
          rule_path: /tmp/scratch
          enable_api: true
          alertmanager_url: kube-prometheues-stack-kub-alertmanager-0.kube-prometheus-stack.svc.cluster.local, kube-prometheues-stack-kub-alertmanager-1.kube-prometheus-stack.svc.cluster.local, kube-prometheues-stack-kub-alertmanager-2.kube-prometheus-stack.svc.cluster.local
        limits_config:
          enforce_metric_name: false
          reject_old_samples: true
          reject_old_samples_max_age: 168h
          ingestion_rate_mb: 30
          ingestion_burst_size_mb: 16
          retention_period: 336h
          max_query_lookback: 336h
          max_streams_per_user: 0
          max_global_streams_per_user: 0
        compactor:
          working_directory: /var/loki/boltdb-shipper-compactor
          shared_store: filesystem
          retention_enabled: true
        chunk_store_config:
          chunk_cache_config:
            enable_fifocache: true
            fifocache:
              max_size_bytes: 500MB
        query_range:
          results_cache:
            cache:
              enable_fifocache: true
              fifocache:
                max_size_bytes: 500MB
        analytics:
          reporting_enabled: false
        ingester:
          max_chunk_age: 1h
          chunk_encoding: snappy
When I spin these up together, I get "Loki: Internal Server Error. 500. too many unhealthy instances in the ring" when I test the datasource in Grafana.
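One thing I am starting to suspect while writing this up: write.replicas is 1 above, while I believe the chart defaults the ring's replication factor to 3, which would leave the write/ingester ring permanently short of healthy instances. A sketch of what I plan to try, assuming Loki 2.5 honours the common config section via structuredConfig and the chart's rendered config doesn't override it elsewhere, is either scaling write up to 3 replicas or dropping the replication factor to match the single replica:
loki:
  structuredConfig:
    common:
      # hypothetical override for a single-write dev environment; the
      # alternative is to keep the default and run write.replicas: 3
      replication_factor: 1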
Here are the Loki pod logs:
kubectl get pods -n loki
NAME                            READY   STATUS    RESTARTS   AGE
loki-gateway-5d585556bc-tl2wb   1/1     Running   0          18m
loki-read-0                     1/1     Running   0          22m
loki-write-0                    1/1     Running   0          22m
Gateway:
kubectl logs -n loki loki-gateway-5d585556bc-tl2wb
/docker-entrypoint.sh: No files found in /docker-entrypoint.d/, skipping configuration
10.244.0.1 - - [27/Jun/2022:14:45:14 +0000] 200 "GET / HTTP/1.1" 2 "-" "kube-probe/1.22" "-"
10.244.0.1 - - [27/Jun/2022:14:49:44 +0000] 200 "GET / HTTP/1.1" 2 "-" "kube-probe/1.22" "-"
2022/06/27 14:49:52 [error] 13#13: *29 open() "/etc/nginx/html/api/v1/status/buildinfo" failed (2: No such file or directory), client: 10.244.0.47, server: , request: "GET /api/v1/status/buildinfo HTTP/1.1", host: "loki-gateway.loki.svc.cluster.local:3100"
10.244.0.47 - - [27/Jun/2022:14:49:52 +0000] 404 "GET /api/v1/status/buildinfo HTTP/1.1" 154 "-" "Grafana/8.5.0" "10.244.0.14, 10.244.0.14"
10.244.0.47 - - [27/Jun/2022:14:49:52 +0000] 200 "GET /prometheus/api/v1/rules HTTP/1.1" 2882 "-" "Grafana/8.5.0" "-"
10.244.0.47 - - [27/Jun/2022:14:49:52 +0000] 400 "GET /api/prom/rules/test/test HTTP/1.1" 45 "-" "Grafana/8.5.0" "-"
10.244.0.47 - - [27/Jun/2022:14:49:52 +0000] 200 "GET /prometheus/api/v1/rules HTTP/1.1" 2882 "-" "Grafana/8.5.0" "-"
10.244.0.1 - - [27/Jun/2022:14:49:54 +0000] 200 "GET / HTTP/1.1" 2 "-" "kube-probe/1.22" "-"
10.244.0.47 - - [27/Jun/2022:14:49:58 +0000] 500 "GET /loki/api/v1/label?start=1656340798905000000 HTTP/1.1" 41 "-" "Grafana/8.5.0" "10.244.0.14, 10.244.0.14"
All of these lines are repeated several times.
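As an aside, the 404 on /api/v1/status/buildinfo looks like it happens because the gateway's default nginx config has no route for that path, so nginx tries to serve it as a static file. It seems harmless for this issue, but if anyone wants to silence it, something in the same style as the serverSnippet above might work (an untested sketch, assuming Loki serves the build info at /loki/api/v1/status/buildinfo on the read component):
gateway:
  nginxConfig:
    serverSnippet: |
      # hypothetical extra route: forward Grafana's buildinfo probe to loki-read
      location = /api/v1/status/buildinfo {
        proxy_pass http://loki-read.loki.svc.cluster.local:3100/loki/api/v1/status/buildinfo;
      }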
Write:
kubectl logs -n loki loki-write-0
level=info ts=2022-06-27T14:39:29.420766Z caller=main.go:106 msg="Starting Loki" version="(version=2.5.0, branch=HEAD, revision=2d9d0ee23)"
level=info ts=2022-06-27T14:39:29.4210416Z caller=server.go:260 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=warn ts=2022-06-27T14:39:29.4212857Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=warn ts=2022-06-27T14:39:29.421421Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=info ts=2022-06-27T14:39:29.421723Z caller=shipper_index_client.go:111 msg="starting boltdb shipper in 2 mode"
level=info ts=2022-06-27T14:39:29.4217657Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:39:29.4237756Z caller=memberlist_client.go:394 msg="Using memberlist cluster node name" name=loki-write-0-fe8583fa
level=info ts=2022-06-27T14:39:29.4272028Z caller=module_service.go:64 msg=initialising module=store
level=info ts=2022-06-27T14:39:29.4273979Z caller=module_service.go:64 msg=initialising module=server
level=info ts=2022-06-27T14:39:29.4274158Z caller=module_service.go:64 msg=initialising module=memberlist-kv
level=info ts=2022-06-27T14:39:29.427578Z caller=module_service.go:64 msg=initialising module=ring
level=info ts=2022-06-27T14:39:29.4276925Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:29.4277553Z caller=module_service.go:64 msg=initialising module=ingester
level=info ts=2022-06-27T14:39:29.4277866Z caller=ingester.go:398 msg="recovering from checkpoint"
level=info ts=2022-06-27T14:39:29.4278962Z caller=module_service.go:64 msg=initialising module=distributor
level=info ts=2022-06-27T14:39:29.4279556Z caller=ingester.go:414 msg="recovered WAL checkpoint recovery finished" elapsed=174.5µs errors=false
level=info ts=2022-06-27T14:39:29.4279841Z caller=ingester.go:420 msg="recovering from WAL"
level=info ts=2022-06-27T14:39:29.4280047Z caller=lifecycler.go:546 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2022-06-27T14:39:29.4281001Z caller=lifecycler.go:575 msg="instance not found in ring, adding with no tokens" ring=distributor
level=info ts=2022-06-27T14:39:29.4281252Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:29.4281094Z caller=ingester.go:436 msg="WAL segment recovery finished" elapsed=328.3µs errors=false
level=info ts=2022-06-27T14:39:29.4282506Z caller=ingester.go:384 msg="closing recoverer"
ts=2022-06-27T14:39:29.4282719Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:29.4283205Z caller=ingester.go:392 msg="WAL recovery finished" time=538.8µs
level=info ts=2022-06-27T14:39:29.428355Z caller=lifecycler.go:422 msg="auto-joining cluster after timeout" ring=distributor
level=info ts=2022-06-27T14:39:29.4283883Z caller=lifecycler.go:546 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2022-06-27T14:39:29.428426Z caller=loki.go:372 msg="Loki started"
level=info ts=2022-06-27T14:39:29.4284394Z caller=wal.go:156 msg=started component=wal
level=info ts=2022-06-27T14:39:29.4284446Z caller=lifecycler.go:575 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2022-06-27T14:39:29.4285516Z caller=lifecycler.go:422 msg="auto-joining cluster after timeout" ring=ingester
ts=2022-06-27T14:39:30.548897Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:39:33.4948198Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:39:40.1071028Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:39:50.9613016Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:40:12.0220944Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:40:29.4218763Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:41:08.5736488Z caller=memberlist_client.go:542 msg="joined memberlist cluster" reached_nodes=2
level=info ts=2022-06-27T14:41:29.4220935Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:42:29.4228876Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:43:29.4219004Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:44:29.422408Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:44:29.428556Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2022-06-27T14:44:29.4297238Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/var/loki/wal/checkpoint.000004
level=info ts=2022-06-27T14:44:29.4342957Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/var/loki/wal/checkpoint.000004.tmp new=/var/loki/wal/checkpoint.000004
level=info ts=2022-06-27T14:45:29.4218623Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:46:29.4225586Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:47:29.4227801Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:48:29.4219661Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:49:29.4224932Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:49:29.4291595Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2022-06-27T14:49:29.4292888Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/var/loki/wal/checkpoint.000005
level=info ts=2022-06-27T14:49:29.4346005Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/var/loki/wal/checkpoint.000005.tmp new=/var/loki/wal/checkpoint.000005
level=info ts=2022-06-27T14:50:29.4218727Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:51:29.4219922Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:52:29.4219176Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:53:29.4221026Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:54:29.4223405Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:54:29.4284884Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2022-06-27T14:54:29.428659Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/var/loki/wal/checkpoint.000006
level=info ts=2022-06-27T14:54:29.4328397Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/var/loki/wal/checkpoint.000006.tmp new=/var/loki/wal/checkpoint.000006
level=info ts=2022-06-27T14:55:29.4219931Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:56:29.422355Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:57:29.4219955Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:58:29.4226244Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:59:29.4218699Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T14:59:29.428517Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2022-06-27T14:59:29.4286635Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/var/loki/wal/checkpoint.000007
level=info ts=2022-06-27T14:59:29.4333984Z caller=checkpoint.go:502 msg="atomic checkpoint finished" old=/var/loki/wal/checkpoint.000007.tmp new=/var/loki/wal/checkpoint.000007
level=info ts=2022-06-27T15:00:29.4225507Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T15:01:29.4224497Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T15:02:29.4222936Z caller=table_manager.go:169 msg="uploading tables"
level=info ts=2022-06-27T15:03:29.4219203Z caller=table_manager.go:169 msg="uploading tables"
Despite the memberlist DNS warnings above, nslookup from inside the write pod does resolve the memberlist service:
kubectl exec -it -n loki loki-write-0 -- sh
/ $ nslookup loki-memberlist.loki.svc.cluster.local
Server: 10.96.0.10
Address: 10.96.0.10:53
Name: loki-memberlist.loki.svc.cluster.local
Address: 10.244.0.44
Name: loki-memberlist.loki.svc.cluster.local
Address: 10.244.0.45
/ $
Read:
kubectl logs -n loki loki-read-0
level=info ts=2022-06-27T14:39:28.7625301Z caller=main.go:106 msg="Starting Loki" version="(version=2.5.0, branch=HEAD, revision=2d9d0ee23)"
level=info ts=2022-06-27T14:39:28.7628752Z caller=server.go:260 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2022-06-27T14:39:28.7639932Z caller=memberlist_client.go:394 msg="Using memberlist cluster node name" name=loki-read-0-8e2ba528
level=warn ts=2022-06-27T14:39:28.7649674Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache"
level=info ts=2022-06-27T14:39:28.7652472Z caller=shipper_index_client.go:111 msg="starting boltdb shipper in 1 mode"
level=info ts=2022-06-27T14:39:28.7662616Z caller=worker.go:112 msg="Starting querier worker using query-scheduler and scheduler ring for addresses"
level=info ts=2022-06-27T14:39:28.7674493Z caller=mapper.go:46 msg="cleaning up mapped rules directory" path=/tmp/scratch
ts=2022-06-27T14:39:28.7689852Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:28.7699166Z caller=module_service.go:64 msg=initialising module=memberlist-kv
level=info ts=2022-06-27T14:39:28.769929Z caller=module_service.go:64 msg=initialising module=server
level=info ts=2022-06-27T14:39:28.7701931Z caller=module_service.go:64 msg=initialising module=compactor
level=info ts=2022-06-27T14:39:28.7702131Z caller=module_service.go:64 msg=initialising module=query-frontend-tripperware
level=info ts=2022-06-27T14:39:28.770214Z caller=module_service.go:64 msg=initialising module=ring
level=info ts=2022-06-27T14:39:28.7702001Z caller=module_service.go:64 msg=initialising module=query-scheduler
level=info ts=2022-06-27T14:39:28.7702876Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:28.7703295Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:28.7703453Z caller=basic_lifecycler.go:260 msg="instance not found in the ring" instance=loki-read-0 ring=compactor
level=info ts=2022-06-27T14:39:28.7703546Z caller=basic_lifecycler.go:260 msg="instance not found in the ring" instance=loki-read-0 ring=scheduler
level=info ts=2022-06-27T14:39:28.7703634Z caller=basic_lifecycler_delegates.go:63 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2022-06-27T14:39:28.7703672Z caller=basic_lifecycler_delegates.go:63 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2022-06-27T14:39:28.7706609Z caller=compactor.go:264 msg="waiting until compactor is JOINING in the ring"
level=info ts=2022-06-27T14:39:28.7706828Z caller=compactor.go:268 msg="compactor is JOINING in the ring"
level=info ts=2022-06-27T14:39:28.7706944Z caller=ring.go:272 msg="ring doesn't exist in KV store yet"
level=info ts=2022-06-27T14:39:28.7707242Z caller=scheduler.go:610 msg="waiting until scheduler is JOINING in the ring"
level=info ts=2022-06-27T14:39:28.7707408Z caller=module_service.go:64 msg=initialising module=ingester-querier
level=info ts=2022-06-27T14:39:28.7707431Z caller=scheduler.go:614 msg="scheduler is JOINING in the ring"
level=info ts=2022-06-27T14:39:28.770811Z caller=module_service.go:64 msg=initialising module=store
level=info ts=2022-06-27T14:39:28.7708208Z caller=module_service.go:64 msg=initialising module=ruler
level=info ts=2022-06-27T14:39:28.7708341Z caller=ruler.go:450 msg="ruler up and running"
level=info ts=2022-06-27T14:39:28.7720462Z caller=mapper.go:154 msg="updating rule file" file=/tmp/scratch/fake/loki.yaml
level=info ts=2022-06-27T14:39:28.772156Z caller=mapper.go:154 msg="updating rule file" file=/tmp/scratch/fake/srs.yaml
level=info ts=2022-06-27T14:39:29.7714738Z caller=scheduler.go:624 msg="waiting until scheduler is ACTIVE in the ring"
level=info ts=2022-06-27T14:39:29.7715699Z caller=compactor.go:278 msg="waiting until compactor is ACTIVE in the ring"
level=info ts=2022-06-27T14:39:29.7716054Z caller=compactor.go:282 msg="compactor is ACTIVE in the ring"
level=info ts=2022-06-27T14:39:29.8847963Z caller=scheduler.go:628 msg="scheduler is ACTIVE in the ring"
level=info ts=2022-06-27T14:39:29.8848902Z caller=module_service.go:64 msg=initialising module=querier
level=info ts=2022-06-27T14:39:29.8849187Z caller=module_service.go:64 msg=initialising module=query-frontend
level=info ts=2022-06-27T14:39:29.8850374Z caller=loki.go:372 msg="Loki started"
ts=2022-06-27T14:39:30.1704537Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
ts=2022-06-27T14:39:32.1798807Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:32.8853412Z caller=scheduler.go:661 msg="this scheduler is in the ReplicationSet, will now accept requests."
level=info ts=2022-06-27T14:39:32.8853813Z caller=worker.go:209 msg="adding connection" addr=10.244.0.44:9095
level=info ts=2022-06-27T14:39:34.7725692Z caller=compactor.go:324 msg="this instance has been chosen to run the compactor, starting compactor"
level=info ts=2022-06-27T14:39:34.7726541Z caller=compactor.go:351 msg="waiting 10m0s for ring to stay stable and previous compactions to finish before starting compactor"
ts=2022-06-27T14:39:38.8869391Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:39.8852869Z caller=frontend_scheduler_worker.go:101 msg="adding connection to scheduler" addr=10.244.0.44:9095
ts=2022-06-27T14:39:47.2264647Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:39:49.3050763Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))" query_type=metric range_type=instant length=0s step=0s duration=152.4µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:39:49.3051379Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceMosScoreLessThan4\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))\n" err="empty ring"
level=info ts=2022-06-27T14:39:49.3052792Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_PacketLossPercentage>0[10s]))" query_type=metric range_type=instant length=0s step=0s duration=49.5µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:39:49.3053241Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServicePacketLossPercentageGreaterThan0\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_PacketLossPercentage>0[10s]))\n" err="empty ring"
level=info ts=2022-06-27T14:39:49.3055688Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(count_over_time({app=\"acp-service-srs\"} |= \"Error\" != \"Warning\"[1m]) > 1)" query_type=metric range_type=instant length=0s step=0s duration=133.1µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:39:49.3056162Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceErrorsAndWarningsCount\nexpr: (count_over_time({app=\"acp-service-srs\"} |= \"Error\" != \"Warning\"[1m]) > 1)\nfor: 10s\n" err="empty ring"
ts=2022-06-27T14:40:03.3543264Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve loki-memberlist.loki.svc.cluster.local: lookup loki-memberlist.loki.svc.cluster.local on 10.96.0.10:53: no such host"
level=info ts=2022-06-27T14:40:21.7312888Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(count_over_time({app=~\"prometheus\"}[1m]) > 0)" query_type=metric range_type=instant length=0s step=0s duration=125.499µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:21.7313421Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: LokiAlwaysFailingTest\nexpr: (count_over_time({app=~\"prometheus\"}[1m]) > 0)\nlabels:\n severity: critical\nannotations:\n message: |\n This alert should always be firing in dev\n" err="empty ring"
level=info ts=2022-06-27T14:40:21.7315089Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(absent_over_time({cluster=~\".+\"}[1m]) == 1)" query_type=metric range_type=instant length=0s step=0s duration=53.9µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:21.7315527Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: LokiNoLogsFoundForAnyCluster\nexpr: (absent_over_time({cluster=~\".+\"}[1m]) == 1)\nfor: 5m\nlabels:\n severity: critical\nannotations:\n message: |\n Loki is reporting no logs received from any cluster in the last 5 minutes\n" err="empty ring"
level=info ts=2022-06-27T14:40:21.7317404Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(rate({cluster=~\".+\", job=\"kube-prometheus-stack/thanos\"} |= \"update of node failed\"[1m]) > 0.01)" query_type=metric range_type=instant length=0s step=0s duration=91.2µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:21.7317749Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: ThanosQueryNoData\nexpr: (rate({cluster=~\".+\", job=\"kube-prometheus-stack/thanos\"} |= \"update of node\n failed\"[1m]) > 0.01)\nlabels:\n severity: critical\nannotations:\n message: |\n Thanos has failed to update nodes/collect logs within the last minute\n" err="empty ring"
level=info ts=2022-06-27T14:40:49.3050531Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))" query_type=metric range_type=instant length=0s step=0s duration=126.699µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:49.3051283Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceMosScoreLessThan4\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))\n" err="empty ring"
level=info ts=2022-06-27T14:40:49.3053076Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_PacketLossPercentage>0[10s]))" query_type=metric range_type=instant length=0s step=0s duration=64.6µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:49.3053434Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServicePacketLossPercentageGreaterThan0\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_PacketLossPercentage>0[10s]))\n" err="empty ring"
level=info ts=2022-06-27T14:40:49.3054569Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(count_over_time({app=\"acp-service-srs\"} |= \"Error\" != \"Warning\"[1m]) > 1)" query_type=metric range_type=instant length=0s step=0s duration=44.9µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:40:49.3054907Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceErrorsAndWarningsCount\nexpr: (count_over_time({app=\"acp-service-srs\"} |= \"Error\" != \"Warning\"[1m]) > 1)\nfor: 10s\n" err="empty ring"
level=info ts=2022-06-27T14:40:56.6814446Z caller=memberlist_client.go:542 msg="joined memberlist cluster" reached_nodes=2
level=info ts=2022-06-27T14:41:21.7320922Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(count_over_time({app=~\"prometheus\"}[1m]) > 0)" query_type=metric range_type=instant length=0s step=0s duration=250.999µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:41:21.732152Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: LokiAlwaysFailingTest\nexpr: (count_over_time({app=~\"prometheus\"}[1m]) > 0)\nlabels:\n severity: critical\nannotations:\n message: |\n This alert should always be firing in dev\n" err="too many unhealthy instances in the ring"
level=info ts=2022-06-27T14:41:21.7324097Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(absent_over_time({cluster=~\".+\"}[1m]) == 1)" query_type=metric range_type=instant length=0s step=0s duration=72.5µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:41:21.7324569Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: LokiNoLogsFoundForAnyCluster\nexpr: (absent_over_time({cluster=~\".+\"}[1m]) == 1)\nfor: 5m\nlabels:\n severity: critical\nannotations:\n message: |\n Loki is reporting no logs received from any cluster in the last 5 minutes\n" err="too many unhealthy instances in the ring"
level=info ts=2022-06-27T14:41:21.7326536Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="(rate({cluster=~\".+\", job=\"kube-prometheus-stack/thanos\"} |= \"update of node failed\"[1m]) > 0.01)" query_type=metric range_type=instant length=0s step=0s duration=84.6µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:41:21.732693Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: ThanosQueryNoData\nexpr: (rate({cluster=~\".+\", job=\"kube-prometheus-stack/thanos\"} |= \"update of node\n failed\"[1m]) > 0.01)\nlabels:\n severity: critical\nannotations:\n message: |\n Thanos has failed to update nodes/collect logs within the last minute\n" err="too many unhealthy instances in the ring"
level=info ts=2022-06-27T14:41:49.3056573Z caller=metrics.go:122 component=ruler org_id=fake latency=fast query="avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))" query_type=metric range_type=instant length=0s step=0s duration=182.8µs status=500 limit=0 returned_lines=0 throughput=0B total_bytes=0B queue_time=0s subqueries=1
level=warn ts=2022-06-27T14:41:49.3057163Z caller=manager.go:610 user=fake group=Loki msg="Evaluating rule failed" rule="alert: SimpleRecordingServiceMosScoreLessThan4\nexpr: avg by(cluster)(rate({app=\"acp-service-srs\"} |= \"mos\" | json | State_mos<4[10s]))\n" err="too many unhealthy instances in the ring"
Here is the list of services Loki has:
kubectl get services -n loki
NAME                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
loki-gateway          ClusterIP   10.96.78.215   <none>        3100/TCP            35m
loki-memberlist       ClusterIP   None           <none>        7946/TCP            35m
loki-read             ClusterIP   10.96.20.165   <none>        3100/TCP,9095/TCP   35m
loki-read-headless    ClusterIP   None           <none>        3100/TCP,9095/TCP   35m
loki-write            ClusterIP   10.96.52.84    <none>        3100/TCP,9095/TCP   35m
loki-write-headless   ClusterIP   None           <none>        3100/TCP,9095/TCP   35m
It doesn't look like the ruler ring is configured correctly. Could you include the actual rendered config from the ConfigMap deployed by the Helm chart?
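For reference, with memberlist as the KV store I would expect the rendered config to show the ruler ring pointing at memberlist too, roughly along these lines (a sketch of the expected shape, not necessarily the chart's exact output):
ruler:
  ring:
    kvstore:
      store: memberlist
You can also sanity-check the rings directly via the /ring and /ruler/ring pages on the write and read pods.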
Hi, any updates?
I am getting the same error with loki-simple-scalable 1.4.1 and object storage (not AWS, but similar) as the backend.
I am not using the loki-gateway; Grafana connects directly to the loki-read service. What is strange in my case is that Grafana is initially able to connect to Loki, but after some time it no longer can, so the error does not occur immediately.
As I only have a small environment, I am running 3 replicas of loki-write and 1 replica of loki-read.
Grafana is logging (when testing the datasource):
logger=context traceID=00000000000000000000000000000000 userId=1 orgId=1 uname=admin t=2022-09-29T13:28:21.723434589Z level=error msg="Failed to call resource" error="too many unhealthy instances in the ring\n" traceID=00000000000000000000000000000000
logger=context traceID=00000000000000000000000000000000 userId=1 orgId=1 uname=admin t=2022-09-29T13:28:21.723527353Z level=error msg="Request Completed" method=GET path=/api/datasources/6/resources/labels status=500 remote_addr=83.135.39.98 time_ms=22 duration=22.424318ms size=83 referer=https://grafana.xxxxxx/datasources/edit/loki-k09 traceID=00000000000000000000000000000000 handler=/api/datasources/:id/resources/*
loki-read is logging:
level=warn ts=2022-09-28T20:43:09.587511414Z caller=pool.go:184 msg="removing ingester failing healthcheck" addr=172.25.1.69:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
loki-write is logging:
level=warn ts=2022-09-29T13:41:22.082927033Z caller=logging.go:72 traceID=7b27e088009f9574 orgID=fake msg="POST /loki/api/v1/push (500) 6.437119ms Response: \"at least 2 live replicas required, could only find 1 - unhealthy instances: 172.25.1.69:9095,172.25.2.146:9095\\n\" ws: false; Content-Length: 145358; Content-Type: application/x-protobuf; User-Agent: promtail/2.5.0;"
I am going to delete the Loki instance and deploy a new one based on chart version 1.8.11 to see if that is any better.
EDIT: it is actually working now; I will keep an eye on it and report back.
I tried running one replica each for read/write on the latest chart version and got the same error. I'm new to Loki, so I don't understand it very well yet.
I got similar error messages and had to adjust
commonConfig:
  replication_factor: 1
to match my number of instances.
I originally changed it from the default of 3 down to 2 (because I only have two nodes and a replication factor of 3 was too much), and then to 1, because after switching the storage type from s3 (the default) to filesystem everything changed (the read, write and gateway components were gone).
Similar problem here; this is what worked for me, copied from https://github.com/grafana/loki/issues/10537#issuecomment-1759899640:
In the Helm chart's values.yaml I added:
loki:
  commonConfig:
    # set to 1, otherwise more replicas are needed to connect to Grafana
    replication_factor: 1
And then I was able to set the rest to 1:
write:
  replicas: 1
  persistence:
    storageClass: gp2
read:
  replicas: 1
  persistence:
    storageClass: gp2
backend:
  replicas: 1
  persistence:
    storageClass: gp2
@Ca-moes According to the documentation, a replication_factor of 1 is for monolithic mode. I'm currently using the simple scalable mode, so the replication factor must be greater than 1.
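If lowering the replication factor is not an option in simple scalable mode, the usual alternative seems to be scaling each component up to at least the replication factor instead, e.g. something like this with the same chart layout as the values above (a sketch, not tested here):
loki:
  commonConfig:
    replication_factor: 3 # chart default; needs at least this many write/backend replicas
write:
  replicas: 3
read:
  replicas: 3
backend:
  replicas: 3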