failed to record query metrics
Describe the bug
Found this in my logs today:
level=error ts=2022-05-23T22:34:03.624146813Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-05-23T22:34:03.624354084Z caller=stats.go:57 component=frontend msg="failed to record query metrics" err="expected one of the *LokiRequest, *LokiInstantRequest, *LokiSeriesRequest, *LokiLabelNamesRequest, got "
To Reproduce
I'm not entirely sure, but I suspect it occurs when a query times out. We saw a peak in queries timing out at the same time and at roughly the same rate.
Note
If you give me some hints I might be able to investigate this myself.
Hello,
I have the same problem when I add Loki as a data source in Grafana. What is the main cause of this problem, please?
Hi, I also ran into this problem:
level=error ts=2022-08-08T10:59:52.714885545Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-08-08T10:59:52.715177492Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-08-08T10:59:52.715565353Z caller=stats.go:57 component=frontend msg="failed to record query metrics" err="expected one of the *LokiRequest, *LokiInstantRequest, *LokiSeriesRequest, *LokiLabelNamesRequest, got "
Hello, the problem is still relevant. I got these logs from the Loki container when I try to add Loki as a Grafana data source:
ts=2022-10-10T14:20:59.979660381Z caller=spanlogger.go:80 user=fake method=query.Label level=info org_id=fake latency=fast query_type=labels length=10m0s duration=1.053612ms status=200 label= throughput=0B total_bytes=0B total_entries=3
level=warn ts=2022-10-10T14:21:01.172823779Z caller=pool.go:184 msg="removing frontend failing healthcheck" addr=10.0.0.237:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=error ts=2022-10-10T14:21:01.173065496Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = grpc: the client connection is closing" frontend=10.0.0.237:9095
level=error ts=2022-10-10T14:21:29.977344389Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-10-10T14:21:29.97755391Z caller=stats.go:57 component=frontend msg="failed to record query metrics" err="expected one of the *LokiRequest, *LokiInstantRequest, *LokiSeriesRequest, *LokiLabelNamesRequest, got "
In the Grafana UI I get this message:
Unable to fetch labels from Loki (Failed to call resource), please check the server logs for more details
+1
I'm experiencing the same issue running Loki and Grafana in Kubernetes using the official Helm charts. In case it helps, the versions I'm running are:
Loki - 2.6.1
Loki chart - 3.2.1
Grafana - 9.2.0
Grafana chart - 6.42.2
Grafana's logs:
logger=sqlstore.transactions t=2022-10-24T13:29:18.472476894Z level=info msg="Database locked, sleeping then retrying" error="database is locked" retry=0
logger=context userId=1 orgId=1 uname=root t=2022-10-24T13:29:19.991946061Z level=error msg="Failed to call resource" error="too many unhealthy instances in the ring\n" traceID=
logger=context userId=1 orgId=1 uname=root t=2022-10-24T13:29:19.99206549Z level=error msg="Request Completed" method=GET path=/api/datasources/17/resources/labels status=500 remote_addr=10.68.70.1 time_ms=343 duration=343.734556ms size=51 referer=https://redacted.com/datasources/edit/vuDSphI4z handler=/api/datasources/:id/resources/*
Loki's logs: (read replica)
level=info ts=2022-10-24T13:26:50.473821988Z caller=table_manager.go:252 msg="query readiness setup completed" duration=3.636µs distinct_users_len=0
level=error ts=2022-10-24T13:28:37.006031591Z caller=retry.go:73 org_id=1 msg="error processing request" try=0 err="context canceled"
level=error ts=2022-10-24T13:28:37.02312194Z caller=stats.go:57 component=frontend msg="failed to record query metrics" err="expected one of the *LokiRequest, *LokiInstantRequest, *LokiSeriesRequest, *LokiLabelNamesRequest, got "
ts=2022-10-24T13:29:19.979549206Z caller=spanlogger.go:80 user=1 method=query.Label level=info org_id=1 latency=fast query_type=labels length=10m0s duration=33.476µs status=500 label= throughput=0B total_bytes=0B total_entries=0
level=info ts=2022-10-24T13:31:50.473071427Z caller=table_manager.go:213 msg="syncing tables"
Also, the Loki Grafana agent/operator's logs, which honestly I do not understand one bit, but they might help. It's the same message, "reconciling node", in a loop.
level=info ts=2022-10-24T13:38:33.923185679Z component=controller.node name=gke-prod-cluster-prod-pool-regular-0-678744de-ukun namespace= msg="reconciling node"
Did you run any node upgrades? Anyway, check the /ring and /distributor/ring pages on Loki and click to forget all unhealthy ring members.
My bad here, but I never got to set up those pages you're describing. I'm pretty sure I'm missing some configuration, since accessing my Loki gateway's /ring or /distributor/ring returns 404s.
I've seen comments about "forgetting" unhealthy ring members, but couldn't find out how to do so. Could you point me in the right direction?
These pages are exposed by the write replicas; maybe you're trying the read ones.
In case you're using Kubernetes, use port-forward to access them; I'm not sure the nginx/gateway exposes them the same way.
Yes, I'm running it in Kubernetes.
I will try the port-forward approach first, thank you! :)
I have this problem in Swarm. The most interesting thing is that sometimes when I deploy the stack it doesn't occur, but most of the time it does.
Okay, this made me feel dumb.
In case anyone is wondering, or needs to know in the future: the "too many unhealthy instances in the ring" log will also show up if you're running Loki in its scalable read/write mode, which is the default behavior, but fixed at one replica. This caused errors whether it was the read or the write StatefulSet that was set to 1 replica.
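In short, the relevant change was bumping both replica counts. A minimal sketch of just that part of the chart values, taken from the full values.yaml that follows below:
write:
  replicas: 2
read:
  replicas: 2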
After trying the port-forward and verifying that, as I thought, no unhealthy replicas were actually present, I tried redeploying with the following configuration in the Helm values.yaml for Loki:
# --- Autoscaling read/write configurations
write:
  replicas: 2 # updated - previously 1
  # Mount Cloud Storage credentials Secret
  extraEnv:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: "/etc/loki_secrets/loki_backend.json"
  extraVolumeMounts:
    - name: loki-secrets
      mountPath: "/etc/loki_secrets"
  extraVolumes:
    - name: loki-secrets
      secret:
        secretName: loki-bucket-access
        items:
          - key: loki_backend.json
            path: loki_backend.json
  resources:
    requests:
      cpu: 50m
      memory: 100Mi
    limits:
      cpu: 300m
      memory: 500Mi
  persistence:
    size: 10Gi
    storageClass: elastic

read:
  replicas: 2
  # Mount Cloud Storage credentials Secret
  extraEnv:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: "/etc/loki_secrets/loki_backend.json"
  extraVolumeMounts:
    - name: loki-secrets
      mountPath: "/etc/loki_secrets"
  extraVolumes:
    - name: loki-secrets
      secret:
        secretName: loki-bucket-access
        items:
          - key: loki_backend.json
            path: loki_backend.json
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 3
    targetCPUUtilizationPercentage: 80
    targetMemoryUtilizationPercentage: 70
  resources:
    requests:
      cpu: 50m
      memory: 100Mi
    limits:
      cpu: 250m
      memory: 1Gi
  persistence:
    size: 10Gi
    storageClass: elastic
  affinity: ""
I will leave the full configuration snippet (that is, for the read & write objects), since most of it was hard to find and/or understand, at least for me. Especially the GCS configuration.
Perhaps the "too many unhealthy instances in the ring" message is too broad. It would be more helpful if it were more specific. Just a thought.
Same with grafana/loki:2.6.1 and grafana/grafana:9.1.6-ubuntu for me right now - single instance of Loki, ring/distributor ring both ACTIVE. Everything used to work before upgrading from 2.3 to 2.6.1; the issue seems related to https://github.com/grafana/loki/issues/4756.
component=frontend msg="failed to record query metrics" err="expected one of the *LokiRequest, *LokiInstantRequest, *LokiSeriesRequest, *LokiLabelNamesRequest, got "
21:05:51.238169 IP (tos 0x0, ttl 64, id 2711, offset 0, flags [DF], proto TCP (6), length 210)
ip-172-18-0-2.ap-northeast-1.compute.internal.35384 > ip-172-18-0-5.ap-northeast-1.compute.internal.exlm-agent: Flags [P.], cksum 0x58f0 (incorrect -> 0x6efe), seq 0:158, ack 1, win 270, options [nop,nop,TS val 345130930 ecr 765894484], length 158
GET /loki/loki/api/v1/labels?start=1667336150463000000&end=1667336750463000000 HTTP/1.1
Host: loki:3002
User-Agent: Grafana/9.1.6
Accept-Encoding: gzip
auth_enabled: false

server:
  http_listen_port: 3002
  grpc_listen_port: 9096
  http_path_prefix: /loki
  log_level: warn

distributor:
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 1970-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
    - from: 2022-09-30
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: index_v12_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /data/index
    cache_location: /data/index_cache
    resync_interval: 5s
    shared_store: s3
  aws:
    bucketnames: loki
    access_key_id: minioadmin
    secret_access_key: minioadmin
    endpoint: minio:9000
    insecure: true
    s3forcepathstyle: true

ruler:
  enable_api: true
  alertmanager_url: http://mimir:8080/mimir/alertmanager
  rule_path: /data/ruler
  wal:
    dir: /data/ruler-wal
  storage:
    type: s3
    s3:
      bucketnames: loki-ruler
      access_key_id: minioadmin
      secret_access_key: minioadmin
      endpoint: minio:9000
      insecure: true
      s3forcepathstyle: true
  ring:
    kvstore:
      store: inmemory

ingester:
  max_transfer_retries: 0
  wal:
    enabled: true
    dir: /data/ingester-wal
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

compactor:
  working_directory: /data/compactor
  shared_store: s3
  retention_enabled: true
  deletion_mode: filter-and-delete

limits_config:
  retention_period: 744h
  allow_deletes: true
Guys, try adding
frontend:
  instance_addr: 127.0.0.1
to the Loki config. Found that in multiple issues here.
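For reference, a hedged sketch of where that setting would sit alongside the config posted above (the port numbers come from that config; 127.0.0.1 assumes the query frontend and its workers run in the same process):
server:
  http_listen_port: 3002
  grpc_listen_port: 9096

# Suggested addition: pin the address the frontend advertises so the
# query workers can reach it instead of relying on interface auto-detection.
frontend:
  instance_addr: 127.0.0.1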
@QuentinM-Hilbtec I could be mistaken, but as I understand it, if you're using read and write instead of singleBinary, Loki will return "too many unhealthy replicas" if either the read or the write ReplicaSet is set to one replica. Maybe try setting replicas: 2 to test it?
Regarding the "too many unhealthy instances in the ring" error message, rather than creating additional replicas, another solution is setting the "replication_factor" property to 1. Not saying this is recommended, especially for production, but it works to get past this error.
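For illustration, a minimal hedged sketch of that workaround in raw Loki config; the common block is standard Loki configuration, while the Helm chart key mentioned in the comments is an assumption that may differ between chart versions:
# Lower the replication factor so a single healthy instance is enough
# for the ring health checks to pass (not recommended for production).
common:
  replication_factor: 1
# In the grafana/loki Helm chart this is usually exposed as
# loki.commonConfig.replication_factor in values.yaml - verify the exact
# key for your chart version.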