failed to record query metrics

splitice opened this issue 3 years ago

Describe the bug

Found this in my logs today:

level=error ts=2022-05-23T22:34:03.624146813Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-05-23T22:34:03.624354084Z caller=stats.go:57 component=frontend msg="failed to record query metrics" err="expected one of the *LokiRequest, *LokiInstantRequest, *LokiSeriesRequest, *LokiLabelNamesRequest, got "

To Reproduce

I'm not entirely sure, but I suspect it occurs when a query times out. We saw a peak in query timeouts at the same time, and at roughly the same rate, as these errors.

Note

If you give me some hints I might be able to investigate this myself.

splitice avatar May 24 '22 00:05 splitice

Hello,

I have the same problem when I add Loki as a data source in Grafana. What is the main cause of this problem, please?

nassissme avatar Jul 29 '22 13:07 nassissme

Hi, I'm also facing this problem:


level=error ts=2022-08-08T10:59:52.714885545Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-08-08T10:59:52.715177492Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-08-08T10:59:52.715565353Z caller=stats.go:57 component=frontend msg="failed to record query metrics" err="expected one of the *LokiRequest, *LokiInstantRequest, *LokiSeriesRequest, *LokiLabelNamesRequest, got "

jackmachine3d avatar Aug 08 '22 11:08 jackmachine3d

Hello, the problem is still relevant. I get these logs from the Loki container when I try to add Loki as a Grafana data source:

ts=2022-10-10T14:20:59.979660381Z caller=spanlogger.go:80 user=fake method=query.Label level=info org_id=fake latency=fast query_type=labels length=10m0s duration=1.053612ms status=200 label= throughput=0B total_bytes=0B total_entries=3
level=warn ts=2022-10-10T14:21:01.172823779Z caller=pool.go:184 msg="removing frontend failing healthcheck" addr=10.0.0.237:9095 reason="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
level=error ts=2022-10-10T14:21:01.173065496Z caller=scheduler_processor.go:182 org_id=fake msg="error notifying frontend about finished query" err="rpc error: code = Canceled desc = grpc: the client connection is closing" frontend=10.0.0.237:9095
level=error ts=2022-10-10T14:21:29.977344389Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="context canceled"
level=error ts=2022-10-10T14:21:29.97755391Z caller=stats.go:57 component=frontend msg="failed to record query metrics" err="expected one of the *LokiRequest, *LokiInstantRequest, *LokiSeriesRequest, *LokiLabelNamesRequest, got "

In the Grafana UI I get this message:

Unable to fetch labels from Loki (Failed to call resource), please check the server logs for more details

klimonoid avatar Oct 10 '22 14:10 klimonoid

+1

I'm experiencing the same issue running Loki and Grafana in Kubernetes using the official Helm charts. In case it helps, the versions I'm running are:

Loki: 2.6.1
Loki chart: 3.2.1
Grafana: 9.2.0
Grafana chart: 6.42.2

Grafana's logs:

logger=sqlstore.transactions t=2022-10-24T13:29:18.472476894Z level=info msg="Database locked, sleeping then retrying" error="database is locked" retry=0
logger=context userId=1 orgId=1 uname=root t=2022-10-24T13:29:19.991946061Z level=error msg="Failed to call resource" error="too many unhealthy instances in the ring\n" traceID=
logger=context userId=1 orgId=1 uname=root t=2022-10-24T13:29:19.99206549Z level=error msg="Request Completed" method=GET path=/api/datasources/17/resources/labels status=500 remote_addr=10.68.70.1 time_ms=343 duration=343.734556ms size=51 referer=https://redacted.com/datasources/edit/vuDSphI4z handler=/api/datasources/:id/resources/*

Loki's logs: (read replica)

level=info ts=2022-10-24T13:26:50.473821988Z caller=table_manager.go:252 msg="query readiness setup completed" duration=3.636µs distinct_users_len=0
level=error ts=2022-10-24T13:28:37.006031591Z caller=retry.go:73 org_id=1 msg="error processing request" try=0 err="context canceled"
level=error ts=2022-10-24T13:28:37.02312194Z caller=stats.go:57 component=frontend msg="failed to record query metrics" err="expected one of the *LokiRequest, *LokiInstantRequest, *LokiSeriesRequest, *LokiLabelNamesRequest, got "
ts=2022-10-24T13:29:19.979549206Z caller=spanlogger.go:80 user=1 method=query.Label level=info org_id=1 latency=fast query_type=labels length=10m0s duration=33.476µs status=500 label= throughput=0B total_bytes=0B total_entries=0
level=info ts=2022-10-24T13:31:50.473071427Z caller=table_manager.go:213 msg="syncing tables"

Also, here are the Grafana Agent/operator's logs, which honestly I don't understand one bit, but they might help. It's the same message, "reconciling node", on a loop.

level=info ts=2022-10-24T13:38:33.923185679Z component=controller.node name=gke-prod-cluster-prod-pool-regular-0-678744de-ukun namespace= msg="reconciling node"

jmuleiro avatar Oct 24 '22 13:10 jmuleiro

Did you run any node upgrades? Anyway, check the /ring and /distributor/ring pages on Loki and click to forget all unhealthy ring members.

DylanGuedes avatar Oct 24 '22 13:10 DylanGuedes

My bad here, but I never got to set up those pages you're describing. I'm pretty sure I'm missing some configuration, since accessing /ring or /distributor/ring on my Loki gateway returns 404s.

I've seen comments about "forgetting" unhealthy ring members, but couldn't find out how to do so. Could you point me in the right direction?

jmuleiro avatar Oct 24 '22 13:10 jmuleiro

These pages are exposed by the write replicas; maybe you're trying the read ones.

In case you're using Kubernetes, use port-forward to access them; I'm not sure the nginx/gateway exposes them the same way.
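
A minimal sketch of that port-forward approach, assuming the write pods come from a StatefulSet named loki-write in the loki namespace and serve HTTP on port 3100 (adjust the names, namespace, and port to your release):

# forward the write replica's HTTP port to localhost
kubectl -n loki port-forward statefulset/loki-write 3100:3100

# then open the ring pages from another terminal (or a browser)
curl http://localhost:3100/ring
curl http://localhost:3100/distributor/ring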

DylanGuedes avatar Oct 24 '22 13:10 DylanGuedes

Yes, I'm running it in Kubernetes.

I will try the port-forward approach first, thank you! :)

jmuleiro avatar Oct 24 '22 14:10 jmuleiro

I have this problem in Swarm. The most interesting thing is that sometimes when I deploy the stack it doesn't occur, but most of the time it does.

klimonoid avatar Oct 24 '22 15:10 klimonoid

Okay, this made me feel dumb.

In case anyone is wondering, or needs to know in the future: the "too many unhealthy instances in the ring" log will also show up if you're running Loki in the scalable read/write mode (the chart's default) but pinned to a single replica. In my case this caused errors when either the read or the write StatefulSet was set to 1 replica.

After trying the port-forward and verifying that, as I thought, no unhealthy replicas were actually present, I tried redeploying with the following configuration in the Helm values.yaml for Loki:

# --- Autoscaling read/write configurations
write:
  replicas: 2 # updated - previously 1
  # Mount Cloud Storage credentials Secret
  extraEnv:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: "/etc/loki_secrets/loki_backend.json"
  extraVolumeMounts:
    - name: loki-secrets
      mountPath: "/etc/loki_secrets"
  extraVolumes:
    - name: loki-secrets
      secret:
        secretName: loki-bucket-access
        items:
          - key: loki_backend.json
            path: loki_backend.json
  resources:
    requests:
      cpu: 50m
      memory: 100Mi
    limits:
      cpu: 300m
      memory: 500Mi
  persistence:
    size: 10Gi
    storageClass: elastic

read:
  replicas: 2
  # Mount Cloud Storage credentials Secret
  extraEnv:
    - name: GOOGLE_APPLICATION_CREDENTIALS
      value: "/etc/loki_secrets/loki_backend.json"
  extraVolumeMounts:
    - name: loki-secrets
      mountPath: "/etc/loki_secrets"
  extraVolumes:
    - name: loki-secrets
      secret:
        secretName: loki-bucket-access
        items:
          - key: loki_backend.json
            path: loki_backend.json
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 3
    targetCPUUtilizationPercentage: 80
    targetMemoryUtilizationPercentage: 70
  resources:
    requests:
      cpu: 50m
      memory: 100Mi
    limits:
      cpu: 250m
      memory: 1Gi
  persistence:
    size: 10Gi
    storageClass: elastic
  affinity: ""

I'll leave the full configuration snippet (that is, for the read and write objects) here, since most of it was hard to find and/or understand, at least for me. Especially the GCS configuration.

Perhaps the "too many unhealthy instances in the ring" message is too broad. It would be more helpful if it was specific. Just a thought.

jmuleiro avatar Oct 24 '22 15:10 jmuleiro

Same here with grafana/loki:2.6.1 and grafana/grafana:9.1.6-ubuntu right now: a single instance of Loki, with both the ring and the distributor ring showing ACTIVE. Everything used to work before upgrading from 2.3 to 2.6.1; the issue seems related to https://github.com/grafana/loki/issues/4756.

component=frontend msg="failed to record query metrics" err="expected one of the *LokiRequest, *LokiInstantRequest, *LokiSeriesRequest, *LokiLabelNamesRequest, got "
21:05:51.238169 IP (tos 0x0, ttl 64, id 2711, offset 0, flags [DF], proto TCP (6), length 210)
    ip-172-18-0-2.ap-northeast-1.compute.internal.35384 > ip-172-18-0-5.ap-northeast-1.compute.internal.exlm-agent: Flags [P.], cksum 0x58f0 (incorrect -> 0x6efe), seq 0:158, ack 1, win 270, options [nop,nop,TS val 345130930 ecr 765894484], length 158
GET /loki/loki/api/v1/labels?start=1667336150463000000&end=1667336750463000000 HTTP/1.1
Host: loki:3002
User-Agent: Grafana/9.1.6
Accept-Encoding: gzip

auth_enabled: false

server:
  http_listen_port: 3002
  grpc_listen_port: 9096
  http_path_prefix: /loki
  log_level: warn

distributor:
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 1970-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h
    - from: 2022-09-30
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: index_v12_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /data/index
    cache_location: /data/index_cache
    resync_interval: 5s
    shared_store: s3
  aws:
    bucketnames: loki
    access_key_id: minioadmin
    secret_access_key: minioadmin
    endpoint: minio:9000
    insecure: true
    s3forcepathstyle: true

ruler:
  enable_api: true
  alertmanager_url: http://mimir:8080/mimir/alertmanager
  rule_path: /data/ruler
  wal:
    dir: /data/ruler-wal
  storage:
    type: s3
    s3:
      bucketnames: loki-ruler
      access_key_id: minioadmin
      secret_access_key: minioadmin
      endpoint: minio:9000
      insecure: true
      s3forcepathstyle: true
  ring:
    kvstore:
      store: inmemory

ingester:
  max_transfer_retries: 0
  wal:
    enabled: true
    dir: /data/ingester-wal
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

compactor:
  working_directory: /data/compactor
  shared_store: s3
  retention_enabled: true
  deletion_mode: filter-and-delete

limits_config:
  retention_period: 744h
  allow_deletes: true

Q-Hilbtec avatar Nov 01 '22 20:11 Q-Hilbtec

Guys try to add

frontend:
  instance_addr: 127.0.0.1

to the Loki config. I found that suggestion in multiple issues here.

klimonoid avatar Nov 01 '22 20:11 klimonoid

@QuentinM-Hilbtec I could be mistaken, but as I understand it, if you're using the read and write targets instead of singleBinary, Loki will return "too many unhealthy instances in the ring" if either the read or write StatefulSet is set to one replica. Maybe try setting replicas: 2 to test it? A minimal sketch of that change follows below.
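
For reference, a minimal sketch of that change in the chart's values.yaml, reusing the write/read keys from the snippet shared earlier in this thread:

write:
  replicas: 2
read:
  replicas: 2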

jmuleiro avatar Nov 01 '22 20:11 jmuleiro

Regarding the "too many unhealthy instances in the ring" error message, rather than creating additional replicas, another solution is "replication_factor" property to 1. Not saying this is recommended, especially for production, but this works to get past this error.

jantoine1 avatar May 16 '23 17:05 jantoine1