
``"/schedulerpb.SchedulerForQuerier/QuerierLoop`` route latency

Open · LinTechSo opened this issue 2 years ago · 7 comments

Describe the bug
Loki version 2.6.0, Kubernetes 1.20, single binary deployment. I'm monitoring Loki's latencies with this PromQL:

    histogram_quantile(0.95, sum(rate(loki_request_duration_seconds_bucket{instance="$instance", cluster="$cluster", namespace="$namespace"}[5m])) by (le,route))

Most of the time I get 2 min latency on the `/schedulerpb.SchedulerForQuerier/QuerierLoop` route. I don't know which configuration relates to this route; I have tried many configs and read many issues, but nothing helps.
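For reference, the same histogram narrowed to that single route looks like this; a sketch that simply reuses the label selectors from the query above:

    histogram_quantile(0.95,
      sum(rate(loki_request_duration_seconds_bucket{instance="$instance", cluster="$cluster", namespace="$namespace", route="/schedulerpb.SchedulerForQuerier/QuerierLoop"}[5m])) by (le, route)
    )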

My Loki configuration:

    auth_enabled: true
    chunk_store_config:
      max_look_back_period: 0s
    compactor:
      compaction_interval: 10m
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
      retention_enabled: true
      shared_store: s3
      working_directory: /data/loki/boltdb-shipper-compactor
    frontend:
      compress_responses: true
      max_outstanding_per_tenant: 4096
    ingester:
      chunk_block_size: 262144
      chunk_idle_period: 3m
      chunk_retain_period: 1m
      flush_op_timeout: 10m
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
      max_transfer_retries: 0
      wal:
        dir: /data/loki/wal
    limits_config:
      enforce_metric_name: false
      max_query_series: 100000
      max_query_parallelism: 96
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      retention_period: 24h
    querier:
      max_concurrent: 20
      query_timeout: 5m
      engine:
        timeout: 5m
    query_range:
      align_queries_with_step: true
    query_scheduler:
      max_outstanding_requests_per_tenant: 4096
      grpc_client_config:
        max_recv_msg_size: 184857600
        max_send_msg_size: 184857600
    runtime_config:
      file: /opt/overrides.yaml
      period: 20s
    schema_config:
      configs:
      - from: "2020-10-24"
        index:
          period: 24h
          prefix: index_
        object_store: s3
        schema: v11
        store: boltdb-shipper
    server:
      http_server_read_timeout: 300s
      http_server_write_timeout: 300s
      http_listen_port: 3100
      grpc_server_max_recv_msg_size: 7894704
      grpc_server_max_send_msg_size: 7894704
    storage_config:
      aws:
        access_key_id: yyyyyyy
        bucketnames: loki
        endpoint: loki-minio.default.svc:9000
        insecure: true
        s3forcepathstyle: true
        secret_access_key: xxxxx
        http_config:
          response_header_timeout: 60s
      boltdb_shipper:
        active_index_directory: /data/loki/boltdb-shipper-active
        cache_location: /data/loki/boltdb-shipper-cache
        cache_ttl: 24h
        shared_store: s3
    table_manager:
      retention_deletes_enabled: false

I think we should discuss this issue. My Loki error logs show: `msg="error notifying frontend about finished query"`. Loki does not produce a more helpful error log, which is a pity. The problem has persisted since version 2.5.0.

LinTechSo · Jul 19 '22 07:07

How large is your data? Two minutes for an expensive query is not too much. I think setting the right timeouts would help.

jeschkies · Jul 20 '22 14:07

Hi, thanks for the reply. Actually it is not too large, and I also set retention to 24h. Which timeout config do you mean, `querier.timeout`? In my opinion, increasing the timeout configuration is not a solution; it doesn't get to the core of the problem, which is slow performance.

LinTechSo · Jul 20 '22 14:07

> actually it is not too large.

That depends; for the histogram it could be. How are you running Loki? Could you create a profile via `go tool pprof "http://<Loki Host>/debug/pprof/profile"`?
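For a single-binary deployment on Kubernetes, one way to grab that profile is to port-forward to the Loki pod first; a minimal sketch, assuming the service is named `loki` in the `default` namespace and listens on port 3100 as in the config above:

    # assumption: the single-binary service is called "loki" in namespace "default"
    kubectl -n default port-forward svc/loki 3100:3100

    # capture a 30-second CPU profile from Loki's built-in pprof endpoint
    go tool pprof "http://localhost:3100/debug/pprof/profile?seconds=30"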

jeschkies · Jul 21 '22 12:07

> actually it is not too large.

> That depends; for the histogram it could be. How are you running Loki? Could you create a profile via `go tool pprof "http://<Loki Host>/debug/pprof/profile"`?

Thanks. I'm running multi-tenant Loki with S3 backend storage on a Kubernetes cluster, deployed in single binary mode via the Helm chart, with the configuration I posted above. My clients use Loki as short-term log storage. The latency on the `/schedulerpb.SchedulerForQuerier/QuerierLoop` route sometimes occurred when I used a stream selector over a time frame of more than 24h, for logs that entered Loki at a rate of 1 log per second per tenant. My question is: what exactly does the `/schedulerpb.SchedulerForQuerier/QuerierLoop` route do, and what can I do to increase performance and decrease this latency?

LinTechSo · Jul 22 '22 20:07

Hm, this is interesting. In a microservice setup you'd have a query frontend, a scheduler, and queriers. When the query frontend receives a query, it sends it on to the scheduler, which queues it until it is dequeued by a querier. Once the querier processes the query request, it answers to the query frontend, which in turn fulfills the query request.

The QuerierLoop is the connection between the scheduler and the querier. I'm surprised it shows up in your single binary mode at all. Also, the latency should not impact query performance much, as these are just the query requests being sent.

I'm not too familiar with the Helm charts. @trevorwhitney would you know why the scheduler is started in single binary mode?
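One way to see which components the single binary actually starts is Loki's services listing; a sketch, assuming the `/services` endpoint is available in this version and using the HTTP port from the config above:

    # list running modules/services and their states (querier, query-frontend, query-scheduler, ...)
    # assumes the endpoint is exposed on the Loki HTTP port
    curl -s http://localhost:3100/services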

jeschkies · Jul 25 '22 17:07

So if I understand correctly, when we deploy Loki in single binary mode we don't have a frontend at all, and the scheduler should be disabled by default. @trevorwhitney, would you please give us an update on this topic?

LinTechSo · Aug 02 '22 06:08

> and the scheduler should be disabled by default.

Yes

jeschkies · Aug 02 '22 16:08

@jeschkies and @LinTechSo, sorry for the long delay on this one. I don't know offhand why we put a scheduler in the single binary. It seems reasonable to just go Frontend -> Querier in single binary mode.
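For context, in microservice deployments the querier's pull target is chosen in the `frontend_worker` block; a minimal sketch with placeholder addresses (normally only one of the two addresses is set):

    frontend_worker:
      # placeholder addresses; adjust to your deployment
      # pull queries directly from the query frontend (Frontend -> Querier)
      frontend_address: query-frontend.loki.svc:9095
      # or, alternatively, pull from the query scheduler instead
      # scheduler_address: query-scheduler.loki.svc:9095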

trevorwhitney · Aug 22 '22 20:08

Hijacking this one, as we have the exact same symptoms using the scalable monolithic deployment mode.

K8s v1.26.3, Loki 2.9.1

The LokiRequestLatency alert deployed by the Helm chart fires constantly for the `/schedulerpb.SchedulerForQuerier/QuerierLoop` route if someone runs continuous requests/queries (such as our own tooling or dashboards in Grafana).

The latency on our side is also permanently 100s, which seems unreasonable as the queries are answered in less than 1 second.

The alert for this route seems pretty useless on our side, as we don't know how to handle it and there is no visible impact here.
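For dashboards, one pragmatic option is to exclude that route from the latency panel; a sketch that reuses the labels from the query at the top of this issue and simply drops the loop route:

    histogram_quantile(0.95,
      sum(rate(loki_request_duration_seconds_bucket{instance="$instance", cluster="$cluster", namespace="$namespace", route!="/schedulerpb.SchedulerForQuerier/QuerierLoop"}[5m])) by (le, route)
    )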

alxndr13 · Sep 28 '23 13:09

I noticed the LokiRequestLatency rule is defined differently here: https://github.com/grafana/loki/blob/v2.9.3/production/loki-mixin/alerts.libsonnet#L42

    alert: LokiRequestLatency
    annotations:
      message: |
        {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
    expr: |
      cluster_namespace_job_route:loki_request_duration_seconds:99quantile{route!~"(?i).*tail.*|/schedulerpb.SchedulerForQuerier/QuerierLoop"} > 1
    for: 15m
    labels:
      severity: critical

Notice how it excludes the `/schedulerpb.SchedulerForQuerier/QuerierLoop` route. It seems the Helm chart needs to be updated.

anarsen · Dec 19 '23 08:12

I can see how someone who isn't concerned about Helm could miss this when doing a change in the .libsonnet file. 😅

There are in fact two changes between these files. The other change is using `cluster_namespace_job_route` instead of `namespace_job_route` in the expr. I'm not sure whether that difference is intentional; otherwise I'd have already opened an MR to fix this.


(And the Helm chart has an additional LokiCanaryLatency alert that isn't represented in the .libsonnet but I don't think that's relevant.)
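For reference, the chart's rule with the mixin's exclusion applied might look roughly like this; a sketch that keeps the chart's `namespace_job_route` recording rule and only adds the route filter from the mixin:

    alert: LokiRequestLatency
    annotations:
      message: |
        {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
    expr: |
      namespace_job_route:loki_request_duration_seconds:99quantile{route!~"(?i).*tail.*|/schedulerpb.SchedulerForQuerier/QuerierLoop"} > 1
    for: 15m
    labels:
      severity: critical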

icedream · Feb 01 '24 10:02