``"/schedulerpb.SchedulerForQuerier/QuerierLoop`` route latency
Describe the bug
Loki version 2.6.0, Kubernetes 1.20, single binary deployment.
I'm monitoring Loki's latencies with this PromQL:
histogram_quantile(0.95, sum(rate(loki_request_duration_seconds_bucket{instance="$instance", cluster="$cluster", namespace="$namespace"}[5m])) by (le,route))
but most of the time I see 2 minutes of latency on the `/schedulerpb.SchedulerForQuerier/QuerierLoop` route.
I don't know which configuration relates to this route. I've tested many configs and read many issues, but nothing helps.
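(A sketch of the same query narrowed to that single route, written as a Prometheus recording rule, in case it helps anyone confirm where the latency comes from; the group and rule names here are made up for the example.)

```yaml
# Sketch: isolate the p95 latency of the QuerierLoop route only.
# The metric and labels are the same ones used in the query above;
# the group/rule names are illustrative.
groups:
  - name: loki-route-latency
    rules:
      - record: route:loki_request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(loki_request_duration_seconds_bucket{route="/schedulerpb.SchedulerForQuerier/QuerierLoop"}[5m])) by (le, route)
          )
```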
My Loki configuration:
auth_enabled: true
chunk_store_config:
  max_look_back_period: 0s
compactor:
  compaction_interval: 10m
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  retention_enabled: true
  shared_store: s3
  working_directory: /data/loki/boltdb-shipper-compactor
frontend:
  compress_responses: true
  max_outstanding_per_tenant: 4096
ingester:
  chunk_block_size: 262144
  chunk_idle_period: 3m
  chunk_retain_period: 1m
  flush_op_timeout: 10m
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  max_transfer_retries: 0
  wal:
    dir: /data/loki/wal
limits_config:
  enforce_metric_name: false
  max_query_series: 100000
  max_query_parallelism: 96
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 24h
querier:
  max_concurrent: 20
  query_timeout: 5m
  engine:
    timeout: 5m
query_range:
  align_queries_with_step: true
query_scheduler:
  max_outstanding_requests_per_tenant: 4096
  grpc_client_config:
    max_recv_msg_size: 184857600
    max_send_msg_size: 184857600
runtime_config:
  file: /opt/overrides.yaml
  period: 20s
schema_config:
  configs:
  - from: "2020-10-24"
    index:
      period: 24h
      prefix: index_
    object_store: s3
    schema: v11
    store: boltdb-shipper
server:
  http_server_read_timeout: 300s
  http_server_write_timeout: 300s
  http_listen_port: 3100
  grpc_server_max_recv_msg_size: 7894704
  grpc_server_max_send_msg_size: 7894704
storage_config:
  aws:
    access_key_id: yyyyyyy
    bucketnames: loki
    endpoint: loki-minio.default.svc:9000
    insecure: true
    s3forcepathstyle: true
    secret_access_key: xxxxx
    http_config:
      response_header_timeout: 60s
  boltdb_shipper:
    active_index_directory: /data/loki/boltdb-shipper-active
    cache_location: /data/loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: s3
table_manager:
  retention_deletes_enabled: false
I think we should discuss this issue.
My Loki error logs:
msg=" error notifying frontend about finished query"
Loki does not produce a helpful error log, which is a pity.
The problem has persisted since version 2.5.0.
How large is your data? 2 minutes for an expensive query is not too much. I think setting the right timeouts would help.
Hi, thanks for the reply.
Actually it is not too large, and I also set retention to 24h. Which timeout config do you mean? `querier.timeout`?
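For reference, these are the timeout-related settings already present in my configuration above:

```yaml
# Timeout-related settings taken from the configuration posted above.
server:
  http_server_read_timeout: 300s
  http_server_write_timeout: 300s
querier:
  query_timeout: 5m
  engine:
    timeout: 5m
storage_config:
  aws:
    http_config:
      response_header_timeout: 60s
```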
In my opinion, increasing the timeout configuration is not a solution. It doesn't get to the core of the problem, which is slow performance.
> actually it is not too large.

That depends. For the histogram it could be. How are you running Loki? Could you create a profile via
`go tool pprof "http://<Loki Host>/debug/pprof/profile"`
Thanks. I'm running multi-tenant Loki with S3 backend storage on a Kubernetes cluster, deployed in single binary mode via the Helm chart, with the configuration I sent above. My clients use Loki as short-term log storage.
The latency on the `/schedulerpb.SchedulerForQuerier/QuerierLoop` path sometimes occurs when I use a stream selector over a time frame of more than 24h, for logs that enter Loki at a rate of 1 log per second per tenant.
My question is: what exactly does the `/schedulerpb.SchedulerForQuerier/QuerierLoop` path do, and what can I do to increase performance and decrease this latency?
Hm, this is interesting. In a microservice setup you'd have a query frontend, a scheduler and a querier. When a query is received by the query frontend, it is sent on to the scheduler, which queues it until it is dequeued by a querier. Once the querier processes the query request, it answers to the query frontend, which in turn fulfills the query request.
The `QuerierLoop` is the connection between the scheduler and the querier. I'm surprised it shows up in your single binary mode at all. Also, the latency should not impact query performance much, as these are just the query requests being sent.
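Roughly, on the querier side of a microservice deployment the wiring looks like this (a minimal sketch; the address is a placeholder and defaults may differ between versions):

```yaml
# Sketch of the querier worker configuration in a microservice setup.
# The querier keeps a long-lived gRPC stream open to the scheduler to
# pull queries; that stream is what the
# /schedulerpb.SchedulerForQuerier/QuerierLoop route measures.
frontend_worker:
  scheduler_address: query-scheduler.loki.svc:9095  # placeholder address
```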
I'm not too familiar with the Helm charts. @trevorwhitney would you know why the scheduler is started in single binary mode?
So if I understand correctly, when we deploy Loki in single binary mode we don't have a frontend at all, and the scheduler should be disabled by default. @trevorwhitney would you please give us an update on this topic?
> and the scheduler should be disabled by default.

Yes
@jeschkies and @LinTechSo sorry for the long delay on this one. I don't know off hand why we put a scheduler in the single binary. Seems reasonable to just go Frontend -> Querier in single binary mode.
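In config terms that would roughly mean the querier worker points at the frontend instead of a scheduler, something like this sketch (placeholder address):

```yaml
# Sketch: querier worker connecting directly to the query frontend,
# with no scheduler in between. Address is a placeholder.
frontend_worker:
  frontend_address: query-frontend.loki.svc:9095
```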
Hijacking this one, as we have the exact same symptoms using the scalable monolithic deployment mode.
K8s v1.26.3, Loki 2.9.1
The `LokiRequestLatency` alert deployed by the Helm chart fires constantly for the `/schedulerpb.SchedulerForQuerier/QuerierLoop` route if someone issues continuous requests/queries (such as our own tooling or dashboards in Grafana).
The latency on our side is also permanently 100s, which seems unreasonable as the queries are answered in less than 1 second.
The alert for this route seems pretty useless on our side, as we don't know how to handle it and there is no visible impact.
I noticed the `LokiRequestLatency` rule is defined differently here: https://github.com/grafana/loki/blob/v2.9.3/production/loki-mixin/alerts.libsonnet#L42
alert: LokiRequestLatency
annotations:
  message: |
    {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
expr: |
  cluster_namespace_job_route:loki_request_duration_seconds:99quantile{route!~"(?i).*tail.*|/schedulerpb.SchedulerForQuerier/QuerierLoop"} > 1
for: 15m
labels:
  severity: critical
Notice how it excludes the `/schedulerpb.SchedulerForQuerier/QuerierLoop` route. Seems like the Helm chart needs to be updated.
I can see how someone who isn't concerned with Helm could miss this when making a change in the .libsonnet file. 😅
There are in fact two differences between these files. The other change is using `cluster_namespace_job_route` instead of `namespace_job_route` in the `expr`. I'm not sure whether that's intentionally different, otherwise I'd have already opened an MR to fix this.
(And the Helm chart has an additional `LokiCanaryLatency` alert that isn't represented in the .libsonnet, but I don't think that's relevant.)
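Concretely, I'd expect the minimal Helm-side fix to be adding the same route exclusion while keeping the chart's `namespace_job_route` metric, something like this sketch:

```yaml
# Sketch of the Helm-chart alert with the mixin's route exclusion applied;
# whether it should also switch to cluster_namespace_job_route is the open
# question mentioned above.
alert: LokiRequestLatency
annotations:
  message: |
    {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.
expr: |
  namespace_job_route:loki_request_duration_seconds:99quantile{route!~"(?i).*tail.*|/schedulerpb.SchedulerForQuerier/QuerierLoop"} > 1
for: 15m
labels:
  severity: critical
```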