Performance issues after upgrading from 2.5.0 to 2.6.1
Describe the bug
Upgrading from 2.5.0 to 2.6.1 makes queries slow; some never finish and hit the timeout.
Rolling back to 2.5.0 brings everything back, and all queries are fast again.
To Reproduce
Steps to reproduce the behavior:
- Upgrade to 2.6.1
- Lots of queries reach the configured timeout (4 min)
Expected behavior
No issues with queries.
Environment:
- Infrastructure: Kubernetes
- Deployment tool: Helm
Screenshots, Promtail config, or terminal output
From the Grafana dashboard:
`histogram_quantile(0.95, sum(rate(loki_logql_querystats_latency_seconds_bucket[5m])) by (le,type))`
I tried a few things around 1:30pm UTC (mainly restarting the ruler and removing some queries from the ruler that I first suspected).
I found no breaking changes in the documentation or in the new Helm chart; I checked with `git diff loki-distributed-0.48.4 loki-distributed-0.53.2 charts/loki-distributed/*`.
I can't upgrade to the latest version due to this regression, and I could not find any breaking change or explanation for it.
Could it be due to this change? https://github.com/grafana/loki/pull/5662
Current configuration for the limit: `split_queries_by_interval: 15m`
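For context, here is roughly where that limit sits in the Loki config; this is a minimal sketch with the value we run, not a recommendation:

```yaml
# Minimal sketch: only the relevant limit is shown, the rest of the config is omitted.
limits_config:
  split_queries_by_interval: 15m   # range queries are split into 15m sub-queries before being scheduled
```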
Don't hesitate to ask me for more information (I don't really know what you'll need).
Thanks
Hey thanks for reporting it, that is very concerning. Could you confirm that this is only happening for LogQL? That would help a ton.
> Hey thanks for reporting it, that is very concerning. Could you confirm that this is only happening for LogQL? That would help a ton.
What do you mean by only LogQL? Only when making queries?
Basic requests (displaying log lines) in Grafana worked, but the log volume part failed with a timeout (it looks to be done with the `sum` function?), as did some alerts I have to check that logs are present (using `absent_over_time` with the ruler).
The distributor and ingester have no issues (write path); there are no specific error logs, just slow queries.
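For reference, the ruler alerts mentioned above look roughly like this; the stream selector, window, and labels below are placeholders for illustration, not our real rules:

```yaml
# Sketch of a "logs are present" alert evaluated by the ruler.
# The selector and window are made up for illustration.
groups:
  - name: log-presence
    rules:
      - alert: NoLogsReceived
        # absent_over_time returns a sample only when the selector matched no logs in the window
        expr: 'absent_over_time({namespace="my-app"}[15m])'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No logs received for my-app in the last 15 minutes"
```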
> What do you mean by only LogQL? Only when making queries?
Like, did you experience similar behavior when using Grafana to send the queries? Asking because I recently found that LogQL doesn't handle context cancellation correctly while Grafana does.
> Like, did you experience similar behavior when using Grafana to send the queries? Asking because I recently found that LogQL doesn't handle context cancellation correctly while Grafana does.
I did the queries only with Grafana, not directly with LogQL, but I think it would have been the same (hope I understood :sweat_smile:). I only got issues after doing the upgrade, none before (and also none now that I'm back on 2.5.0).
sorry I messed up; I misunderstood LogQL as LogCLI :facepalm:
Do you mind doing a diff between the `/config` result returned by a 2.5 pod vs a 2.6 pod?
I did the diff between stage (on 2.6.1) and prod (on 2.5.0), so some settings differ, but mainly limits and timeouts.
`diff 2.6.1 2.5.0`:
38a39
> container_name: cortex
41,45d41
< container_name: loki
< endpoint_suffix: ""
< use_managed_identity: false
< user_assigned_id: ""
< chunk_delimiter: '-'
53,57c49
< bos:
< bucket_name: ""
< endpoint: bj.bcebos.com
< access_key_id: ""
< secret_access_key: ""
---
> use_managed_identity: false
97d88
< cas_retry_delay: 1s
119c110
< instance_id: stage-loki-distributed-querier-786d566746-c5kh2
---
> instance_id: prod-loki-distributed-querier-69d774fc64-27k4c
129d119
< compactor_address: ""
144,148d133
< cipher_suites: []
< curve_preferences: []
< min_version: 0
< max_version: 0
< prefer_server_cipher_suites: false
154,158d138
< cipher_suites: []
< curve_preferences: []
< min_version: 0
< max_version: 0
< prefer_server_cipher_suites: false
161,162c141,142
< http_server_read_timeout: 30s
< http_server_write_timeout: 30s
---
> http_server_read_timeout: 5m0s
> http_server_write_timeout: 5m0s
164,166c144,146
< grpc_server_max_recv_msg_size: 10000000
< grpc_server_max_send_msg_size: 10000000
< grpc_server_max_concurrent_streams: 100
---
> grpc_server_max_recv_msg_size: 104857600
> grpc_server_max_send_msg_size: 104857600
> grpc_server_max_concurrent_streams: 2000
179d158
< log_request_at_info_level_enabled: false
193d171
< cas_retry_delay: 1s
213c191
< instance_id: stage-loki-distributed-querier-786d566746-c5kh2
---
> instance_id: prod-loki-distributed-querier-69d774fc64-27k4c
220c198
< query_timeout: 1m0s
---
> query_timeout: 4m0s
224c202
< timeout: 5m0s
---
> timeout: 4m0s
226c204
< max_concurrent: 10
---
> max_concurrent: 6
266d243
< cas_retry_delay: 1s
290d266
< heartbeat_timeout: 1m0s
304c280
< id: stage-loki-distributed-querier-786d566746-c5kh2
---
> id: prod-loki-distributed-querier-69d774fc64-27k4c
327a304
> engine: chunks
355c332
< bucketnames: <stage>
---
> bucketnames: <prod>
377a355
> container_name: cortex
380,384d357
< container_name: loki
< endpoint_suffix: ""
< use_managed_identity: false
< user_assigned_id: ""
< chunk_delimiter: '-'
392,396c365
< bos:
< bucket_name: ""
< endpoint: bj.bcebos.com
< access_key_id: ""
< secret_access_key: ""
---
> use_managed_identity: false
476,480d444
< grpc_store: {}
< hedging:
< at: 0s
< up_to: 2
< max_per_second: 5
493c457
< host: stage-loki-distributed-memcached-index-queries
---
> host: prod-loki-distributed-memcached-index-queries
527a492,496
> grpc_store: {}
> hedging:
> at: 0s
> up_to: 2
> max_per_second: 5
555,558d523
< log_gateway_requests: false
< ingestername: stage-loki-distributed-querier-786d566746-c5kh2
< mode: RO
< ingesterdbretainperiod: 0s
560,632d524
< tsdb_shipper:
< active_index_directory: ""
< shared_store: ""
< shared_store_key_prefix: index/
< cache_location: ""
< cache_ttl: 24h0m0s
< resync_interval: 5m0s
< query_ready_num_days: 0
< index_gateway_client:
< grpc_client_config:
< max_recv_msg_size: 104857600
< max_send_msg_size: 104857600
< grpc_compression: ""
< rate_limit: 0
< rate_limit_burst: 0
< backoff_on_ratelimits: false
< backoff_config:
< min_period: 100ms
< max_period: 10s
< max_retries: 10
< tls_enabled: false
< tls_cert_path: ""
< tls_key_path: ""
< tls_ca_path: ""
< tls_server_name: ""
< tls_insecure_skip_verify: false
< log_gateway_requests: false
< ingestername: stage-loki-distributed-querier-786d566746-c5kh2
< mode: RO
< ingesterdbretainperiod: 0s
< index_gateway:
< mode: simple
< ring:
< kvstore:
< store: memberlist
< prefix: collectors/
< consul:
< host: localhost:8500
< acl_token: ""
< http_client_timeout: 20s
< consistent_reads: false
< watch_rate_limit: 1
< watch_burst_size: 1
< cas_retry_delay: 1s
< etcd:
< endpoints: []
< dial_timeout: 10s
< max_retries: 10
< tls_enabled: false
< tls_cert_path: ""
< tls_key_path: ""
< tls_ca_path: ""
< tls_server_name: ""
< tls_insecure_skip_verify: false
< username: ""
< password: ""
< multi:
< primary: ""
< secondary: ""
< mirror_enabled: false
< mirror_timeout: 2s
< heartbeat_period: 15s
< heartbeat_timeout: 1m0s
< tokens_file_path: ""
< zone_awareness_enabled: false
< instance_id: stage-loki-distributed-querier-786d566746-c5kh2
< instance_interface_names:
< - eth0
< - lo
< instance_port: 0
< instance_addr: ""
< instance_availability_zone: ""
< replication_factor: 3
645c537
< host: stage-loki-distributed-memcached-chunks
---
> host: prod-loki-distributed-memcached-chunks
741,742c633,634
< ingestion_rate_mb: 25
< ingestion_burst_size_mb: 50
---
> ingestion_rate_mb: 50
> ingestion_burst_size_mb: 100
746c638
< reject_old_samples: true
---
> reject_old_samples: false
752d643
< increment_duplicate_timestamp: false
756,757c647,648
< per_stream_rate_limit: 3145728
< per_stream_rate_limit_burst: 15728640
---
> per_stream_rate_limit: 52428800
> per_stream_rate_limit_burst: 104857600
759c650
< max_query_series: 500
---
> max_query_series: 500000
763c654
< cardinality_limit: 100000
---
> cardinality_limit: 1000000
787d677
< allow_deletes: false
880c770
< frontend_address: stage-loki-distributed-query-frontend:9095
---
> frontend_address: prod-loki-distributed-query-frontend:9095
883c773
< parallelism: 10
---
> parallelism: 6
936,942c826
< tail_proxy_url: http://stage-loki-distributed-querier:3100
< tail_tls_config:
< tls_cert_path: ""
< tls_key_path: ""
< tls_ca_path: ""
< tls_server_name: ""
< tls_insecure_skip_verify: false
---
> tail_proxy_url: http://prod-loki-distributed-querier:3100
975a860
> container_name: cortex
978,982d862
< container_name: loki
< endpoint_suffix: ""
< use_managed_identity: false
< user_assigned_id: ""
< chunk_delimiter: '-'
989a870
> use_managed_identity: false
1020,1024d900
< bos:
< bucket_name: ""
< endpoint: bj.bcebos.com
< access_key_id: ""
< secret_access_key: ""
1061d936
< type: Bearer
1079d953
< cas_retry_delay: 1s
1099c973
< instance_id: stage-loki-distributed-querier-786d566746-c5kh2
---
> instance_id: prod-loki-distributed-querier-69d774fc64-27k4c
1125c999
< name: cortex-stage
---
> name: cortex-prod
1134a1009
> retry_on_http_429: true
1156c1031
< host: stage-loki-distributed-memcached-frontend
---
> host: prod-loki-distributed-memcached-frontend
1210,1211d1084
< cluster_label: ""
< cluster_label_verification_disabled: false
1213c1086
< - stage-loki-distributed-memberlist
---
> - prod-loki-distributed-memberlist
1217c1090
< abort_if_cluster_join_fails: false
---
> abort_if_cluster_join_fails: true
1243d1115
< deletion_mode: disabled
1257d1128
< cas_retry_delay: 1s
1279c1150
< instance_id: stage-loki-distributed-querier-786d566746-c5kh2
---
> instance_id: prod-loki-distributed-querier-69d774fc64-27k4c
1288d1158
< querier_forget_delay: 0s
1318d1187
< cas_retry_delay: 1s
1340c1209
< instance_id: stage-loki-distributed-querier-786d566746-c5kh2
---
> instance_id: prod-loki-distributed-querier-69d774fc64-27k4c
I can provide the full files, but I'm not sure they can be uploaded to GitHub :thinking:
Is there anything else I could do or share to help?
I just ran into the same issue, having upgraded from 2.4.2 to 2.6.1 using the loki-simple-scalable Helm chart (0.2.0 -> 1.8.6).
The issue affects querying only (i.e. ingestion seems to work fine) and manifests both in Grafana and in LogCLI, when querying for logs (`/api/v1/query_range`) as well as for labels (`/api/v1/labels`).
Label queries (`logcli labels`) get stuck and time out about 20% of the time. Raising the timeout (in all relevant places) doesn't help.
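For anyone comparing notes, these are the Loki-side settings usually meant by "all relevant places"; the values are illustrative (the Grafana data source timeout and any gateway/proxy timeouts need raising separately):

```yaml
# Illustrative values only -- these knobs are usually raised together.
server:
  http_server_read_timeout: 5m
  http_server_write_timeout: 5m
querier:
  query_timeout: 4m      # per-query timeout enforced by the querier
  engine:
    timeout: 4m          # LogQL engine evaluation timeout
```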
In case it might help, here's a trace of a labels query timing out:
And for comparison a trace of a labels query that succeeded:
For now I reverted back to 2.4.2.
We're seeing something similar here too after upgrading today: query range request durations have shot up.
Hi, any updates? Same issue here. Grafana sometimes gives me errors about query timeouts and log volume errors:
`Failed to load log volume for this query`
`context deadline exceeded`
I'm using the loki-distributed Helm chart with Loki 2.6.1.
Is this being checked by the maintainers? Is there a way to put labels on the issue to make it more visible, maybe?
Is there any other information needed?
I hit the same problem too. I found that some queries become very slow after upgrading the querier to a nightly version that contains #5662, but it has no effect on the frontend.
So in the end I just upgraded the frontend to a build containing #5662, to improve instant query performance, and kept the querier on 2.5.x.
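If someone wants to try the same mixed-version workaround with the loki-distributed chart, a rough sketch of the idea is below; the per-component image override keys are an assumption based on the chart's values layout, so verify them against your chart version's values.yaml before using:

```yaml
# Hypothetical Helm values override -- key names assumed, check your chart's values.yaml.
loki:
  image:
    tag: 2.5.0          # default image for all components (queriers stay on 2.5.x)
queryFrontend:
  image:
    tag: 2.6.1          # only the query-frontend runs the build containing #5662
```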
Sorry, we overlooked this because the visibility of issues on GitHub isn't super good (and the number of open issues is quite high right now). I'll try to check this week.
Ok, I did the following: I compared what query_range latency looked like for our clusters ~1 day ago, ~10 days ago, ~30 days ago and ~90 days ago, and based on that we didn't suffer performance degradations on those. Also, there are other community users running Loki on the latest versions and it seems they are not hitting this either. So it looks like the 2.6 update has a problem, but we didn't experience it because some configuration is protecting us/hiding the issue.
@panos-- are you using SSD mode?
@Alan01252 did you upgrade from 2.5 to 2.6, or from 2.4? I'm trying to isolate whether the problem was introduced between 2.5 and 2.6.
@honganan could you try changing only your querier to use our latest main? I want to check whether the bug was fixed after 2.6 was released.
2.5 to 2.6. Running 2.5 now without issues; we were using the Loki scalable Helm chart with mostly the default settings.
Interesting, so you're using SSD. @honganan are you also using SSD?
@DylanGuedes We're using the loki-simple-scalable Helm chart. So, SSD mode I guess.
One more thing: are ALL your query_range queries timing out, or just the most expensive ones? If just some special queries are timing out, could you share them? @panos--
Trying to isolate things:
- If all your queries are timing out, then it is probably a connection problem with the SSD Helm chart
- If it is just queries with a large time window, then it is probably a configuration that tunes query behavior (see the sketch below)
- If it is a query similar to the one reported by @RomainBelorgey (`histogram_quantile(0.95, sum(rate(`...), then it might be a problem with LogQL, etc.
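To make the second bullet concrete, these are the query-tuning settings I would compare first between the 2.5 and 2.6 configs; the values below are examples in the spirit of the diff shared above, not recommendations:

```yaml
# Settings that shape how range queries are split and fanned out across queriers.
limits_config:
  split_queries_by_interval: 15m   # smaller interval => more sub-queries per request
  max_query_parallelism: 32        # cap on sub-queries executed concurrently per query
querier:
  max_concurrent: 10               # worker slots per querier process
frontend_worker:
  parallelism: 10                  # queries each worker pulls from the query-frontend
```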
What I reported is the query I use to see the time Loki takes to serve requests: `histogram_quantile(0.95, sum(rate(loki_logql_querystats_latency_seconds_bucket[5m])) by (le,type))`.
But the issue is visible on all queries, not only a specific one, and also on instant queries done by the ruler.
What do you mean by SSD mode? Currently using https://github.com/grafana/helm-charts/tree/main/charts/loki-distributed
> What do you mean by SSD mode? Currently using https://github.com/grafana/helm-charts/tree/main/charts/loki-distributed
Loki has three modes of operation: monolithic, microservices and SSD. You can find more details here. SSD is the most recent one and because of this, it is still under improvement.
Regarding the helmchart: We migrated from that repo to this Loki repo, you can find it here. In the readme there's a migration guide. Since this migration is new, it is still under development but we'll be happy if you report any issues regarding it.
> Loki has three modes of operation: monolithic, microservices and SSD. You can find more details here. SSD is the most recent one and because of this, it is still under improvement.
Ah okay, so for me it is microservices (as I'm using the `loki-distributed` Helm chart).
> Regarding the helmchart: We migrated from that repo to this Loki repo, you can find it here. In the readme there's a migration guide. Since this migration is new, it is still under development but we'll be happy if you report any issues regarding it.
Thanks for pointing that out; I'll check it, but it won't help with this issue.
> Thanks for pointing that out; I'll check it, but it won't help with this issue.
Agreed, but still, do you mind also reporting this issue on that repo? There's a chance this is related to a connection problem, and in that case the culprit would be the Helm chart itself.
Sorry for the back-and-forth of questions; I reported this to other maintainers, but at this point I'm first trying to gather as much data/evidence/clarification as possible to make it easier to replicate things.
Checking https://github.com/grafana/loki/tree/main/production/helm/loki, but I can't use that as it will deploy a single Loki binary doing everything, which is not what I want.
> Agreed, but still, do you mind also reporting this issue on that repo? There's a chance this is related to a connection problem, and in that case the culprit would be the Helm chart itself.
It is not the Helm chart's fault; the only change from 2.5.x to 2.6.x is the binary, plus maybe some configuration file changes, nothing in how the different Loki components are deployed and interact with each other.
My stage environment is still upgraded (so on 2.6.x) and has no issues, but it receives fewer logs and fewer requests, so it's not a connection issue here (more a performance one).
> Sorry for the back-and-forth of questions; I reported this to other maintainers, but at this point I'm first trying to gather as much data/evidence/clarification as possible to make it easier to replicate things.
No worries, thanks for taking care of it ;) and I hope I am helpful :p
For me it's also not just special queries, and also not consistent. As stated above, the easiest way to trigger the error was calling `logcli labels`: not all of those queries timed out, just ~20%.
Also, same as @RomainBelorgey, I have my staging environment (with less data) running fine on 2.6. No issues there.
> Interesting, so you're using SSD. @honganan are you also using SSD?
No, I am using Cassandra + S3. Sorry, our fork has diverged quite a lot and it isn't easy to switch to 2.6.x to check on short notice.
Any update? What should we do to help fix this? Or is it possible to revert the change made in 2.6.x?
@RomainBelorgey Could you share your 2.5.0 and 2.6.1 configs for loki-distributed? Thanks a lot.
@patsevanton No changes between the two, except maybe some limits / perf things (2.5.0 = prod with more power; 2.6.1 = stage).
Thx, hope it helps in a way :shrug:
@RomainBelorgey Do you have the compactor config?