
Performance issues after upgrading from 2.5.0 to 2.6.1

Open RomainBelorgey opened this issue 2 years ago • 26 comments

Describe the bug

Upgrading from 2.5.0 to 2.6.1 makes queries slow; some even never finish (they reach the timeout).

Rolling back to 2.5.0 brings everything back and all queries are fast again.

To Reproduce

Steps to reproduce the behavior:

  1. Upgrade to 2.6.1
  2. Many queries reach the configured timeout (4m)

Expected behavior

No issues with queries

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Helm

Screenshots, Promtail config, or terminal output

From the Grafana dashboard [screenshot], using the query histogram_quantile(0.95, sum(rate(loki_logql_querystats_latency_seconds_bucket[5m])) by (le,type)). I tried things around 1:30pm UTC (mainly restarting the ruler and removing some queries from the ruler that I first suspected).

No breaking changes in the documentation or in the new Helm chart; I checked with: git diff loki-distributed-0.48.4 loki-distributed-0.53.2 charts/loki-distributed/*

I can't upgrade to the latest version because of this regression. I could not find any breaking change or explanation for this issue.

Could it be due to this change: https://github.com/grafana/loki/pull/5662? Current configuration for the limit: split_queries_by_interval: 15m
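
For reference, a minimal sketch of that setting in config terms (its placement under limits_config is assumed, as in recent versions):

  limits_config:
    split_queries_by_interval: 15m   # value quoted above; controls how range queries are split into sub-queries by time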

Don't hesitate to ask me for more information (I don't really know what you will need).

Thanks

RomainBelorgey avatar Jul 25 '22 15:07 RomainBelorgey

Hey thanks for reporting it, that is very concerning. Could you confirm that this is only happening for LogQL? That would help a ton.

DylanGuedes avatar Jul 25 '22 15:07 DylanGuedes

Hey thanks for reporting it, that is very concerning. Could you confirm that this is only happening for LogQL? That would help a ton.

What do you mean by only LogQL? Only when making queries?

Basic requests (displaying log lines) in Grafana worked, but the log volume part failed with a timeout (it looks to be done with a sum function?), as did some alerts I have that check whether logs are present (using absent_over_time with the ruler).

The distributor + ingester had no issues (write path); no specific error logs, apart from slow queries being logged.
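
For illustration, a hypothetical ruler rule of the kind described above (the group name, alert name, selector, window, and labels are made up):

  groups:
    - name: log-presence                                        # hypothetical group name
      rules:
        - alert: NoLogsReceived                                 # hypothetical alert
          expr: absent_over_time({namespace="my-app"}[10m])     # fires if no log lines were seen in the last 10m
          for: 5m
          labels:
            severity: warning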

RomainBelorgey avatar Jul 25 '22 15:07 RomainBelorgey

What do you mean by only LogQL? Only when making queries?

Like, did you experience a similar behavior by using Grafana to send the queries? Asking because I recently found that LogQL doesn't handle context cancellation correctly while Grafana does.

DylanGuedes avatar Jul 25 '22 15:07 DylanGuedes

Like, did you experience a similar behavior by using Grafana to send the queries? Asking because I recently found that LogQL doesn't handle context cancellation correctly while Grafana does.

I did the queries only with Grafana, not directly with LogQL, but it would have been the same I think (hope I understood :sweat_smile:). I got issues only after doing the upgrade; I had no issues before (and also no issues currently, now that I'm back on 2.5.0).

RomainBelorgey avatar Jul 25 '22 15:07 RomainBelorgey

sorry I messed up; I misunderstood LogQL as LogCLI :facepalm:

DylanGuedes avatar Jul 25 '22 16:07 DylanGuedes

Do you mind doing a diff between the /config result returned by a 2.5 pod and the one returned by a 2.6 pod?

DylanGuedes avatar Jul 25 '22 17:07 DylanGuedes

I did the diff between stage (on 2.6.1) and prod (on 2.5.0). Some settings differ, but mainly limits or timeouts.

diff 2.6.1 2.5.0:

38a39
>       container_name: cortex
41,45d41
<       container_name: loki
<       endpoint_suffix: ""
<       use_managed_identity: false
<       user_assigned_id: ""
<       chunk_delimiter: '-'
53,57c49
<     bos:
<       bucket_name: ""
<       endpoint: bj.bcebos.com
<       access_key_id: ""
<       secret_access_key: ""
---
>       use_managed_identity: false
97d88
<         cas_retry_delay: 1s
119c110
<     instance_id: stage-loki-distributed-querier-786d566746-c5kh2
---
>     instance_id: prod-loki-distributed-querier-69d774fc64-27k4c
129d119
<   compactor_address: ""
144,148d133
<     cipher_suites: []
<     curve_preferences: []
<     min_version: 0
<     max_version: 0
<     prefer_server_cipher_suites: false
154,158d138
<     cipher_suites: []
<     curve_preferences: []
<     min_version: 0
<     max_version: 0
<     prefer_server_cipher_suites: false
161,162c141,142
<   http_server_read_timeout: 30s
<   http_server_write_timeout: 30s
---
>   http_server_read_timeout: 5m0s
>   http_server_write_timeout: 5m0s
164,166c144,146
<   grpc_server_max_recv_msg_size: 10000000
<   grpc_server_max_send_msg_size: 10000000
<   grpc_server_max_concurrent_streams: 100
---
>   grpc_server_max_recv_msg_size: 104857600
>   grpc_server_max_send_msg_size: 104857600
>   grpc_server_max_concurrent_streams: 2000
179d158
<   log_request_at_info_level_enabled: false
193d171
<         cas_retry_delay: 1s
213c191
<     instance_id: stage-loki-distributed-querier-786d566746-c5kh2
---
>     instance_id: prod-loki-distributed-querier-69d774fc64-27k4c
220c198
<   query_timeout: 1m0s
---
>   query_timeout: 4m0s
224c202
<     timeout: 5m0s
---
>     timeout: 4m0s
226c204
<   max_concurrent: 10
---
>   max_concurrent: 6
266d243
<           cas_retry_delay: 1s
290d266
<     heartbeat_timeout: 1m0s
304c280
<     id: stage-loki-distributed-querier-786d566746-c5kh2
---
>     id: prod-loki-distributed-querier-69d774fc64-27k4c
327a304
>   engine: chunks
355c332
<     bucketnames: <stage>
---
>     bucketnames: <prod>
377a355
>     container_name: cortex
380,384d357
<     container_name: loki
<     endpoint_suffix: ""
<     use_managed_identity: false
<     user_assigned_id: ""
<     chunk_delimiter: '-'
392,396c365
<   bos:
<     bucket_name: ""
<     endpoint: bj.bcebos.com
<     access_key_id: ""
<     secret_access_key: ""
---
>     use_managed_identity: false
476,480d444
<   grpc_store: {}
<   hedging:
<     at: 0s
<     up_to: 2
<     max_per_second: 5
493c457
<       host: stage-loki-distributed-memcached-index-queries
---
>       host: prod-loki-distributed-memcached-index-queries
527a492,496
>   grpc_store: {}
>   hedging:
>     at: 0s
>     up_to: 2
>     max_per_second: 5
555,558d523
<       log_gateway_requests: false
<     ingestername: stage-loki-distributed-querier-786d566746-c5kh2
<     mode: RO
<     ingesterdbretainperiod: 0s
560,632d524
<   tsdb_shipper:
<     active_index_directory: ""
<     shared_store: ""
<     shared_store_key_prefix: index/
<     cache_location: ""
<     cache_ttl: 24h0m0s
<     resync_interval: 5m0s
<     query_ready_num_days: 0
<     index_gateway_client:
<       grpc_client_config:
<         max_recv_msg_size: 104857600
<         max_send_msg_size: 104857600
<         grpc_compression: ""
<         rate_limit: 0
<         rate_limit_burst: 0
<         backoff_on_ratelimits: false
<         backoff_config:
<           min_period: 100ms
<           max_period: 10s
<           max_retries: 10
<         tls_enabled: false
<         tls_cert_path: ""
<         tls_key_path: ""
<         tls_ca_path: ""
<         tls_server_name: ""
<         tls_insecure_skip_verify: false
<       log_gateway_requests: false
<     ingestername: stage-loki-distributed-querier-786d566746-c5kh2
<     mode: RO
<     ingesterdbretainperiod: 0s
< index_gateway:
<   mode: simple
<   ring:
<     kvstore:
<       store: memberlist
<       prefix: collectors/
<       consul:
<         host: localhost:8500
<         acl_token: ""
<         http_client_timeout: 20s
<         consistent_reads: false
<         watch_rate_limit: 1
<         watch_burst_size: 1
<         cas_retry_delay: 1s
<       etcd:
<         endpoints: []
<         dial_timeout: 10s
<         max_retries: 10
<         tls_enabled: false
<         tls_cert_path: ""
<         tls_key_path: ""
<         tls_ca_path: ""
<         tls_server_name: ""
<         tls_insecure_skip_verify: false
<         username: ""
<         password: ""
<       multi:
<         primary: ""
<         secondary: ""
<         mirror_enabled: false
<         mirror_timeout: 2s
<     heartbeat_period: 15s
<     heartbeat_timeout: 1m0s
<     tokens_file_path: ""
<     zone_awareness_enabled: false
<     instance_id: stage-loki-distributed-querier-786d566746-c5kh2
<     instance_interface_names:
<     - eth0
<     - lo
<     instance_port: 0
<     instance_addr: ""
<     instance_availability_zone: ""
<     replication_factor: 3
645c537
<       host: stage-loki-distributed-memcached-chunks
---
>       host: prod-loki-distributed-memcached-chunks
741,742c633,634
<   ingestion_rate_mb: 25
<   ingestion_burst_size_mb: 50
---
>   ingestion_rate_mb: 50
>   ingestion_burst_size_mb: 100
746c638
<   reject_old_samples: true
---
>   reject_old_samples: false
752d643
<   increment_duplicate_timestamp: false
756,757c647,648
<   per_stream_rate_limit: 3145728
<   per_stream_rate_limit_burst: 15728640
---
>   per_stream_rate_limit: 52428800
>   per_stream_rate_limit_burst: 104857600
759c650
<   max_query_series: 500
---
>   max_query_series: 500000
763c654
<   cardinality_limit: 100000
---
>   cardinality_limit: 1000000
787d677
<   allow_deletes: false
880c770
<   frontend_address: stage-loki-distributed-query-frontend:9095
---
>   frontend_address: prod-loki-distributed-query-frontend:9095
883c773
<   parallelism: 10
---
>   parallelism: 6
936,942c826
<   tail_proxy_url: http://stage-loki-distributed-querier:3100
<   tail_tls_config:
<     tls_cert_path: ""
<     tls_key_path: ""
<     tls_ca_path: ""
<     tls_server_name: ""
<     tls_insecure_skip_verify: false
---
>   tail_proxy_url: http://prod-loki-distributed-querier:3100
975a860
>       container_name: cortex
978,982d862
<       container_name: loki
<       endpoint_suffix: ""
<       use_managed_identity: false
<       user_assigned_id: ""
<       chunk_delimiter: '-'
989a870
>       use_managed_identity: false
1020,1024d900
<     bos:
<       bucket_name: ""
<       endpoint: bj.bcebos.com
<       access_key_id: ""
<       secret_access_key: ""
1061d936
<     type: Bearer
1079d953
<         cas_retry_delay: 1s
1099c973
<     instance_id: stage-loki-distributed-querier-786d566746-c5kh2
---
>     instance_id: prod-loki-distributed-querier-69d774fc64-27k4c
1125c999
<       name: cortex-stage
---
>       name: cortex-prod
1134a1009
>         retry_on_http_429: true
1156c1031
<         host: stage-loki-distributed-memcached-frontend
---
>         host: prod-loki-distributed-memcached-frontend
1210,1211d1084
<   cluster_label: ""
<   cluster_label_verification_disabled: false
1213c1086
<   - stage-loki-distributed-memberlist
---
>   - prod-loki-distributed-memberlist
1217c1090
<   abort_if_cluster_join_fails: false
---
>   abort_if_cluster_join_fails: true
1243d1115
<   deletion_mode: disabled
1257d1128
<         cas_retry_delay: 1s
1279c1150
<     instance_id: stage-loki-distributed-querier-786d566746-c5kh2
---
>     instance_id: prod-loki-distributed-querier-69d774fc64-27k4c
1288d1158
<   querier_forget_delay: 0s
1318d1187
<         cas_retry_delay: 1s
1340c1209
<     instance_id: stage-loki-distributed-querier-786d566746-c5kh2
---
>     instance_id: prod-loki-distributed-querier-69d774fc64-27k4c

I can provide the full files, but I'm not sure they can be uploaded to GitHub :thinking:

RomainBelorgey avatar Jul 26 '22 07:07 RomainBelorgey

Is there anything else I could do or share to help?

RomainBelorgey avatar Aug 01 '22 12:08 RomainBelorgey

I just ran into the same issue, having upgraded to 2.6.1 from 2.4.2 using loki-simple-scalable Helm chart (0.2.0 -> 1.8.6).

The issue affects querying only (i.e. ingestion seems to work fine) and manifests both in Grafana and in LogCLI, when querying for logs (/api/v1/query_range) as well as for labels (/api/v1/labels).

Labels queries (logcli labels) get stuck and time out about 20% of the time. Raising the timeout (in all relevant places) doesn't help.
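
For context, a sketch of the kind of timeout settings involved (example values only; the keys mirror those in the config diff posted earlier in this thread):

  server:
    http_server_read_timeout: 5m
    http_server_write_timeout: 5m
  querier:
    query_timeout: 4m      # querier-level query timeout
    engine:
      timeout: 4m          # LogQL engine timeout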

In case it might help, here's a trace of a labels query timing out:

[screenshot: trace of the timed-out labels query]

And for comparison a trace of a labels query that succeeded:

[screenshot: trace of the successful labels query]

For now I reverted back to 2.4.2.

panos-- avatar Aug 11 '22 14:08 panos--

We're seeing something similar here too after upgrading today; query range request durations (in seconds) have shot up.

[screenshot: query range request duration graph]

Alan01252 avatar Aug 17 '22 14:08 Alan01252

Hi, any updates? Same issue here. Grafana sometimes gives me errors about query timeouts and log volume errors:

Failed to load log volume for this query
context deadline exceeded

Using the distributed Loki Helm chart, version 2.6.1.

LinTechSo avatar Sep 04 '22 19:09 LinTechSo

Is this being checked by the maintainers? Is there a way to put labels on the issue to make it more visible, maybe?

Is there any other information needed?

RomainBelorgey avatar Sep 15 '22 07:09 RomainBelorgey

I hit the same problem too. I found that some queries become very slow after upgrading the Querier to a nightly version which contains 5662, but it has no effect on the Frontend.

So in the end I upgraded only the Frontend (which contains 5662) to improve instant query performance, and kept the Querier on 2.5.x.

honganan avatar Sep 15 '22 08:09 honganan

Sorry, we overlooked this because the visibility of issues on GitHub isn't super good (and the number of open issues is quite high right now). I'll try to check this week.

DylanGuedes avatar Sep 15 '22 11:09 DylanGuedes

OK, I did the following: I compared what query_range latency for our clusters looked like ~1 day ago, ~10 days ago, ~30 days ago and ~90 days ago, and based on that we didn't see performance degradation in any of those periods. Also, there are other community users running Loki on the latest versions and it seems they are not hitting this either. So it looks like the 2.6 update has a problem, but we didn't experience it because some configuration is protecting us / hiding the issue.

@panos-- are you using SSD mode?

@Alan01252 did you upgrade from 2.5 to 2.6 or from 2.4? I'm trying to isolate whether the problem was introduced between 2.5 and 2.6.

@honganan could you try changing only your querier to use our latest main? I want to check that the bug wasn't fixed after 2.6 was released.

DylanGuedes avatar Sep 15 '22 18:09 DylanGuedes

2.5 to 2.6. Running 2.5 now without issues; we were using the Loki scalable Helm chart with mostly the default settings.

Alan01252 avatar Sep 15 '22 18:09 Alan01252

Interesting, so you're using SSD. @honganan are you also using SSD?

DylanGuedes avatar Sep 15 '22 18:09 DylanGuedes

@DylanGuedes We're using the loki-simple-scalable Helm chart. So, SSD mode I guess.

panos-- avatar Sep 16 '22 09:09 panos--

One more thing: are ALL your query_range queries timing out, or just the most expensive ones? If just some specific queries are timing out, could you share them? @panos--

Trying to isolate things:

  • If all your queries are timing out, then it is probably a connection problem with the SSD Helm chart
  • If it is just queries with a large time window, then it is probably a configuration setting that tunes query behavior (see the sketch after this list)
  • If it is a query similar to the one reported by @RomainBelorgey (histogram_quantile(0.95, sum(rate(...)))), then it might be a problem with LogQL itself
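
For the second case, a sketch of the kind of settings that tune query behavior over large time ranges (the values shown are illustrative, not recommendations):

  limits_config:
    split_queries_by_interval: 15m   # interval used to split range queries into sub-queries
    max_query_parallelism: 32        # how many split/sharded sub-queries may run at once
    max_query_length: 721h           # upper bound on the queried time range
  frontend_worker:
    parallelism: 6                   # per-querier worker parallelism (6/10 appear in the diff above)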

DylanGuedes avatar Sep 16 '22 10:09 DylanGuedes

What I reported is the query to see the time taken by Loki to serve requests: histogram_quantile(0.95, sum(rate(loki_logql_querystats_latency_seconds_bucket[5m])) by (le,type))

But the issue is visible on all queries, not only a specific one, and also on instant queries done by the ruler.

What do you mean by SSD mode? I'm currently using https://github.com/grafana/helm-charts/tree/main/charts/loki-distributed

RomainBelorgey avatar Sep 16 '22 10:09 RomainBelorgey

What do you mean by SSD mode? I'm currently using https://github.com/grafana/helm-charts/tree/main/charts/loki-distributed

Loki has three modes of operation: monolithic, microservices and SSD. You can find more details here. SSD is the most recent one and because of this, it is still under improvement.

Regarding the Helm chart: we migrated from that repo to this Loki repo; you can find it here. In the README there's a migration guide. Since this migration is new it is still under development, but we'll be happy if you report any issues with it.

DylanGuedes avatar Sep 16 '22 10:09 DylanGuedes

Loki has three modes of operation: monolithic, microservices and SSD. You can find more details here. SSD is the most recent one and because of this, it is still under improvement.

Ah okay, so for me it is microservices (as I'm using the loki-distributed Helm chart).

Regarding the Helm chart: we migrated from that repo to this Loki repo; you can find it here. In the README there's a migration guide. Since this migration is new it is still under development, but we'll be happy if you report any issues with it.

Thanks for pointing that out; I will check, but it won't help with this issue.

RomainBelorgey avatar Sep 16 '22 10:09 RomainBelorgey

Thanks for pointing that out; I will check, but it won't help with this issue.

Agreed, but still, do you mind also reporting this issue on that repo? There's a chance this is related to a connection problem, in which case the culprit would be the Helm chart itself.

Sorry for the back-and-forth of questions. I reported this to other maintainers, but at this point I'm first trying to gather as much data/evidence/clarification as possible to make it easier to replicate things.

DylanGuedes avatar Sep 16 '22 10:09 DylanGuedes

I'm checking https://github.com/grafana/loki/tree/main/production/helm/loki, but I can't use that as it will deploy a single Loki binary doing everything, which is not what I want.

Agreed, but still, do you mind also reporting this issue on that repo? There's a chance this is related to a connection problem, in which case the culprit would be the Helm chart itself.

It is not the Helm chart's fault; the only change here from 2.5.x to 2.6.x is the binary, plus maybe some configuration file changes, nothing about the way the different Loki components are deployed and interact with each other. I still have my stage env upgraded (so using 2.6.x) with no issues, but it has fewer logs and fewer requests, so it's not a connection issue here (more a performance one).

Sorry for the back-and-forth of questions. I reported this to other maintainers, but at this point I'm first trying to gather as much data/evidence/clarification as possible to make it easier to replicate things.

No worries, thanks for taking care of it ;) and I hope I am helpful :p

RomainBelorgey avatar Sep 16 '22 11:09 RomainBelorgey

For me it's also not just specific queries, and also not consistent. As stated above, the easiest way to trigger the error was by calling logcli labels. Not all of those queries timed out, just ~20%.

Also, same as @RomainBelorgey, I have my staging environment (with less data) running fine on 2.6. No issues there.

panos-- avatar Sep 16 '22 12:09 panos--

Interesting, so you're using SSD. @honganan are you also using SSD?

No, I am using Cassandra + S3. I am sorry, our fork has changed a lot and it isn't easy to switch it to 2.6.x to check in a short time.

honganan avatar Sep 19 '22 02:09 honganan

Any update? What should we do to help fix this? Or is it possible to revert the change made in 2.6.x?

RomainBelorgey avatar Oct 14 '22 21:10 RomainBelorgey

@RomainBelorgey Could you share the 2.5.0 and 2.6.1 configs for loki-distributed? Thanks a lot.

patsevanton avatar Nov 06 '22 14:11 patsevanton

@patsevanton No changes between the two, except maybe some limits / perf things (2.5.0 = prod with more power; 2.6.1 = stage).

loki_config_files.tar.gz

Thx, hope it helps in a way :shrug:

RomainBelorgey avatar Nov 07 '22 08:11 RomainBelorgey

@RomainBelorgey Do you have the compactor config?

patsevanton avatar Nov 09 '22 14:11 patsevanton