
Grafana dashboard shows "too many outstanding requests" after upgrade to v2.4.2

Open itsnotv opened this issue 3 years ago • 49 comments

Describe the bug

After upgrading from v2.4.1 to v2.4.2, none of the panels that use Loki show any data. I have a dashboard with 4 panels that load data from Loki. I can see that data is being ingested correctly when I query the datasource in Grafana Explore.

Environment

Using loki with docker-compose and shipping docker logs with loki driver.

loki.yml

auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

table_manager:
  retention_deletes_enabled: true
  retention_period: 168h

error on the grafana panel

status:429
statusText:""
data:Object
message:"too many outstanding requests"
error:""
response:"too many outstanding requests"
config:Object
headers:Object
url:"api/datasources/proxy/9/loki/api/v1/query_range?direction=BACKWARD&limit=240&query=sum(count_over_time(%7Bcontainer_name%3D%22nginx%22%2Csource%3D%22stdout%22%7D%5B2m%5D))&start=1641993484871000000&end=1642015084871000000&step=120"
retry:0
hideFromInspector:false
message:"too many outstanding requests"

itsnotv avatar Jan 12 '22 19:01 itsnotv

Hi, I resolved the problem on my side by increasing two default values:

querier:
  max_concurrent: 2048
query_scheduler:
  max_outstanding_requests_per_tenant: 2048

It's not perfect, but this error helped me understand the architecture better and better. Next step: use the query_frontend (it is not mandatory, but it becomes active if we add something for it in the config) to do the queueing, and of course decrease these values as much as possible for my home Docker service.

zaibakker avatar Jan 16 '22 22:01 zaibakker

Hi, I'm back after trying many, many settings.

I solved my problem by:

  • activating the query_frontend
  • reducing the splitting of queries with split_queries_by_interval: 24h
  • setting max_outstanding_per_tenant: 1024

My dashboard now completes in 5s :) Without the splitting parameter, I always got a 429 error on 1 to 3 graphs and the render took 3 minutes.

It works for me because I had a lot of small requests, too many for my Docker/Loki process. Reducing them was the solution; increasing workers, frontend settings, parallelism or timeouts was a bad idea.

zaibakker avatar Jan 18 '22 19:01 zaibakker

see https://github.com/grafana/loki/pull/5204

dfoxg avatar Jan 22 '22 22:01 dfoxg

For completeness, here's the needed config

query_range:
  split_queries_by_interval: 24h

frontend:
  max_outstanding_per_tenant: 1024

yakob-aleksandrovich avatar Jan 25 '22 11:01 yakob-aleksandrovich

For completeness, here's the needed config

query_range:
  split_queries_by_interval: 24h

frontend:
  max_outstanding_per_tenant: 1024

This helped partially, I still see the error every now and then.

itsnotv avatar Jan 29 '22 19:01 itsnotv

You can raise max_outstanding_per_tenant even higher; I've set mine to 4096 now. But I'm afraid you can never avoid 'too many outstanding requests' completely. As far as I understand (still learning...), the more data you try to load, the more often you will hit this limit. In my case, 'loading more data' happens because in Grafana I want to view a whole 721 hours (30 days), or because I've crammed too many queries into one graph.

I'm still working on finding the right trade-off between memory-usage and speed. Below, you'll see my current partial configuration, relevant to this specific issue.

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  # Read timeout for HTTP server
  http_server_read_timeout: 3m
  # Write timeout for HTTP server
  http_server_write_timeout: 3m

query_range:
  split_queries_by_interval: 0
  parallelise_shardable_queries: false

querier:
  max_concurrent: 2048

frontend:
  max_outstanding_per_tenant: 4096
  compress_responses: true

yakob-aleksandrovich avatar Jan 31 '22 08:01 yakob-aleksandrovich

query_range:
  split_queries_by_interval: 0

This part seems to help.

I never ran into this issue with 2.4.1. Something changed in 2.4.2; I hope they restore the default values to what they were before.
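(For context: as the linked #5204 and the defaults quoted further down in this thread suggest, the relevant change appears to be that split_queries_by_interval now defaults to 30m instead of 0, so each panel query is split into many more sub-requests and the per-tenant queue limit is hit much more easily.)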

itsnotv avatar Jan 31 '22 14:01 itsnotv

For completeness, here's the needed config

query_range:
  split_queries_by_interval: 24h

frontend:
  max_outstanding_per_tenant: 1024

This worked for my setup, thanks!

dotdc avatar Feb 21 '22 09:02 dotdc

I can also confirm that on v2.4.2 you will face this issue if you keep the new default value.

Switching the value back to the old default from v2.4.1 solved my problem.

query_range:
  split_queries_by_interval: 0

whoamiUNIX avatar Mar 16 '22 18:03 whoamiUNIX

Bump, this is a serious issue. Please fix it, Loki team.

benisai avatar Apr 11 '22 04:04 benisai

I'm not able to solve my problem using any of the above values/options on version 2.4.2. We rolled back our Loki to version 2.4.1 and this solved our issue. Let's wait for a fix from the Loki team.

DanielVenturini avatar Apr 11 '22 19:04 DanielVenturini

2.5.0 also has this problem

step-baby avatar Apr 13 '22 05:04 step-baby

queue.go#L105-L107

	select {
	case queue <- req:
		q.queueLength.WithLabelValues(userID).Inc()
		q.cond.Broadcast()
		// Call this function while holding a lock. This guarantees that no querier can fetch the request before function returns.
		if successFn != nil {
			successFn()
		}
		return nil
	//default:
	//	q.discardedRequests.WithLabelValues(userID).Inc()
	//	return ErrTooManyRequests
	}

After commenting out this default branch (which immediately returns ErrTooManyRequests when the per-tenant queue is full, so the enqueue now blocks until there is room instead of failing fast), the problem was alleviated.

123shang60 avatar Apr 13 '22 12:04 123shang60

We got the same error with v2.5.0. None of the above options solved the issue so we rolled back to v2.4.1.

LibiKorol avatar May 03 '22 06:05 LibiKorol

Is there an ETA for a fix?

LibiKorol avatar May 24 '22 09:05 LibiKorol

I can confirm this issue exists after an upgrade to the newest version, and I can't even roll back to 2.4.1. I may note that 2.4.1 uses v1beta tags and will soon not be available on GCP.

Alcatros avatar Jun 10 '22 05:06 Alcatros

We also had a lot of "403 too many outstanding requests" errors on Loki 2.5.0 and 2.4.2. Moving back to Loki 2.4.1 made the problem go away.

wuestkamp avatar Jun 15 '22 08:06 wuestkamp

@wuestkamp the issue is really that 2.4.1 has security issues and will soon be deprecated by new k8s cluster versions

Alcatros avatar Jun 15 '22 14:06 Alcatros

So why is Grafana Labs not fixing this issue? I don't understand. Why is it so hard?

benisai avatar Jun 15 '22 15:06 benisai

@benisai I wish I knew. Make sure you are only using this on an isolated network; the CVEs could lead to break-ins, and Grafana is a data pod with potentially lots of customer logs etc... Don't endanger your company by running old versions

Alcatros avatar Jun 15 '22 15:06 Alcatros

@benisai I wish I knew. Make sure you are only using this on an isolated network; the CVEs could lead to break-ins, and Grafana is a data pod with potentially lots of customer logs etc... Don't endanger your company by running old versions.

Homelab only. But the issue still persists without a fix. Or is there a fix?

benisai avatar Jun 15 '22 16:06 benisai

I'm too lazy to set up a configuration file, so I just downgraded to 2.4.1 (homelab). I wish there was a way to configure Loki with environment variables. Configuration files are a pain.
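An aside that may help with the environment-variable wish: Loki can expand environment variables inside its config file when started with -config.expand-env=true, using ${VAR} placeholders. A minimal, illustrative sketch (flag and placeholder behaviour worth double-checking against the docs for your version):

command: "-config.file=/etc/loki/local-config.yaml -config.expand-env=true"

frontend:
  # value taken from the environment at startup, e.g. LOKI_MAX_OUTSTANDING=4096
  max_outstanding_per_tenant: ${LOKI_MAX_OUTSTANDING}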

clouedoc avatar Jun 24 '22 13:06 clouedoc

Hi, I resolved the problem on my side by increasing two default values:

querier:
  max_concurrent: 2048
query_scheduler:
  max_outstanding_requests_per_tenant: 2048

It's not perfect, but this error helped me understand the architecture better and better. Next step: use the query_frontend (it is not mandatory, but it becomes active if we add something for it in the config) to do the queueing, and of course decrease these values as much as possible for my home Docker service.

That works for me with Ansible:

- name: Create loki service
  tags: grafana
  docker_container:
    name: loki
    restart_policy: always
    image: "grafana/loki:2.5.0"
    log_driver: syslog
    log_options:
      tag: lokilog
    networks:
      - name: "loki"
    command: "-config.file=/etc/loki/local-config.yaml -querier.max-outstanding-requests-per-tenant=2048 -querier.max-concurrent=2048"
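Since the original report runs Loki via docker-compose rather than Ansible, a roughly equivalent compose snippet would pass the same flags; this is an illustrative sketch only, and the service name, port and config path are assumptions:

services:
  loki:
    image: grafana/loki:2.5.0
    ports:
      - "3100:3100"
    volumes:
      # mount the loki.yml shown earlier in this issue as the container's config
      - ./loki.yml:/etc/loki/local-config.yaml
    command: "-config.file=/etc/loki/local-config.yaml -querier.max-outstanding-requests-per-tenant=2048 -querier.max-concurrent=2048"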

onovaes avatar Jun 25 '22 14:06 onovaes

Hi, any updates? Thanks for the info, but the problem still persists on Loki version 2.5.0.

LinTechSo avatar Jun 26 '22 10:06 LinTechSo

I increased both frontend.max_outstanding_per_tenant and query_scheduler.max_outstanding_requests_per_tenant to 4096. I do not get any "too many outstanding requests" errors anymore (Loki v2.4.2, tested in the test cluster as well as the production cluster).

query_scheduler:
  max_outstanding_requests_per_tenant: 4096
frontend:
  max_outstanding_per_tenant: 4096
query_range:
  parallelise_shardable_queries: true
limits_config:
  split_queries_by_interval: 15m
  max_query_parallelism: 32

The default values for frontend.max_outstanding_per_tenant and query_scheduler.max_outstanding_requests_per_tenant are too low if you are using dashboards with multiple queries (multiple panels or multiple queries in one panel) over a longer time range, because the queries will be split and result in a lot of smaller sub-queries. Having multiple users on the same dashboard at the same time (or even a single user quickly refreshing the dashboard several times in a row) further increases the count, and you'll reach the limit even quicker. This write-up really helped me understand the query splitting and why there are so many queries: https://taisho6339.gitbook.io/grafana-loki-deep-dive/query-process/split-a-query-into-someones and https://taisho6339.gitbook.io/grafana-loki-deep-dive/query-process/schedule-queries-to-queriers
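As a rough worked example of why the old defaults are easy to exhaust (numbers illustrative, using the 15m split above and the 100-request default queue size quoted later in this thread): one query over a 24h range split by 15m becomes 24h / 15m = 96 sub-queries, so a dashboard with 5 such panels queues about 5 × 96 = 480 sub-requests at once, several times the per-tenant default of 100, before any second user or quick refresh is added on top.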

stefan-fast avatar Jun 27 '22 08:06 stefan-fast

@stefan-fast Thank you so much for your help. By these configurations, I can confirm that the issue is fixed on Loki versions 2.5.0 and 2.4.2.

LinTechSo avatar Jun 27 '22 09:06 LinTechSo

This doesn't work with 2.6.2

Alcatros avatar Jul 01 '22 00:07 Alcatros

So if I understand correctly, the issue is caused by the default settings of:

  • limits_config.max_query_parallelism = 32
  • limits_config.split_queries_by_interval = 30m
  • query_scheduler.max_outstanding_requests_per_tenant = 100

30min * 32 gives you a time range of 16h, which is where max parallelism per query is reached. Now if a single dashboard spanning 16h runs 3 such queries at the same time, you already get 32 * 3 > 100 and hit the too many outstanding requests error? The same applies if several users run such queries/dashboards.

Would reducing the max_query_parallelism also help to avoid this issue?
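For reference, lowering that knob is just a limits_config change; the value below is purely illustrative, and whether it actually avoids the 429s (rather than only slowing the dashboard down) would need testing in your setup:

limits_config:
  max_query_parallelism: 8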

hterik avatar Jul 04 '22 09:07 hterik

Yep, I still have this problem in v2.5.0.

the-elder avatar Jul 12 '22 06:07 the-elder

Reporting in that I have this issue with 2.4.2. The dashboard works fine when I have one panel with 3 queries in it (they are relatively the same, so they may actually be run as one query), but when I add another panel with only one query I get this error, even though I'm using a one-hour timeframe, so it shouldn't be split into too many sub-queries.

RT-Tap avatar Jul 12 '22 08:07 RT-Tap