
maximum of series (500) reached for a single query

Open loganmc10 opened this issue 3 years ago • 37 comments

Describe the bug
I am collecting Netflow data with Loki and trying to create a "Top 10 Destinations" panel. The query produces a large number of different series, and I get the error mentioned in the title.

To Reproduce
sum by (SrcAddr,DstAddr) (sum_over_time({job="netflow", SrcAddr=~"$srcaddr", DstAddr=~"$dstaddr", SamplerAddress=~"$exporter", Proto=~"$protocol", DstPort=~"$dstport"} | json | unwrap Bits[$__interval]))

Expected behavior
I would expect it to return the first 500 series rather than completely erroring out. I'm not sure how I can get around this. In order to calculate the top 10, I need to get all the values and then figure out which ones are largest. I know that Loki includes a topk function, but that calculates per interval, not over the total for the whole time range.
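For illustration, here is roughly what I mean by per-interval topk (a sketch based on the query above, with the variable matchers dropped for brevity); this picks the top 10 at each evaluation step, not over the whole dashboard range:

topk(10,
  sum by (SrcAddr,DstAddr) (
    sum_over_time({job="netflow"} | json | unwrap Bits[$__interval])
  )
)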

Environment: Grafana Cloud

loganmc10 avatar Dec 06 '20 17:12 loganmc10

I guess perhaps what I'm looking for is this: https://github.com/grafana/grafana/issues/25561

I need some way to select "top N series", since I have more than 500 series

https://www.robustperception.io/graph-top-n-time-series-in-grafana

loganmc10 avatar Dec 06 '20 22:12 loganmc10

See https://github.com/grafana/loki/pull/3030#discussion_r537352770 where I explain how.

cyriltovena avatar Dec 07 '20 09:12 cyriltovena

That part makes sense, but it still doesn't solve this problem. That allows me to group, but if I still have over 500 series, it's going to error out. There should be some way to select top N series over the time range

loganmc10 avatar Dec 07 '20 13:12 loganmc10

You can remove that limit if you want for now. It is in the limit configuration.

I'm thinking about this in the meantime, it's not as easy as it looks.

cyriltovena avatar Dec 07 '20 14:12 cyriltovena

I'm using Grafana Cloud, which seems to use the default limit of 500. I believe this is basically what I'm talking about: https://www.robustperception.io/graph-top-n-time-series-in-grafana

In Grafana, it requires the ability to create a variable by doing something like this: query_result(topk(5, sum_over_time({job="netflow"} | json | unwrap Bytes[${__range_s}s]))) This should return the top 5 conversations by total Bytes over the range specified in Grafana. Using this I can create a template variable that filters by these labels.

However, Loki doesn't currently support the ${__range_s} variable in Grafana (it only works with Prometheus). It's also unclear whether query_result() works with Loki (it's not mentioned in the documentation).
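For reference, the Prometheus-flavoured version of that pattern (from the article above) looks roughly like this; the variable name and regex here are just hypothetical, and with Loki as the data source it does not currently work:

# Dashboard variable (type: Query), hypothetical name "topdst":
query_result(topk(5, sum by (DstAddr) (sum_over_time({job="netflow"} | json | unwrap Bytes[${__range_s}s]))))
# Variable regex to extract the label value from each returned series:
/DstAddr="([^"]+)"/
# The panel query then filters on the variable:
sum by (SrcAddr,DstAddr) (sum_over_time({job="netflow", DstAddr=~"$topdst"} | json | unwrap Bits[$__interval]))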

I don't know exactly how these things interact. If support for __range and query_result() is handled purely on the Grafana side, I think I can close this and open an issue against Grafana instead.

loganmc10 avatar Dec 07 '20 14:12 loganmc10

Yep totally understand, not sure I want to go in the same direction that Prometheus did.

I'll bring this up to the team.

cyriltovena avatar Dec 07 '20 15:12 cyriltovena

This is a problem I've encountered also, namely the lack of a way to limit after grouping is done, e.g.

This produces the 'maximum of series' error: sum by (addr) (count_over_time({job="syslog"} | regexp "(?P<addr>[0-9a-f:.]{6,})"[5m]))

This does not error on max series, but the top 20 counts per range are taken from each grouping and summed. It doesn't seem possible to get the top 20 counts after being summed by addr (and also not possible, as @loganmc10 says, to set the range to be the time span you're querying for): topk(20, sum by (addr) (count_over_time({job="syslog"} | regexp "(?P<addr>[0-9a-f:.]{6,})"[5m])))

If I'm misunderstanding the issue, I apologise, but it seems to be the same

afletch avatar Jan 27 '21 10:01 afletch

Small update during bug scrub: There are two issues here,

  1. We'd like Grafana to interpolate ${__range_s} in upstream queries
  2. @wardbekker has started some work on a probabilistic topk implementation, which should alleviate some of this, but it's a problem we're keen on tackling.

owen-d avatar May 06 '21 13:05 owen-d

I'm getting the same error in my query; however, I'm a n3wb to LogQL, so I might just need to optimize my query. Context: I have a web proxy, and I'm simply seeking to show a rate of high-duration requests ("D" in RED). Here's my query:

rate({namespace="ambassador", container="ambassador"}[1m] | regexp ".* \"(?P<method>\\w+)\\s+(?P<path>.*)\\s+(?P<protocol>HTTP\\/\\d+(\\.\\d+)?)\"\\s+(?P<status>\\d{3})\\s+(?P<status_flags>[\\d+-])\\s+(?P<bytes_received>\\d+)\\s+(?P<bytes_sent>\\d+)\\s+(?P<duration>\\d+)\\s+(?P<unk>\\d+)\\s+\"(?P<xforward_for>.*)\"\\s+\"(?P<user_agent>.*)\"\\s+\"(?P<request_id>.*)\"\\s+\"(?P<host>.*)\"\\s+\"(?P<upstream_endpoint>.*)\"" | status == 200 | duration > 300)

What I find most interesting is that when I try this query, what I end up getting (testing in Explore) is a table of values which renders quickly and then disappears, replaced with the error. If I replace the duration value with something like 600, the same thing happens, but it takes longer for the error to occur. I can think of reasons why this might happen, but I'm curious whether this is actually a bug.

Other than this issue, I'm not finding anything on the googs that addresses this error. I tried the Grafana Slack #loki channel but got no responses to my question. I've tried optimizing this query, but everything I try (admittedly probably not great) fails to work.

notjames avatar Jun 02 '21 05:06 notjames

You need to sum by ... to only aggregate on a specific set of labels.
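For example, something along these lines (a rough, untested sketch adapted from the query above, keeping only the status label):

sum by (status) (
  rate(
    {namespace="ambassador", container="ambassador"}
      | regexp ".* \"(?P<method>\\w+)\\s+(?P<path>.*)\\s+(?P<protocol>HTTP\\/\\d+(\\.\\d+)?)\"\\s+(?P<status>\\d{3})\\s+(?P<status_flags>[\\d+-])\\s+(?P<bytes_received>\\d+)\\s+(?P<bytes_sent>\\d+)\\s+(?P<duration>\\d+)\\s+(?P<unk>\\d+)\\s+\"(?P<xforward_for>.*)\"\\s+\"(?P<user_agent>.*)\"\\s+\"(?P<request_id>.*)\"\\s+\"(?P<host>.*)\"\\s+\"(?P<upstream_endpoint>.*)\""
      | status == 200
      | duration > 300
    [1m]
  )
)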

cyriltovena avatar Jun 02 '21 06:06 cyriltovena

We're really close to having a solution for this. Grafana now supports the $__range variable for Loki, but it still doesn't support query_result() when making dashboard variables.

If it supported query_result() then we'd be able to mimic the Prometheus solution in order to provide a Loki query with a topk list of labels.
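In the meantime, an instant query along these lines gets part of the way there for a top-N table panel (a sketch, untested); running it as an instant query means topk is evaluated once over the whole selected range rather than per step:

topk(10,
  sum by (SrcAddr,DstAddr) (
    sum_over_time({job="netflow"} | json | unwrap Bits[$__range])
  )
)

It can still trip the series limit, though, if the sum by itself produces more than max_query_series series.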

loganmc10 avatar Aug 13 '21 01:08 loganmc10

Same issue here. This query errors out with the max series 500 error. The strange thing is that this query worked perfectly fine for weeks, then all of a sudden started doing this.

topk(15, sum by (ClientRequestPath) (count_over_time({domain=~"$domain"} | json | ClientRequestURI !~".*(ico|js|jpg|jpeg|png|webp|css|webmanifest)" | ClientRequestPath!~"/|/index.xml|/robots.txt|/xmlrpc.php" [$__interval])))

B3DTech avatar Aug 19 '21 20:08 B3DTech

I just hit this specific issue as well. Any news regarding work being done on it?

mrbobq avatar Jan 18 '22 14:01 mrbobq

I just hit this specific issue as well. Any news regarding work being done on it?

ghost avatar Mar 13 '22 02:03 ghost

I'm facing the same issue here. My query worked fine a few months ago, but now it always shows this error when I try to set $__interval to about 1d or more. Anyone have an idea for this? Here is my query: count(sum by(request_data_credential)(count_over_time({swarm_service="service name",swarm_stack="stack name"} | json | request_type="authForUser" and status=200 and request_data_type!="bot"[$__interval])))

congchinh262 avatar Apr 27 '22 03:04 congchinh262

Hello, the same here. We have updated to 2.5 and the Helm chart from 0.39.3 => 0.48.3. The following stopped working: sum by (...) (count_over_time(...)[24h]). Previously, count_over_time(...)[24h] on its own could not return >500 series, but wrapping it in sum by (...) reduced it to <500 series and it was returned to Grafana without any error. Now the "maximum of series (500) reached for a single query" error hits with or without sum by (...).

Can somebody tell whether this is a bug, or whether the previous behaviour was a bug that has now been fixed?

andrejshapal avatar Apr 28 '22 23:04 andrejshapal

Hi, any updates?

LinTechSo avatar Jun 07 '22 21:06 LinTechSo

Same issue.

wkshare avatar Jun 15 '22 03:06 wkshare

Facing this issue now ~.~

JohnDotOwl avatar Jun 19 '22 05:06 JohnDotOwl

Facing the same issue.

Harshrai3112 avatar Jun 20 '22 05:06 Harshrai3112

Same issue here

kennie98 avatar Jun 21 '22 13:06 kennie98

Hi, any updates?

@cyriltovena Would you please give an update on this issue? I used sum by: sum by (remote_addr) (rate({pod="$app", cluster="$cluster", namespace="$namespace" , tag="$tag"} | json | line_format "{{ .log }}" | pattern `<_> - - <_> "<request> <_> <_>" <status> <body_bytes_sent> "<http_referer>" "<http_user_agent>" <remote_addr>` | status = 200 | __error__ != "JSONParserErr" | unwrap body_bytes_sent [$__interval]) )

I used this query but apparently, the problem still persists.

LinTechSo avatar Jun 26 '22 10:06 LinTechSo

https://community.grafana.com/t/maximum-of-series-500-reached-for-a-single-query/64117/4 Hi,

Just go to loki-local-config.yaml and find the limits_config section, then modify limits_config like this:

limits_config:
   reject_old_samples: true
   reject_old_samples_max_age: 168h
   retention_period: 360h
   max_query_series: 100000
   max_query_parallelism: 2

The default value of max_query_series is 500, so make it bigger as you need. max_query_parallelism controls how many split sub-queries can run in parallel.

Regards, Fadjar Tandabawana

fadjar340 avatar Jul 08 '22 11:07 fadjar340

Any update on solving this issue for Grafana Cloud users?

alexisbel1 avatar Mar 15 '23 13:03 alexisbel1

If you're getting this on Grafana Cloud you should open a support ticket.

MaxDiOrio avatar Mar 15 '23 14:03 MaxDiOrio

I'm not sure I understand.

I have this simple query: count(rate({app="myapp"} |= `` | json | __error__=`` | request_time < 0.3 [5m])), and I'm only looking at the last 15 minutes' worth of data. Maybe I don't understand what the error message means, but I thought the "series" part referred to how many buckets there are in the result; that can't be the case, though, since I should only have 3 buckets with a 15 minute range and a 5 minute interval.

So what am I missing? Is there a way around it other than changing max_query_series? And how do I know I've picked a large enough value for that property so that I don't hit this issue again?

mastoj avatar Mar 20 '23 20:03 mastoj

The problem here has to do with query planning and unfortunately will differ query by query. First off, the max_query_series limit caps the number of series that can be returned by a query. This has ramifications for memory pressure (allocating memory for, say, millions of series could crash loki), performance, and usability (rendering thousands of series in a grafana panel can freeze up a web browser). This is important for loki because we can extract labels from the contents of logs, creating series.

The second part of this equation is query planning. Loki creates many smaller "sharded" queries, runs them in parallel, and aggregates the results before returning the response to the end user. There's a lot of complexity in how this actually works, when/how sharding can be applied, etc., but there are some cases where the entire query cannot be "pushed down" and aggregated on the queriers. count(<foo>) is one of these.

In this case, it actually ends up something like

count(
  downstream<
    rate(
      {foo="bar"} | json[1m]
    ),
    shard=0_of_2
  > ++
  downstream<
    rate(
      {foo="bar"} | json[1m]
    ),
    shard=1_of_2
  >
)

In this case, the query which is actually executed on each downstream querier is a simple rate query. This is important, because while count aggregates away all of the underlying series, returning a single series that is the count of all series passed to it, rate does no such aggregation.

What happens here is that the rate creates a ton of series by extracting labels from the json stage, which triggers the max_query_series limit and fails the query.

Suggestion: Looking at the query you provided, count(rate({app="myapp"} |= `` | json | __error__=`` | request_time < 0.3 [5m])), using count likely isn't what you want. count returns how many series exist. sum(rate({app="myapp"} |= `` | json | __error__=`` | request_time < 0.3 [5m])) is likely what you want, as it'll return the per-second rate over the last 5 minutes for myapp logs with request_time under 0.3.
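For comparison, a plain sum can be pushed down to the queriers, so each sharded sub-query returns a single pre-aggregated series instead of one series per extracted label combination, and the limit is far less likely to trip. Roughly (a sketch, using the same notation as above):

sum(
  downstream<
    sum(
      rate(
        {foo="bar"} | json[1m]
      )
    ),
    shard=0_of_2
  > ++
  downstream<
    sum(
      rate(
        {foo="bar"} | json[1m]
      )
    ),
    shard=1_of_2
  >
)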

owen-d avatar Mar 28 '23 17:03 owen-d

@owen-d , thank you for a very elaborate answer!

mastoj avatar Mar 28 '23 19:03 mastoj

Hi,

I have 99998 data records loaded in the period between 09:06 and 09:10, and no more records after this time. All records have the label exporter="OLTP" and another label, templateId, with up to 5000 unique values. I have this INSTANT query: topk(5, sum(count_over_time({exporter="OTLP"}[$__range])) by(templateId)). It returns results when I select a period between 09:00 and 09:15 or between 09:00 and 09:30. It starts returning the "maximum of series (500) reached for a single query" error when I select a period between 09:00 and 09:45 or between 09:00 and 10:00 and up.

How is the [$__range] size related to the number of series? Is this expected behavior?

anushauskas avatar Aug 02 '23 09:08 anushauskas

I am also having this problem. What is the maximum "acceptable" value for the max_query_series parameter?

bmgante avatar Nov 28 '23 20:11 bmgante