maximum of series (500) reached for a single query
Describe the bug
I am trying to collect NetFlow data with Loki and build a "Top 10 Destinations" panel. The query produces a large number of different series, and I get the error mentioned in the title.
To Reproduce
sum by (SrcAddr,DstAddr) (sum_over_time({job="netflow", SrcAddr=~"$srcaddr", DstAddr=~"$dstaddr", SamplerAddress=~"$exporter", Proto=~"$protocol", DstPort=~"$dstport"} | json | unwrap Bits[$__interval]))
Expected behavior
I would expect it to return the first 500 series rather than error out completely. I'm not sure how to get around this: in order to calculate the top 10, I need to get all the values and then figure out which ones are largest. I know that Loki includes a topk function, but that calculates per interval, not over the total for the whole time range.
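A sketch of the query shape being asked for here, for illustration: run as an instant query with the range set to the whole dashboard window, topk would rank totals rather than per-step values. This assumes Grafana interpolated a whole-range variable such as $__range for Loki queries, which it does not at this point in the thread:
topk(10, sum by (SrcAddr, DstAddr) (sum_over_time({job="netflow"} | json | unwrap Bits [$__range])))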
Environment: Grafana Cloud
Perhaps what I'm looking for is this: https://github.com/grafana/grafana/issues/25561
I need some way to select "top N series", since I have more than 500 series
https://www.robustperception.io/graph-top-n-time-series-in-grafana
See https://github.com/grafana/loki/pull/3030#discussion_r537352770 where I explain how.
That part makes sense, but it still doesn't solve this problem. That allows me to group, but if I still have over 500 series, it's going to error out. There should be some way to select the top N series over the time range.
You can remove that limit for now if you want. It is in the limits configuration.
I'm thinking about this in the meantime; it's not as easy as it looks.
I'm using Grafana Cloud, which seems to use the default limit of 500. I believe this is basically what I'm talking about: https://www.robustperception.io/graph-top-n-time-series-in-grafana
In Grafana, it requires the ability to create a variable by doing something like this: query_result(topk(5, sum_over_time({job="netflow"} | json | unwrap Bytes[${__range_s}s])))
This should return the top 5 conversations by total Bytes over the range specified in Grafana. Using this I can create a template variable that filters by these labels.
However, Loki doesn't currently support the ${__range_s} variable in Grafana (it only works with Prometheus). It's also unclear if it supports query_result() (it's not mentioned in the documentation).
I don't know exactly how these things interact. If support for __range and query_result() is only a Grafana matter, I think I can close this and open something for Grafana instead.
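For reference, the Prometheus-style pattern from the robustperception article would translate to Loki roughly as below. A dashboard variable (hypothetically named top_dst) is built from query_result() with a regex extracting the label values, and the panel query then filters on it. This is purely illustrative, since neither piece is supported for Loki at this point. The variable query:
query_result(topk(5, sum by (DstAddr) (sum_over_time({job="netflow"} | json | unwrap Bytes[${__range_s}s]))))
and the panel query:
sum by (SrcAddr, DstAddr) (sum_over_time({job="netflow", DstAddr=~"$top_dst"} | json | unwrap Bytes[$__interval]))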
Yep, totally understand. Not sure I want to go in the same direction that Prometheus did.
I'll bring this up to the team.
This is a problem I've encountered as well, namely the ability to limit after grouping is done. E.g., this produces the 'maximum of series' error:
sum by (addr) (count_over_time({job="syslog"} | regexp "(?P<addr>[0-9a-f:.]{6,})"[5m]))
This does not error on max series, but topk takes the top 20 at each evaluation step rather than over the whole time span. It doesn't seem possible to get the top 20 counts after being summed by addr across the full range (and also not possible, as @loganmc10 says, to set the range to be the time span you're querying for):
topk(20, sum by (addr) (count_over_time({job="syslog"} | regexp "(?P<addr>[0-9a-f:.]{6,})"[5m])))
If I'm misunderstanding the issue, I apologise, but it seems to be the same problem.
Small update during bug scrub: there are two issues here.
- We'd like Grafana to interpolate ${__range_s} in upstream queries.
- @wardbekker has started some work on a probabilistic topk implementation, which should alleviate some of this, but it's a problem we're keen on tackling.
I'm getting the same error in my query; however, I'm a newbie to LogQL, so I might just need to optimize my query. Context: I have a web proxy, and I'm simply seeking to show a rate on high duration ("D" in RED). Here's my query:
rate({namespace="ambassador", container="ambassador"}[1m] | regexp ".* \"(?P<method>\\w+)\\s+(?P<path>.*)\\s+(?P<protocol>HTTP\\/\\d+(\\.\\d+)?)\"\\s+(?P<status>\\d{3})\\s+(?P<status_flags>[\\d+-])\\s+(?P<bytes_received>\\d+)\\s+(?P<bytes_sent>\\d+)\\s+(?P<duration>\\d+)\\s+(?P<unk>\\d+)\\s+\"(?P<xforward_for>.*)\"\\s+\"(?P<user_agent>.*)\"\\s+\"(?P<request_id>.*)\"\\s+\"(?P<host>.*)\"\\s+\"(?P<upstream_endpoint>.*)\"" | status == 200 | duration > 300)
What I find most interesting is that when I try this query (testing in Explore), I end up getting a table of values which renders quickly and then disappears, replaced with the error. If I replace the duration value with something like 600, the same thing happens, but it takes longer for the error to occur. I can think of reasons why this might happen, but I'm curious whether this is an actual bug.
Other than this issue, I'm not finding anything on Google that addresses this error. I tried the Grafana Slack #loki channel but got no responses to my question. I've tried optimizing this query, but everything I try (admittedly probably not great) fails to work.
You need to use sum by (...) to aggregate on a specific set of labels.
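For example, applied to the proxy query above, that means wrapping the rate in an aggregation so only a small, fixed set of labels survives. A minimal sketch, keeping only the status label (the full regexp pattern from the earlier comment is abbreviated as a placeholder):
sum by (status) (rate({namespace="ambassador", container="ambassador"} | regexp `<pattern from above>` | status == 200 | duration > 300 [1m]))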
We're really close to having a solution for this. Grafana now supports the $__range variable for Loki, but it still doesn't support query_result() when making dashboard variables.
If it supported query_result() then we'd be able to mimic the Prometheus solution in order to provide a Loki query with a topk list of labels.
Same issue here. This query errors out with the max series 500 error. The strange thing is that this query worked perfectly fine for weeks, then all of a sudden started doing this.
topk(15, sum by (ClientRequestPath) (count_over_time({domain=~"$domain"} | json | ClientRequestURI !~".*(ico|js|jpg|jpeg|png|webp|css|webmanifest)" | ClientRequestPath!~"/|/index.xml|/robots.txt|/xmlrpc.php" [$__interval])))
I just hit this specific issue as well, any news regarding work being done on it?
I'm facing the same issue here. My query worked fine until a few months ago, but now it always shows this error when I set $__interval to about 1d or more. Does anyone have an idea? Here is my query: count(sum by(request_data_credential)(count_over_time({swarm_service="service name",swarm_stack="stack name"} | json | request_type="authForUser" and status=200 and request_data_type!="bot"[$__interval])))
Hello,
The same here.
We have updated to 2.5 and helm chart 0.39.3 => 0.48.3.
The following stopped working:
sum by (...) (count_over_time(...)[24h])
Previously, count_over_time(...)[24h] on its own could not be returned because it produced >500 series, but adding sum by (...) transformed it into <500 series, which was returned to Grafana normally without any error. Now the "maximum of series (500) reached for a single query" error hits with or without sum by (...).
Can somebody tell whether this is a bug, or whether the previous behavior was the bug and it has now been fixed?
Hi, any updates?
same issue.
Facing this issue now ~.~
facing same issue
Same issue here
Hi, any updates?
@cyriltovena Would you please give an update about this issue?
Because I did use sum by:
sum by (remote_addr) (rate({pod="$app", cluster="$cluster", namespace="$namespace" , tag="$tag"} | json | line_format "{{ .log }}" | pattern `<_> - - <_> "<request> <_> <_>" <status> <body_bytes_sent> "<http_referer>" "<http_user_agent>" <remote_addr>` | status = 200 | __error__ != "JSONParserErr" | unwrap body_bytes_sent [$__interval]) )
I used this query, but apparently the problem still persists.
https://community.grafana.com/t/maximum-of-series-500-reached-for-a-single-query/64117/4
Hi,
Just go to loki-local-config.yaml and find the limits_config section, then modify it like this:
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 360h
  max_query_series: 100000
  max_query_parallelism: 2
The default value of max_query_series is 500, so make it bigger as you need. max_query_parallelism is the maximum amount of CPU you can use for parallelism.
Regards, Fadjar Tandabawana
Any update on solving this issue for Grafana Cloud users?
If you're getting this on Grafana Cloud you should open a support ticket.
I'm not sure I understand.
I have this simple query: count(rate({app="myapp"} |= `` | json | __error__=`` | request_time < 0.3 [5m])). I'm only looking at the last 15 minutes' worth of data, in that interval. Maybe I don't understand what the error message means, but I thought the "series" part referred to how many buckets there are in the result. That can't be the case, though, since I should only have 3 buckets with a 15-minute range and a 5-minute interval.
So what am I missing? Is there a way around it other than changing max_query_series? And how do I know I'm using a large enough value for that property so I don't hit this issue again?
The problem here has to do with query planning and unfortunately will differ query by query. First off, the max_query_series limit caps the number of series returned by a query. This has ramifications for memory pressure (allocating memory for, say, millions of series could crash Loki), performance, and usability (rendering thousands of series in a Grafana panel can freeze up a web browser). This is especially important for Loki because we can extract labels from the contents of logs, creating series.
The second part of this equation is query planning. Loki creates many smaller "sharded" queries, runs them in parallel, and aggregates the results before returning the response to the end user. There's a lot of complexity in how this actually works and when/how sharding can be applied, but there are some cases where the entire query cannot be "pushed down" and aggregated on the queriers. count(<foo>) is one of these.
In this case, it actually ends up looking something like
count(
  downstream<
    rate(
      {foo="bar"} | json[1m]
    ),
    shard=0_of_2
  > ++
  downstream<
    rate(
      {foo="bar"} | json[1m]
    ),
    shard=1_of_2
  >
)
In this case, the query which is actually executed on each downstream querier is a simple rate query. This is important because, while count aggregates away all of the underlying series and returns a single series that is the count of all series passed to it, rate does no such aggregation. What happens here is that rate creates a ton of series by extracting labels in the json stage, which triggers the max_query_series limit and fails the query.
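For contrast, an aggregation like sum can be pushed down into each shard, so every downstream querier returns a single series and the series limit is never approached. In the same illustrative plan notation as above, that would look roughly like:
sum(
  downstream<
    sum(
      rate(
        {foo="bar"} | json[1m]
      )
    ),
    shard=0_of_2
  > ++
  downstream<
    sum(
      rate(
        {foo="bar"} | json[1m]
      )
    ),
    shard=1_of_2
  >
)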
Suggestion
Looking at the query you provided, count(rate({app="myapp"} |= `` | json | __error__=`` | request_time < 0.3 [5m])), using count isn't likely what you want: count returns how many series exist. sum(rate({app="myapp"} |= `` | json | __error__=`` | request_time < 0.3 [5m])) is likely what you want, as it'll return the per-second rate over the last 5 minutes for myapp logs with request_times under 0.3.
@owen-d , thank you for a very elaborate answer!
Hi,
I have 99,998 data records loaded in the period between 09:06 and 09:10, and no more records after this time. All records have the label exporter="OTLP" and another label, templateId, with up to 5,000 unique values. I have this INSTANT query: topk(5, sum(count_over_time({exporter="OTLP"}[$__range])) by(templateId)) It returns results when I select the period between 09:00 and 09:15 or between 09:00 and 09:30. It starts returning the "maximum of series (500) reached for a single query" error when I select the period between 09:00 and 09:45, or between 09:00 and 10:00 and beyond.
How is the [$__range] size related to the number of series? Is this expected behavior?
I am also having this problem. What is the maximum "acceptable" value for the max_query_series parameter?