loki
Support creating histograms using logql
Is your feature request related to a problem? Please describe.
I'd like to be able to create histograms using LogQL based on an extracted field. We want to track job/RPC durations and latencies with histograms on a per-user basis, and doing this with labels would result in too high a cardinality.
Describe the solution you'd like
The design in the original draft of LogQL v2 seems reasonable: https://github.com/cyriltovena/loki/blob/a87cf44829841958a3dd78030266ac93eb7269c3/docs/design-documents/2020-03-logql-fields.md#series--histogram-operators
Describe alternatives you've considered
Using CloudWatch, which has the capabilities for this. But we're already sending logs to Loki, so that's not ideal.
I think the original histogram doc was largely to enable quantile_over_time. For example, the p99 per user could look something like:
quantile_over_time(0.99,
  {cluster="ops-tools1",container="ingress-nginx"}
    | json
    | __error__ = ""
    | unwrap request_time [1m]) by (user)
Does that work?
We specifically want to specify our own buckets. We're trying to measure our jobs' RPC request durations/latencies during execution, and we want to know what percentage of the RPCs fall into specific duration/latency bands. For example, the percentage or number of RPCs that took longer than 5ms, by user. Another example is getting the distribution of a value per RPC, or for some number of RPCs in a given window. We have a compiler RPC service whose logs contain the number of instructions in a program per compile job, and we want to get the distribution of instruction counts and be able to filter down to a specific user.
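To make the request concrete, here is a minimal Python sketch of the kind of result being asked for. It is not LogQL and not anything Loki provides; it assumes Prometheus-style cumulative `le` buckets, and the function names (`bucket_counts`, `fraction_over`) and the sample data are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def bucket_counts(records, upper_bounds):
    """Count values per user into cumulative 'le' buckets.

    records: iterable of (user, value) pairs, e.g. RPC durations in ms.
    upper_bounds: user-chosen bucket upper bounds; an +Inf bucket is appended.
    """
    bounds = sorted(upper_bounds) + [float("inf")]
    counts = defaultdict(lambda: [0] * len(bounds))
    for user, value in records:
        # Index of the first bucket whose upper bound is >= value.
        first = bisect_left(bounds, value)
        # Cumulative buckets: the value also counts in every larger bucket.
        for i in range(first, len(bounds)):
            counts[user][i] += 1
    return {user: dict(zip(bounds, c)) for user, c in counts.items()}


def fraction_over(user_counts, bound):
    """Fraction of one user's values strictly greater than `bound`."""
    total = user_counts[float("inf")]
    return 1 - user_counts[bound] / total


# Hypothetical sample: (user, RPC duration in milliseconds).
sample = [("alice", 2), ("alice", 8), ("bob", 1), ("bob", 3), ("bob", 20), ("bob", 4)]
counts = bucket_counts(sample, [1, 5, 10])
```

With this sample, `fraction_over(counts["bob"], 5)` gives the share of bob's RPCs slower than 5ms, which is the per-user figure described above.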
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Not stale. After holidays discussion can continue.
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Not stale.
Keepalive at least until we can get some more feedback during bug scrub
Could this look like
buckets_over_time([<le1>, <le2>, <le3>],
{<selectors>}
[10m]
)
Any chance you'd like to try implementing this feature? I could walk you through it; otherwise, I'm not sure when we're going to prioritize this. The general consensus is that we'd accept this.
I'd potentially be interested. Logically, I'd start with the parser/lexer, and I'd probably begin by looking at how quantile_over_time is implemented. I can follow up in Slack.
@chancez Did you find the time to try writing something for a buckets_over_time LogQL function? Thanks
Nope, I started on it, but had to focus on other things, and my new role requires approval to do FOSS work, so I'm unlikely to get back to this in the near future.
The syntax could look like this, allowing as many buckets as needed:
buckets_over_time({container="query-frontend"} |= "metrics.go" | logfmt | unwrap duration(duration)[1m],1,10,100) by (type)
For those waiting for this to be implemented, I found a very slow, hacky way of faking buckets_over_time by abusing Go text templating in label_format:
It'd be great to have a proper way to do this though!
I would like to make a histogram of response times for a webserver. Is that possible with Loki?
(I'm not looking for p99 or anything like that, I just want to see how many requests are in each bucket)
@mrbrianevans There's currently no function to generate the buckets (that's what this issue is about). That said, that's exactly what I did in my previous reply. I extracted individual times from the webserver logs and created a new le bucket label, where I use Go text/template to conditionally assign a bucket value to each log line. Then I do sum(count_over_time(...)) by (le) to get the count of requests in each bucket. Do note that doing this is terribly slow and resource-heavy.
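The logic of that workaround can be sketched in Python. This is not the original query and not Loki code; it only illustrates the two steps described above: a conditional chain tags each value with a single `le` label, then the lines are counted per label. The names `assign_le` and `count_by_le`, the bounds, and the sample values are hypothetical.

```python
from collections import Counter


def assign_le(value, upper_bounds):
    """Return the label of the smallest bound the value fits under, else '+Inf'.

    This mirrors an if / else-if chain in a template: the first matching
    condition wins, so every log line gets exactly one 'le' label.
    """
    for bound in sorted(upper_bounds):
        if value <= bound:
            return str(bound)
    return "+Inf"


def count_by_le(values, upper_bounds):
    """Equivalent in spirit to counting log lines grouped by the 'le' label."""
    return Counter(assign_le(v, upper_bounds) for v in values)


# Hypothetical request times in seconds.
request_times = [0.03, 0.2, 0.4, 0.07, 3.0]
per_bucket = count_by_le(request_times, [0.1, 0.5, 1.0])
```

Because each line lands in exactly one bucket, the counts in this sketch are per band rather than cumulative, and buckets with no matching lines (here `1.0`) do not appear at all.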
The visualisation in your previous post looks like a heatmap of request latency over time. I am simply looking for a histogram of all requests in the currently selected time period, i.e. something that shows how many requests fall into each latency bucket.
I haven't tried that, but IMO the concept would be the same; you'd do [$__range] instead of [1s] and use a different panel type to visualize.