Support creating histograms using logql

Open chancez opened this issue 3 years ago • 13 comments

Is your feature request related to a problem? Please describe.
I'd like to be able to create histograms using LogQL based on an extracted field. We want to track job/RPC durations and latencies with histograms on a per-user basis, and doing this with labels would result in too high a cardinality.

Describe the solution you'd like
The design in the original draft of LogQL v2 seems reasonable: https://github.com/cyriltovena/loki/blob/a87cf44829841958a3dd78030266ac93eb7269c3/docs/design-documents/2020-03-logql-fields.md#series--histogram-operators

Describe alternatives you've considered
Using CloudWatch, which has the capabilities for this. But we're already sending logs to Loki, so that's not ideal.

chancez avatar Nov 18 '20 23:11 chancez

I think the original histogram doc was largely to enable quantile_over_time. For example, the p99 per user could look something like:

quantile_over_time(0.99,
  {cluster="ops-tools1",container="ingress-nginx"}
    | json
    | __error__ = ""
    | unwrap request_time [1m]) by (user)

Does that work?

owen-d avatar Nov 19 '20 13:11 owen-d

We specifically want to specify our own buckets. We're trying to measure our jobs' RPC request durations/latencies during execution, and we want to know what percentage of the RPCs fall into specific duration/latency bands. For example, the percentage or number of RPCs that took longer than 5ms, by user. Another example is getting the distribution of a value per RPC, or for some number of RPCs in a given window. We have a compiler RPC service whose logs contain the number of instructions in a program per compile job, and we want to get the distribution of instruction counts and be able to filter down to a specific user.
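
To make the request concrete, here is a minimal Python sketch of the computation being asked for, done outside of Loki. It assumes the durations have already been extracted from log lines as (user, seconds) pairs. The function name `bucket_counts`, the sample data, and the bucket bounds are all hypothetical, and the cumulative `le` buckets (Prometheus style) are an assumption, not something decided in this thread.

```python
from bisect import bisect_left
from collections import defaultdict

def bucket_counts(samples, bounds):
    """Count samples per user into cumulative 'le' buckets.

    samples: iterable of (user, value) pairs (hypothetical input shape).
    bounds: sorted list of upper bounds; an implicit +Inf bucket is appended.
    """
    edges = list(bounds) + [float("inf")]
    per_user = defaultdict(lambda: [0] * len(edges))
    for user, value in samples:
        # Index of the first bucket whose upper bound is >= value.
        first = bisect_left(edges, value)
        # Cumulative buckets: the sample counts toward that bucket and all larger ones.
        for i in range(first, len(edges)):
            per_user[user][i] += 1
    return {user: dict(zip(edges, counts)) for user, counts in per_user.items()}

# Made-up RPC durations in seconds, for illustration only.
samples = [("alice", 0.002), ("alice", 0.007), ("alice", 0.004), ("bob", 0.012)]
hist = bucket_counts(samples, [0.005, 0.010])

# Fraction of alice's RPCs that took longer than 5ms.
total = hist["alice"][float("inf")]
slow_fraction = 1 - hist["alice"][0.005] / total
```

The +Inf bucket doubles as the per-user total, which is what makes the "percentage slower than 5ms" figure a single subtraction.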

chancez avatar Nov 19 '20 17:11 chancez

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 20 '20 01:12 stale[bot]

Not stale. Discussion can continue after the holidays.

chancez avatar Dec 20 '20 03:12 chancez

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 20 '21 00:01 stale[bot]

Not stale.

chancez avatar Jan 20 '21 20:01 chancez

Keepalive at least until we can get some more feedback during bug scrub

owen-d avatar Jan 20 '21 23:01 owen-d

Could this look like

buckets_over_time([<le1>, <le2>, <le3>], 
  {<selectors>}
  [10m]
)

Any chance you'd like to try to implement this feature? I could walk you through it; otherwise, I'm not sure when we're going to prioritize this. The general consensus is that we'd accept this.

owen-d avatar Apr 15 '21 13:04 owen-d

I'd potentially be interested. Logically, I'd start with the parser/lexer, probably by looking at how quantile_over_time is implemented. I can follow up in Slack.

chancez avatar Apr 15 '21 17:04 chancez

@chancez Did you find the time to try writing something for the "buckets_over_time" LogQL function? Thanks

jcdauchy-moodys avatar Jul 21 '21 05:07 jcdauchy-moodys

Nope, I started on it, but had to focus on other things, and my new role requires approval to do FOSS work, so I'm unlikely to get back to this in the near future.

chancez avatar Jul 21 '21 17:07 chancez

The syntax could look like this, which allows as many buckets as needed to be added.

buckets_over_time({container="query-frontend"} |= "metrics.go" | logfmt | unwrap duration(duration)[1m],1,10,100) by (type)

cyriltovena avatar Nov 15 '21 10:11 cyriltovena

For those waiting for this to be implemented, I found a very slow, hacky way of faking buckets_over_time by abusing Go text templating in label_format:

[Screenshot 2022-09-15 at 14 01 47]

It'd be great to have a proper way to do this though!

pharaujo avatar Sep 15 '22 13:09 pharaujo

I would like to make a histogram of response times for a webserver. Is that possible with Loki?

(I'm not looking for p99 or anything like that, I just want to see how many requests are in each bucket)

mrbrianevans avatar Sep 30 '22 17:09 mrbrianevans

@mrbrianevans There's currently no function to generate the buckets (that's what this issue is about). That said, that's exactly what I did in my previous reply. I extracted individual times from the webserver logs and created a new le bucket label, using Go text/template to conditionally assign a bucket value to each log line. Then I do sum(count_over_time(...)) by (le) to get the count of requests in each bucket. Do note that this is terribly slow and resource-heavy.
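
The actual query was only shared as a screenshot, so the following is not that LogQL. It is a rough Python sketch of the same two steps, under assumptions: each log line gets exactly one `le` label (the smallest bound that is greater than or equal to its value), and the lines are then counted per label. The helper name `assign_le`, the bounds, and the sample request times are hypothetical.

```python
from collections import Counter

def assign_le(seconds, bounds):
    """Return the smallest bound >= seconds as a label string, or '+Inf'."""
    for bound in sorted(bounds):
        if seconds <= bound:
            return str(bound)
    return "+Inf"

# Made-up request times in seconds, standing in for values extracted from logs.
request_times = [0.03, 0.12, 0.4, 0.08, 2.5]
bounds = [0.1, 0.5, 1.0]

# Equivalent in spirit to: sum(count_over_time(...)) by (le)
counts = Counter(assign_le(t, bounds) for t in request_times)
```

Because every line carries a single label here, the counts are per band rather than cumulative; a cumulative histogram would need a running sum over the sorted bounds.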

pharaujo avatar Oct 03 '22 10:10 pharaujo

The visualisation in your previous post looks like a heatmap of request latency over time. I am simply looking for a histogram of all requests in the currently selected time period, e.g. something like this:

[image]

which shows how many requests fall into each latency bucket.

mrbrianevans avatar Oct 03 '22 11:10 mrbrianevans

I haven't tried that, but IMO the concept would be the same; you'd do [$__range] instead of [1s] and use a different panel type to visualize.

pharaujo avatar Oct 03 '22 11:10 pharaujo