
Documentation: x-trickster-result & status=proxy-only

Open disfluxly opened this issue 5 years ago • 7 comments

I can't seem to find any documentation around the possible values in the x-trickster-result response header.

I've been checking this to attempt to debug why some queries seem to hit cache, while others seem to only go through HttpProxy mode, but not knowing the possible values and the reasons why those values may appear makes this difficult.

For example, on the same Grafana dashboard, some panels are getting cache phit while other panels are getting status=proxy-only.

I have tracing enabled, but it doesn't appear that there's a lot of attribution on the traces. All I can see is that, unlike other traces where the request is passed to DeltaProxyCacheRequest, this request is immediately passed to ProxyRequest.

In addition, I don't see any logs (in DEBUG mode) when a query is status=proxy-only.

It'd be nice to be able to know why a query was proxied and not cached. I think this should both be set as an attribute on the parent span and also within DEBUG logs.

disfluxly avatar Sep 23 '20 13:09 disfluxly

One reason that you might get a proxy-only is that a user requests a very old time range that is not in the configured cacheable window. We already drop a debug log when this happens, but can also add a span attribute for the "range is too old" condition.

The other reasons would be backend provider-specific.

With Prometheus, we pretty much trust that any request against /query is object cacheable and against /query_range is timeseries cacheable, unless the "too old" condition applies.

For providers that use a query language where the time range is embedded in the query itself, such as InfluxDB and ClickHouse (as opposed to using separate URL parameters), we actually perform a rough parse of the query in order to extract those attributes, since they must be manipulated based on what is in cache versus what is still needed, and then the needed ranges are swapped into the base query when making the backend request. In Trickster 1.x, all of that is based on regex matches (e.g., InfluxDB), which make a best effort to identify cacheable queries based on common Grafana patterns. If any part of that process does not work out, we'll go ahead and just proxy the request. That is likely what you are seeing. We can definitely add a debug log and span attribute for this, but the only detail we could give in 1.x is that regex matching failed.
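To make the regex approach concrete, here is a minimal sketch of pulling the time range and step out of a Grafana-style InfluxQL query. The patterns, the `extract_time_attrs` name, and the returned fields are all hypothetical and are not Trickster's actual regex; the point is only that a query which doesn't match the expected shape yields nothing, and the caller would then fall back to proxying.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only (not Trickster's real regex).
# Step, as in "GROUP BY time(10s)".
STEP_RE = re.compile(r"group\s+by\s+.*?time\((\d+)(ms|s|m|h|d)\)", re.IGNORECASE)
# Absolute epoch-millisecond bounds, as in "time >= 1ms and time <= 2ms".
RANGE_RE = re.compile(
    r"time\s*>=?\s*(\d+)ms\s+and\s+time\s*<=?\s*(\d+)ms", re.IGNORECASE
)

UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def extract_time_attrs(query: str) -> Optional[dict]:
    """Return start/end/step in milliseconds, or None if the query
    does not match the expected shape (caller would proxy it)."""
    step = STEP_RE.search(query)
    rng = RANGE_RE.search(query)
    if step is None or rng is None:
        return None
    return {
        "start_ms": int(rng.group(1)),
        "end_ms": int(rng.group(2)),
        "step_ms": int(step.group(1)) * UNIT_MS[step.group(2).lower()],
    }
```

A query without a `GROUP BY time(...)` clause returns `None` here, which mirrors the "regex matching failed, so just proxy" behavior described above without being able to say why it failed.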

With Trickster 2.0, we've implemented our own extensible lexing and parsing solution for sql-ish query languages that is a little more robust, and should be able to give more detail about why it failed (e.g., when an influxql query is missing the group by time($duration) clause, it will tell you).

If you can provide examples of queries that are coming back as proxy-only (feel free to obfuscate the non-time field names), we can expand the regex to be inclusive of those patterns. Any chart that Grafana and Chronograf can render, we definitely want to be able to cache.

The X-Trickster-Result definitely needs its own documentation page. While the Cache Statuses (like proxy-only) are documented with the Caches, we'll get a doc published to cover the main parts of the header value and its format.
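Until that page exists, a small helper can make the header easier to inspect while debugging. This is a sketch that assumes the header value is a `;`-separated list of `key=value` pairs (only `status=proxy-only` appears in this thread; the `engine` field and its value in the usage note are assumptions), and `parse_result_header` is a hypothetical name.

```python
def parse_result_header(value: str) -> dict:
    """Split an X-Trickster-Result value into a dict.

    Assumes a ';'-separated list of key=value pairs; this format is an
    assumption, not taken from official documentation.
    """
    fields = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate empty segments such as a trailing ';'
        key, sep, val = part.partition("=")
        fields[key.strip().lower()] = val.strip() if sep else ""
    return fields
```

For example, `parse_result_header("engine=HTTPProxy; status=proxy-only")["status"]` would give `"proxy-only"`, which can then be logged per panel to see which queries bypass the cache.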

jranson avatar Sep 23 '20 14:09 jranson

Gotcha, I'm using Influx so I'm betting that it's something with the parsing, although it's a query directly formulated by Grafana.

Actually, I figured out why it's doing it.

If a panel has multiple queries in it, Grafana will separate each query with a ; when it sends it to the datasource. Now, if you disable a query in that panel, Grafana will keep the ;.

I updated my test repo (https://github.com/disfluxly/trickster_docker_compose) with this. If you look at the cpu panel on the Internal dashboard you'll see what I'm talking about. There are 2 queries on there. Query A is disabled. Grafana still sends Query B as though it's multiple queries, so it's prefixed by a ;. This causes Trickster to just proxy it.

As soon as you enable Query A, caching starts to work.

How does Trickster handle multiple queries being sent? Does it split on ; and cache each individual query separately? Or does it just cache the entire multi-query string? I'd assume the former as the latter would cause a lot of problems.

disfluxly avatar Sep 23 '20 15:09 disfluxly

Currently the entire query string is hashed to a single cache key and, if it is a compound query (delimited by ;), the various result sets returned are isolated into their own silos under that cache key. We do it that way because we want to pass the user's original query string 1:1 up to InfluxDB, with only the time ranges modified, just like Grafana does if Trickster were not in the path. So any time you disable or enable specific subqueries (by disabling the legend series in a Grafana chart), it will create a new cache object for the new view, since the underlying query string will then hash to a different key. Not ideal, but not the end of the world, since most people are refreshing the same charts periodically and will only incidentally change the view. That would result in a one-time cache miss on the legend change, and then return to subsequent phits as the dashboard auto-refreshes. I think there is some room for optimization there, as you suggest, but it is dependent upon some technology we are adding to 2.0 to do federated dataset merges. So we'll revisit that later this year, since the only benefit would be a marginally smaller cache utilization.
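As a rough illustration of that keying scheme, the sketch below hashes the whole query string to one key and stores each statement's result set in its own silo under it. The hash function, the `cache_key` and `store` names, and the in-memory dict are assumptions for illustration, not Trickster's implementation; it also assumes the time range has already been normalized out of the string before hashing.

```python
import hashlib

# In-memory stand-in for the cache: key -> {statement index: result set}.
cache: dict = {}


def cache_key(query: str) -> str:
    """Hash the full (possibly compound) query string to a single key."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def store(query: str, result_sets: list) -> str:
    """Store one cache object per query string, with each statement's
    result set kept in its own silo under that key."""
    key = cache_key(query)
    cache[key] = {i: rs for i, rs in enumerate(result_sets)}
    return key
```

With this scheme, `"SELECT a FROM m; SELECT b FROM m"` and `"SELECT b FROM m"` land under different keys, so toggling a subquery creates a new cache object even though one result set is identical.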

We'll take a look at the updated docker-compose and make sure the regex will account for those stray semicolons and issue a 1.1.4 release soon. Since it's not generating critical panics, it may take a bit longer to circle back on, since we're heavily focused on getting the first beta of 2.0 released.

BTW, when parsing the query via regex, we only look at the first query in the list (e.g., to the first ; if present) and trust that any subsequent queries in the compound will have the same time ranges and field arrangement, since it would be really difficult for a dashboard to deviate from that.
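One way to pick out that first query while tolerating the stray leading semicolon from a disabled Grafana query is sketched below. `first_statement` is a hypothetical helper, not Trickster code, and it does not handle semicolons inside quoted string literals.

```python
def first_statement(query: str) -> str:
    """Return the first non-empty statement of a ';'-delimited query.

    Leading semicolons and whitespace are skipped so that a query like
    ';SELECT ...' still yields the SELECT. Semicolons inside string
    literals are not handled in this sketch.
    """
    trimmed = query.lstrip("; \t\r\n")
    head, _, _ = trimmed.partition(";")
    return head.strip()
```

By contrast, a naive `query.split(";")[0]` returns an empty string for `";SELECT ..."`, which no cacheability pattern could match.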

jranson avatar Sep 23 '20 16:09 jranson

Gotcha. So when's the 2.0 beta targeted for? :)

Also, did you want me to leave this open as a reminder for fixing the miscellaneous ; in the regex & creating docs for x-trickster-result?

disfluxly avatar Sep 23 '20 19:09 disfluxly

Yeah, let's definitely leave this open while we work out the semicolon thing! I was hoping to have the first 2.0 beta out already, but it's taking a bit longer to get fully running. We're now targeting the first week of October, so stay tuned!

jranson avatar Sep 23 '20 19:09 jranson

@jranson - How's the 2.0 launch coming along?

disfluxly avatar Nov 06 '20 19:11 disfluxly

@disfluxly after a lot of work behind the scenes, we released the first beta yesterday! We'd love it if you'd give it a try and let us know of any issues, and whether the semicolon issue is fixed up for you. Thanks for your patience.

jranson avatar Mar 09 '21 16:03 jranson