k6 icon indicating copy to clipboard operation
k6 copied to clipboard

Support non-indexable high-cardinality metric tags / metadata

Open na-- opened this issue 3 years ago • 5 comments

Currently, the k6 system metric tags are a mix of high-cardinality ones (url, error, vu and iter) and relatively low-cardinality ones (everything else). This has already been a problem multiple times and has forced us to adopt some hacks to work around the issues, e.g. these workarounds in the built-in outputs: 1, 2, 3). On top of that, even the current high-cardinality tags are going be a big problem for the upcoming time series changes (https://github.com/grafana/k6/issues/2580), but we'll need to add even more and higher cardinality tags for things like tracing (https://github.com/grafana/k6/issues/2128), to store the trace and span IDs. Finally, k6 has no information whether user-defined custom tags have high or low cardinality.

So, because of all of these reasons, we'll need to have some way to differentiate between two distinct types of metric sample tags/metadata:

  1. low-cardinality tags that can be used for indexing... so, in the time series mentioned above, in sub-metric thresholds, in certain outputs that do aggregation based on tags, etc.
  2. and tags with high cardinality, containing metadata that's often unique for every measurement / data point

Different outputs will be free to treat these high-cardinality tags/metadata however they want - they can discard it (probably the behavior of outputs like statsd, datadod, prometheus, etc. that work with time series), sample it (e.g. to export some tracing information), transform it (e.g. in influxdb this will correspond to data point fields) or output it directly like they do to tags (e.g. json, csv and other similar outputs can just write it to the disk, since we don't do any indexing or aggregation, just dump raw data). This will finally allow us to deprecate the hacks we currently have.

The exact API changes on how we'll add this for custom tags in k6 are still under consideration, but the currently favored approach is to extend the tags API to support both strings and objects, something like this:

http.get('http://httpbin.test.k6.io/', {
  tags: {
    my_tag: {
      value: "I'm a a non-indexable tag",
      index: false
    },
    more_tag: "backwards-compatible indexable tag",
  }
);

Unfortunately, we will probably have to do a minor breaking change in the k6 core and transform some of the currently problematic (because of their high cardinality) k6 system tags like vu, iter, url into non-indexable ones by default... So thresholds based on them will no longer work. Warnings about this very likely breaking change were added to k6 v0.39.0 (https://github.com/grafana/k6/pull/2583) for scripts that have thresholds on them, and we mention them in the v0.39.0 release notes (https://github.com/grafana/k6/pull/2581).

na-- avatar Jul 04 '22 11:07 na--

While working on the latest iteration of this, I realized a few issues with the plan to make url an non-indexed tag... :disappointed: However, I think I have an easy to implement suggestion for a better alternative :crossed_fingers:

Here are the problems I saw:

  1. The biggest problem is the k6 cloud actually expects a url tag and we can't unilaterally remove this from k6 :disappointed: and probably k6 users who have already configured other external outputs
  2. The second biggest problem is that we don't emit a name tag for websocket metrics, just a url one, in either the current k6/ws code or in the new xk6-websockets one :disappointed: so we can't unilaterally say that the url is now going to be non-indexed metadata for all metrics...
  3. The third biggest problem is a UX one, setting a threshold on http_req_duration{url:whatever} is kind of nice and logical, whereas name is a k6 term that people first need to be taught about...
  4. Finally, while the k6/net/grpc API has a name tag, its url tag is actually pretty low-cardinality, so it's not a problem for atlas :tada:

So, my proposal to fix all of these issues is this:

  • keep url as an normal (index-able) tag, but always set it equal to the value of the name tag, if name was specified (which is basically what we currently do in the cloud output)
  • if name was not specified, then keep the current behavior and set both url and name to the cleaned-up URL
  • so, the url and name tag values will always be identical, but users will be able to avoid memory runoff effects from high-cardinality metrics by manually setting the name tag
  • however, if the URL value is different from the name tag, then we will save the raw high-cardinality URL value in a new non-indexed raw_url metadata entry, so it's still accessible to users that want it (e.g. in the JSON output)

This will leave us with (mostly) sensible defaults and with an easy way for users to avoid high-cardinality metrics. It will also minimize the disruption to users who already have thresholds on url sub-metrics, only some of them will need to adjust them.

The code for this even looks relatively clean :sweat_smile: https://github.com/grafana/k6/blob/6ed8668af8eba6669e79d141f8e3ec942fa8e367/lib/netext/httpext/transport.go#L84-L108

And if everyone agrees with this approach, we should probably merge a small PR before k6 v0.40.0 is released that rolls back these parts of https://github.com/grafana/k6/pull/2583: https://github.com/grafana/k6/blob/b89a3aaa6bff07907140e6b31880971aeef3570b/metrics/engine/engine.go#L93-L99 https://github.com/grafana/k6/blob/b89a3aaa6bff07907140e6b31880971aeef3570b/cmd/integration_test.go#L318-L320

(edit: created PR for v0.40.0 to remove the warning: https://github.com/grafana/k6/pull/2655)

na-- avatar Aug 18 '22 15:08 na--

Can you remember the reasons we were particularly against this in the original discussion?

mstoykov avatar Aug 18 '22 15:08 mstoykov

No... :disappointed: I tried to find some discussion on the topic, but I don't think we delved that much into it, we mostly discussed other details.

na-- avatar Aug 18 '22 15:08 na--

I'm not getting when the url as an indexable tag will be useful per-se, all the cases seem covered by the name tag and when it will be different we will move to Metadata. For the WebSocket case, I guess is it just a feature that we can add for consistency? Regarding the thresholds, I guess in that case the user would set the name so I'm not sure the UX is better because the threshold will use the low-cardinality value so the name.

codebien avatar Aug 18 '22 16:08 codebien

To put it another way - with https://github.com/grafana/k6/issues/2584#issuecomment-1219618111, we get all of the benefits of https://github.com/grafana/k6/issues/2584#issue-1293064253, but with a much smaller breaking change.

We get a way to have low-cardinality URLs, users just have to set the name tag or use the http.url helper. But we don't break the majority of existing tests that might use a threshold or sub-metric with the url tag. If a script has just a handful of URLs, you don't need to mess with name, it should just work as it is with url, since it will be low-cardinality. And if it isn't, users will have an easy way to fix it with name.

So, only the people with many URLs or with dynamic URLs need to bother with name, everyone else is fine and doesn't need to change anything. And most people with such scripts probably already use name, so they also don't have to do anything. Even more importantly, the k6 cloud continues to work exactly as before, no changes needed there before we can merge https://github.com/grafana/k6/pull/2654.

In summary, it seems to me that it's a much smaller breaking change with basically the same benefits.

I'm not sure the UX is better

Setting a threshold on http_req_duration{url:https://test.k6.io/whatever} seems like a much more understandable thing for a beginner k6 user than setting a threshold on http_req_duration{name:https://test.k6.io/whatever}. URL is a valid concept outside of k6, while a "name" is not. And for the majority of scripts, this will probably be fine, since they don't have thousands of URLs. The rest will need to understand URL grouping and use name, but this is basically already the case if you want to have a threshold on them... The only change is that now the url tag will always be equal to name, instead of just some of the time.

For the WebSocket case, I guess is it just a feature that we can add for consistency?

Sure, I am not against adding it, but we don't have it yet. So, unless we add it right now and release it in v0.40.0, in both websocket APIs, there will be no transitional k6 version that has both name and url websocket tags for people to gracefully migrate. But it's a moot point, since with my suggestion we don't need such a version :sweat_smile:

na-- avatar Aug 18 '22 18:08 na--

After some lengthy internal discussions, we didn't reach consensus if the JS APIs from https://github.com/grafana/k6/pull/2654 are the best one for the job. I'll try to do a longer public summary later, but suffice it to say that there were some strong and valid internal objections when it came to its UX... :sweat_smile:

However, the need to support some sort of high-cardinality metrics metadata is still there and needs to be handled, so we decided to do these things for now:

  1. Split apart the url == name tag fixes that reduce the cardinality of the url tag from https://github.com/grafana/k6/pull/2654 and merge them separately. This was already done. First as a PR (https://github.com/grafana/k6/pull/2703) on top of the atlas branch (https://github.com/grafana/k6/pull/2594). Merging it there allowed us to merge the atlas branch in master without a worry that the k6 built-in system tags will cause high cardinality issues.
  2. Spit apart other mostly refactoring and validation parts from https://github.com/grafana/k6/pull/2654, like ApplyCustomUserTags(), but without any Metadata or JS API changes, and merge them separately in master.
  3. Make another PR that adds support for Metadata in every metrics.Sample Go struct, but without any direct access to it from the JS APIs. So, no custom user-specified metadata, for now. However, this Metadata map[string]string value will be usable by the k6 core code (e.g. for things like iter, vu and maybe raw_url) and by xk6 extensions!
  4. (Everything :arrow_up: would hopefully make it in k6 v0.41.0 at the end of the month, everything below would be nice to have, but unlikely to make it in time.)
  5. Iterate over possible JS APIs and try to deliver a more user-friendly and less error-prone API than non-indexable tags, i.e. something better than tags: { foo: {value: 'bar', index: false}}. We have some ideas we'll iterate on internally and share in the issue later, but it's likely we'll start with some k6/experimental stand-alone API, instead of trying to thread it in every JS HTTP/WS/gRPC/etc. call.

na-- avatar Oct 06 '22 10:10 na--