
Guidelines for developers on how to implement new metrics

Open ArthurSens opened this issue 4 years ago • 10 comments

Hello everyone! Currently, I'm a Trainee under the CNCF's Community Bridge program, working for the KubeVirt community.

My project is to improve KubeVirt's observability by implementing new metrics that represent what is going on in the KubeVirt environment. After getting a little more comfortable with Open Source development, I also started looking for other projects that need help with observability and found Flux.

I've noticed that the first step before doing anything in open source is to write a proposal, so other people can share comments and thoughts and suggest changes based on best practices for what is about to be implemented. However, when discussing new metrics, we can't seem to reach an agreement on some points, mostly because we cannot find a curated guideline for the issues I will present below.

This might be a really extensive issue, but I'd like to provide as much information as possible to present my thoughts :thinking: Hopefully, I won't bore you guys too much :grimacing:

Metrics granularity

Let me give an example. Flux can synchronize Kubernetes manifests in a git repository with a Kubernetes cluster. To do so, when configuring Flux, one needs to specify which remote git repository, which branch, and which directory paths Flux will look at when doing that synchronization.

Flux has a metric called flux_daemon_sync_manifests with a label success=true|false, which indicates the number of manifests being synchronized between repository and cluster and whether the sync succeeded.

When working with a single git repository, this may be enough, but let's say that someone is using Flux to sync multiple repositories with a multi-cluster/multi-cloud environment. It will be hard to pinpoint exactly where the problem is when receiving an alert like flux_daemon_sync_manifests{success="false"} > 1

To solve this, we could take one of two approaches:

  • Tell users to use different Prometheus/Alertmanager servers for each Flux deployment
  • Add new labels to flux metrics that will help to pinpoint problems and set up better alerts.

I think everyone agrees that we should stick with the latter. But how much information is too much?

The most obvious labels are git_repository, git_branch, and path, but we could also add some not-so-obvious ones, like manifest, which would tell exactly which file failed or succeeded the synchronization. We could also split the metric into flux_daemon_sync_manifests_fails and flux_daemon_sync_manifests_success and add an error label to the first one indicating which error was encountered during the sync.

Of course, adding new labels always helps us build more detailed monitoring solutions, but they also need more storage capacity and can cause more performance issues with PromQL (not sure about the last one).

I'm almost sure that a user can drop unnecessary labels from metrics at scrape time, but I don't think that justifies developers adding every label that could be useful to every single use case.
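
Just to make the label-based option more concrete, here is a rough client_golang sketch of what it could look like (the help string, label set, and helper function are illustrative, not Flux's actual code):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical sketch: the success flag plus the "obvious" repo/branch/path
// labels. More specific labels like the manifest file name or a raw error
// string would multiply the number of series, which is the cost discussed above.
var syncManifests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "flux_daemon_sync_manifests",
		Help: "Number of manifests synchronized between git and the cluster.",
	},
	[]string{"success", "git_repository", "git_branch", "path"},
)

func init() {
	prometheus.MustRegister(syncManifests)
}

// recordSync is a hypothetical helper called once per synchronized manifest.
func recordSync(repo, branch, path string, ok bool) {
	success := "false"
	if ok {
		success = "true"
	}
	syncManifests.WithLabelValues(success, repo, branch, path).Inc()
}
```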

Differentiate between what should be a metric and a label

I could use the same example as above. Should flux_daemon_sync_manifests be a single metric with the success label? Or should it be split in two: flux_daemon_sync_manifests_fails and flux_daemon_sync_manifests_success?

But let me give you another example:

KubeVirt is capable of deploying Virtual Machines on top of a Kubernetes cluster, and it exposes metrics regarding the VMs' performance and resource usage.

Looking at node exporter's approach to disk metrics, it exposes one metric for read operations and another for write operations. For example:

  • node_disk_ops_reads - the total number of read operations from a disk device
  • node_disk_ops_written - the total number of write operations to a disk device

KubeVirt's approach, on the other hand, is to expose a single disk metric, but with the label type=read|write.

  • kubevirt_vmi_storage_iops_total - the total number of operations on a disk device, with a label to differentiate read and write ops

Both approaches work just fine, but which one is better? Do they have any differences performance-wise?

Every time a developer knows that a given label's value will ALWAYS be within a pre-defined set of values, the developer can choose whether to implement several metrics or just a single one with an extra identifying label.
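
A rough client_golang sketch of the two styles side by side (metric names prefixed with example_ are illustrative, not the real node exporter or KubeVirt ones):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Approach 1 (node exporter style): one metric name per operation type.
var (
	diskReads = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_disk_ops_reads_total",
		Help: "Total read operations on the disk device.",
	})
	diskWrites = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_disk_ops_writes_total",
		Help: "Total write operations on the disk device.",
	})
)

// Approach 2 (KubeVirt style): one metric name with a bounded "type" label.
// The label only ever takes the values "read" or "write", so the number of
// series stays fixed, and sum without (type) gives the combined rate.
var diskOps = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_storage_iops_total",
		Help: "Total operations on the disk device, split by type.",
	},
	[]string{"type"},
)

func init() {
	prometheus.MustRegister(diskReads, diskWrites, diskOps)
}
```

Either way, the set of possible values is fixed up front, which is what keeps the extra label cheap.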

How to work with historical data

To better explain this one, I guess I will have to show you the problem I'm facing with KubeVirt.

As previously said, KubeVirt deploys VMs on top of Kubernetes. KubeVirt can also migrate VMs between nodes, which is necessary when a node becomes slow or unresponsive. Virtual Machine Instance (VMI) and Virtual Machine Instance Migration (VMIM) are implemented as Kubernetes Custom Resources.

A VMIM carries some useful information, like start and end timestamps, the target node the VMI is being migrated to, the source node it is being migrated from, and which migration method is being used, and it can be in different stages: Succeeded, Running, and Failed.

Every VMI and VMIM posted to the K8s API is stored in etcd, and as long as the VMI still exists in the cluster, we can retrieve its data and expose it easily. Once we delete a VMI, the VMI and all VMIMs related to it are deleted from etcd, and then we can't expose information about them anymore.

However, users want to analyze and correlate problems from previous VMI migrations with the existing ones, so they can identify why some migrations are taking more time than others and why they fail or succeed.

Let me try to explain it like this:

  • Each row is a time-series for a VMI migration metric
  • Each column represents 1h in the timeline
  • Let's assume that the Prometheus server is configured to keep metrics in the HEAD block for 3h before persisting them to disk in the TSDB
  • o represents that the metric was collected at that particular moment in time
  • - represents that the metric was not collected at that particular moment in time
  • [ ] represents where the Prometheus HEAD is pointed to for a particular time series

We can either:

  • Keep VMIM objects in etcd and always expose old migration metrics
    • Too much disk space/memory required for both metrics storage and etcd
    • Will always keep every migration in memory, making it easy to analyze.
/\
| o  o  o  o  o  o  o [o]    #migration 1 UID=1
| o  o  o  o  o  o  o [o]    #migration 2 UID=2
| o  o  o  o  o  o  o [o]    #migration 3 UID=3 (last one for that particular VMI)
------------------------->

Or

  • Remove migration information of old VMIs from etcd
    • Less storage capacity needed
    • Will have to deal with historical data on the Prometheus side
/\
| o  o  o  -  -  -  -  -     #migration 1 UID=1
| -  -  -  o  o [o] -  -     #migration 2 UID=2
| -  -  -  -  -  -  o [o]    #migration 3 UID=3 (last one for that particular VMI)
------------------------->

Let's say that I want to create a dashboard with information about all migrations that have ever happened within my cluster. With the first approach, a simple query like kubevirt_vmi_migration_metric_example would be enough, since everything is in memory.

Once a time series is removed from Prometheus' HEAD, I will have to work with range queries, most probably with remote storage like Thanos or InfluxDB as well. The queries will no longer return instant metric values, but rather range vectors, which need to be treated differently. It's not impossible to do, but it surely must be thought through carefully.
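
To illustrate the difference, here is a rough sketch using the Prometheus Go API client (the server address is an assumption, and the metric name is the placeholder from above; as noted in the replies below, persisted data also remains reachable with normal PromQL, the range query is just one way to look at a past window):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumes a Prometheus server reachable at this address.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	papi := v1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Instant query: the latest samples "now" (an instant vector).
	instant, _, err := papi.Query(ctx, "kubevirt_vmi_migration_metric_example", time.Now())
	if err != nil {
		panic(err)
	}
	fmt.Println("instant:", instant)

	// Range query: the same series over the last 24h (a matrix of samples),
	// which also covers series that stopped being collected hours ago.
	rng := v1.Range{
		Start: time.Now().Add(-24 * time.Hour),
		End:   time.Now(),
		Step:  5 * time.Minute,
	}
	ranged, _, err := papi.QueryRange(ctx, "kubevirt_vmi_migration_metric_example", rng)
	if err != nil {
		panic(err)
	}
	fmt.Println("range:", ranged)
}
```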


I'm sure there are good solutions for everything I'm raising in this issue, and I'm also sure there are several other problems that I couldn't think of right now. What I'm asking for is a centralized place with documentation and guidelines for developers who are trying to improve their applications' observability.

Perhaps it could be study material for a future Monitoring and Observability Certification, by CNCF. :eyes:

But anyway, this would be a great help to anyone who is writing proposals for Open Source projects or anyone who is trying to follow CNCF's guidelines for Cloud Native Observability.

ArthurSens avatar Jul 06 '20 19:07 ArthurSens

Thanks for this (and all the details! :heart:), putting this into our agenda for today: https://docs.google.com/document/d/1_QoF-njScSuGFI3Ge5zu-G8SbL6scQ8AzT1hq57bRoQ/edit#

You are welcome to join us for discussion!

cc @RichiH

bwplotka avatar Jul 07 '20 09:07 bwplotka

I will answer a couple of questions now, but as for today's SIG meeting, it would be awesome to think about how we can make this knowledge available to users.

  • Tell users to use different Prometheus/Alertmanager servers for each Flux deployment
  • Add new labels to flux metrics that will help to pinpoint problems and set up better alerts.

I think everyone agrees that we should stick with the latter. But how much information is too much?

Yup, if Flux gives this functionality you should instrument accordingly.

Of course, adding new labels always helps us build more detailed monitoring solutions, but they also need more storage capacity and can cause more performance issues with PromQL (not sure about the last one).

Both statements are correct. Everything has its price.

I'm almost sure that a user can drop unnecessary labels from metrics at scrape time, but I don't think that justifies developers adding every label that could be useful to every single use case.

No, this would be very hard; ideally, the application itself is a friendly citizen and understands the high-cardinality effect.

The most obvious labels are git_repository, git_branch, and path, but we could also add some not-so-obvious ones, like manifest, which would tell exactly which file failed or succeeded the synchronization. We could also split the metric into flux_daemon_sync_manifests_fails and flux_daemon_sync_manifests_success and add an error label to the first one indicating which error was encountered during the sync.

Those are very good questions. Overall, you need to ask stakeholders what questions you will be asking your monitoring system. In terms of metrics, the questions should mostly be about aggregated values over time. For example:

  • When did the sync to a git repository start to fail? When did it start working again?
  • Did 100% of syncs fail or only 10%, for one or more repositories?

For this, you definitely want to add a git_repo label, as this tells you what repo the sync is related to. For branch and path, you can add them if you think there are questions that require this information. Otherwise, I would skip them, as they take some space but do not impact cardinality. Unless they're changing dynamically; then we can dive deeper.

What matters, and what increases cardinality to the point of being unusable for metrics, is unique, frequently changing info like IDs, IPs, user names, or error strings. That metadata is usually related to events, so logs or traces are more suitable for it.

In terms of errors, there is still a lot of value in knowing the "type" of error. Was it transient? Was it a user error? A server error? A timeout? It's very common to have a label for error type, to at least have graphs showing the high-level reason, like an HTTP status code (:
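
As a rough sketch (hypothetical metric, label, and helper names; client_golang), the idea is to map arbitrary errors onto a small, fixed vocabulary before incrementing the counter, and leave the full error string to logs or traces:

```go
package metrics

import (
	"context"
	"errors"
	"net"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical counter: the error_type label only ever takes a handful of
// values, so it stays cheap, while the full error text belongs in logs/traces.
var syncErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_sync_errors_total",
		Help: "Sync failures by coarse error type.",
	},
	[]string{"error_type"},
)

func init() {
	prometheus.MustRegister(syncErrors)
}

// classify maps an error onto a small, fixed set of types.
func classify(err error) string {
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	case errors.As(err, new(*net.OpError)):
		return "network"
	default:
		return "internal"
	}
}

func recordSyncError(err error) {
	syncErrors.WithLabelValues(classify(err)).Inc()
}
```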

Differentiate between what should be a metric and a label: I could use the same example as above. Should flux_daemon_sync_manifests be a single metric with the success label? Or should it be split in two: flux_daemon_sync_manifests_fails and flux_daemon_sync_manifests_success? (....)

Metric usually means the whole series. I think what you mean is actually the metric name, so abc{...}. And in fact, the name is just another label, __name__ (:

All your examples and questions are very fair, but the rule of thumb is quite easy here: just think about whether you would be grouping aggregations across those labels. If yes, put it in a label; if not, then the metric name is fine. Remember that a new label means a bigger index, so some overhead and cost again.

bwplotka avatar Jul 07 '20 15:07 bwplotka

Once a time series is removed from Prometheus' HEAD, I will have to work with range queries, most probably with remote storage like Thanos or InfluxDB as well. The queries will no longer return instant metric values, but rather range vectors, which need to be treated differently. It's not impossible to do, but it surely must be thought through carefully.

Well, you can totally have Prometheus with longer retention (weeks, months, even years), so there's no need for long-term storage for simple use cases. (: Having 30d retention is very common. After keeping things in the "HEAD", Prometheus saves everything on persistent disk.

bwplotka avatar Jul 07 '20 16:07 bwplotka

Perhaps it could be study material for a future Monitoring and Observability Certification, by CNCF. :eyes:

:smile:

bwplotka avatar Jul 07 '20 16:07 bwplotka

Those are very good questions. Overall, you need to ask stakeholders what questions you will be asking your monitoring system. In terms of metrics, the questions should mostly be about aggregated values over time. For example:

  • When did the sync to a git repository start to fail? When did it start working again?
  • Did 100% of syncs fail or only 10%, for one or more repositories?

For this, you definitely want to add a git_repo label, as this tells you what repo the sync is related to. For branch and path, you can add them if you think there are questions that require this information. Otherwise, I would skip them, as they take some space but do not impact cardinality. Unless they're changing dynamically; then we can dive deeper.

What matters, and what increases cardinality to the point of being unusable for metrics, is unique, frequently changing info like IDs, IPs, user names, or error strings. That metadata is usually related to events, so logs or traces are more suitable for it.

In terms of errors, there is still a lot of value in knowing the "type" of error. Was it transient? Was it a user error? A server error? A timeout? It's very common to have a label for error type, to at least have graphs showing the high-level reason, like an HTTP status code (:

Thanks a lot for this! So for Flux, more labels are needed for sure, but I don't want to create labels that are too specific to a particular manifest. Prometheus will be responsible for the bigger picture, like which repos or environments are showing an unsatisfactory error rate.

The same thing applies to KubeVirt. I'd rather monitor the migration success rate and time-spent percentiles than exactly which migrations are failing or taking too much time.

Logging and tracing will guide users to pinpoint exactly where and what the problem is!

Metric usually means the whole series. I think what you mean is actually the metric name, so abc{...}. And in fact, the name is just another label, __name__ (:

All your examples and questions are very fair, but the rule of thumb is quite easy here: just think about whether you would be grouping aggregations across those labels. If yes, put it in a label; if not, then the metric name is fine. Remember that a new label means a bigger index, so some overhead and cost again.

Got it, thanks! Where I used to work, labels were heavily used to build very specific notification messages. But again, one should use logging and tracing for that. I'll try to use labels only when data is going to be aggregated by them.

Well, you can totally have Prometheus with longer retention (weeks, months, even years), so there's no need for long-term storage for simple use cases. (: Having 30d retention is very common. After keeping things in the "HEAD", Prometheus saves everything on persistent disk.

Yes, but after it is indexed on persistent disk I do have to query it differently, right? my_ephemeral_metric will only return results if it is still in Prometheus' WAL; once it's indexed I'll have to query with my_ephemeral_metric[someRange], which will return a metric vector with value and timestamp.

But again, I guess that is also a problem of wanting to add too many labels. I was thinking of a metric for KubeVirt that will represent time spent on a VM migration, which would have the migration ID as one of its labels and that would be way too specific.

Better to have migration time as percentiles and use tracing and logging to pinpoint migration problems!
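
Something like this rough client_golang sketch is what I have in mind (metric, label, and helper names are hypothetical): a duration histogram keyed only by bounded labels like migration method and result, with percentiles computed later in PromQL via histogram_quantile over the _bucket series.

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical histogram: migration duration keyed only by bounded labels
// (method and result), never by migration UID. Percentiles come out of
// PromQL later, e.g. histogram_quantile(0.9, rate(..._bucket[5m])).
var migrationDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "example_vmi_migration_duration_seconds",
		Help:    "Time spent migrating a VMI between nodes.",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12), // 1s up to roughly 34min
	},
	[]string{"method", "result"},
)

func init() {
	prometheus.MustRegister(migrationDuration)
}

// observeMigration is a hypothetical hook called when a migration finishes.
func observeMigration(method, result string, started time.Time) {
	migrationDuration.WithLabelValues(method, result).Observe(time.Since(started).Seconds())
}
```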

ArthurSens avatar Jul 07 '20 18:07 ArthurSens

Thanks for talking about this at today's SIG meeting, too bad my internet connection went down right when that subject came up :cry:

Please let me know if I can help you guys with anything

ArthurSens avatar Jul 07 '20 18:07 ArthurSens

Yes, but after it is indexed on persistent disk I do have to query it differently, right?

No, it should be available with normal PromQL (:

once it's indexed I'll have to query with my_ephemeral_metric[someRange], which will return a metric vector with value and timestamp.

Well, that's the normal thing, since a normal instant query asks about what is happening now. If you want to check something in the past, you can use a range query or some magic like my_ephemeral_metric offset 4h. And with your approach of using a range, the result "which will return a metric vector with value and timestamp" is actually usually what is super useful. Just use the Graph tab to see it printed nicely over time via a range query instead of raw data (:

But again, I guess that is also a problem of wanting to add too many labels. I was thinking of a metric for KubeVirt that will represent time spent on a VM migration, which would have the migration ID as one of its labels and that would be way too specific.

Yeah, migration ID is too high-cardinality, but you can definitely aggregate that data across multiple migrations and have immediate feedback on how many of them are failing or slow, which is quite important. After all, do you care about a particular migration ID, or about the problematic ones and the scale of the problem? Usually metrics answer the latter (:

Thanks for talking about this at today's SIG meeting

You're welcome! If you want, we can give you our initial docs for review when they're ready (:

bwplotka avatar Jul 07 '20 19:07 bwplotka

You're welcome! If you want, we can give you our initial docs for review when they're ready (:

Sure, I'll keep an eye on everything this SIG is doing and give any feedback I think could be relevant.

ArthurSens avatar Jul 07 '20 20:07 ArthurSens

Added some early docs, PTAL https://github.com/cncf/sig-observability/pull/21 (:

bwplotka avatar Jul 20 '20 19:07 bwplotka

Moving to v1.1 whitepaper wishlist. Help wanted! (:

bwplotka avatar Aug 01 '23 08:08 bwplotka