Add metrics documentation
I need to add documentation to https://kubernetes-csi.github.io/docs/sidecar-containers.html
Background:
A new CSI Metrics Library was added to csi-lib-utils and is part of the v0.7.0 release. This library can be used to automatically generate Prometheus metrics for all CSI operations, including total count, error count, and call latency. It was integrated into the following CSI sidecar containers:
- https://github.com/kubernetes-csi/external-provisioner/pull/388
- https://github.com/kubernetes-csi/external-attacher/pull/201
- https://github.com/kubernetes-csi/external-snapshotter/pull/227
- https://github.com/kubernetes-csi/external-resizer/pull/67
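To make the three values mentioned above (total count, error count, call latency) concrete, here is a minimal standard-library sketch of what gets recorded per CSI operation. It is not the csi-lib-utils API: `Recorder`, `Observe`, `Snapshot` and `errDemo` are hypothetical names used only for illustration, and the real library exports Prometheus metrics rather than keeping plain counters.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errDemo is a made-up error used to simulate a failing CSI call.
var errDemo = errors.New("demo failure")

// opStats holds the three values mentioned in the issue for one operation.
type opStats struct {
	Total    int           // number of calls
	Errors   int           // number of calls that returned an error
	Duration time.Duration // summed latency of all calls
}

// Recorder is a hypothetical stand-in for a metrics library.
type Recorder struct {
	mu    sync.Mutex
	stats map[string]*opStats
}

func NewRecorder() *Recorder {
	return &Recorder{stats: map[string]*opStats{}}
}

// Observe runs call, measures how long it took, and updates the
// counters for the given method name. The error is passed through.
func (r *Recorder) Observe(method string, call func() error) error {
	start := time.Now()
	err := call()
	elapsed := time.Since(start)

	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.stats[method]
	if !ok {
		s = &opStats{}
		r.stats[method] = s
	}
	s.Total++
	if err != nil {
		s.Errors++
	}
	s.Duration += elapsed
	return err
}

// Snapshot returns a copy of the counters for method.
func (r *Recorder) Snapshot(method string) opStats {
	r.mu.Lock()
	defer r.mu.Unlock()
	if s, ok := r.stats[method]; ok {
		return *s
	}
	return opStats{}
}

func main() {
	const method = "/csi.v1.Controller/CreateVolume"
	rec := NewRecorder()
	_ = rec.Observe(method, func() error { return nil })
	_ = rec.Observe(method, func() error { return errDemo })

	s := rec.Snapshot(method)
	fmt.Printf("total=%d errors=%d\n", s.Total, s.Errors)
}
```

In the sidecars this bookkeeping happens inside a gRPC interceptor, so the calling code does not have to wrap each operation by hand.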
New flags “--metrics-address” and “--metrics-path” are now part of all four of those sidecars. Driver deployments should set those flags to ensure that the metrics are emitted.
It would be good to have a short example of how those metrics can be used. I'm not sure whether that belongs in that documentation (which is probably more reference-oriented) or in a blog post.
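As a rough illustration of how two such flags typically interact, the following standard-library sketch parses them and serves a placeholder body at the configured path. Only the flag names come from this issue. The defaults (empty address meaning "disabled", `/metrics` as path), the helper names and the placeholder body are assumptions, and this is not the sidecars' actual code.

```go
package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"os"
)

// parseMetricsFlags reads the two flags from args.
// The default values are assumptions made for this sketch.
func parseMetricsFlags(args []string) (address, path string, err error) {
	fs := flag.NewFlagSet("sidecar", flag.ContinueOnError)
	fs.StringVar(&address, "metrics-address", "", "TCP address to listen on, e.g. :8080 (empty: assumed to disable metrics)")
	fs.StringVar(&path, "metrics-path", "/metrics", "HTTP path that serves the metrics")
	err = fs.Parse(args)
	return address, path, err
}

// newMetricsMux serves body at path; any other path gets a 404.
func newMetricsMux(path, body string) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc(path, func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain")
		fmt.Fprint(w, body)
	})
	return mux
}

// probe sends an in-process GET request to h without opening a socket.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	address, path, err := parseMetricsFlags(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	mux := newMetricsMux(path, "# real metrics would be written here\n")

	if address == "" {
		code, _ := probe(mux, path)
		fmt.Printf("no --metrics-address given, not listening (handler for %s answers %d)\n", path, code)
		return
	}
	log.Fatal(http.ListenAndServe(address, mux))
}
```

Running the sketch with `--metrics-address=:8080` and fetching `http://localhost:8080/metrics` returns the placeholder body. Without the flag nothing listens, which is why a deployment that omits the address would expose no metrics under these assumptions.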
For a full example, integration with Prometheus and a Grafana dashboard would be useful. While investigating this, I found: https://github.com/helm/charts/tree/master/stable/prometheus#scraping-pod-metrics-via-annotations
But that only works for a single metrics endpoint per pod. When running external-provisioner, external-attacher, external-snapshotter and external-resizer all in the same StatefulSet, and thus in the same pod, it won't be that easy, right?
See https://github.com/prometheus/prometheus/issues/3756
CSI calls issued by kubelet are not exported yet?
Would it make sense for CSI drivers to export the same function count metric?
The code in https://github.com/saad-ali/csi-lib-utils/blob/e9a22428988a90ba8d833b5e235fcd22d16cd5fa/metrics/metrics.go currently doesn't support that:
- only has an interceptor for the gRPC client, but not the server
- hard-codes "csi_sidecar" as subsystem
The subsystem string then appears in metrics names like csi_sidecar_operations_seconds_count.
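The two limitations listed above could be lifted by making the subsystem a parameter and by allowing the same wrapper on the server side. The sketch below shows the idea with plain functions instead of gRPC interceptors, so it runs with the standard library only. `Metrics`, `Wrap` and the subsystem `csi_driver` are hypothetical; only `csi_sidecar` and the resulting metric name are taken from this issue.

```go
package main

import "fmt"

// handler stands in for one unary gRPC call, identified by method name.
type handler func(method string) error

// Metrics counts calls under a configurable subsystem prefix.
// It is not safe for concurrent use; a real implementation would be.
type Metrics struct {
	subsystem string
	calls     map[string]int // method name -> number of calls seen
}

func NewMetrics(subsystem string) *Metrics {
	return &Metrics{subsystem: subsystem, calls: map[string]int{}}
}

// CountName returns the name under which the call count is exported,
// for example csi_sidecar_operations_seconds_count.
func (m *Metrics) CountName() string {
	return m.subsystem + "_operations_seconds_count"
}

// Wrap counts every call and then delegates to next. The same wrapper
// works for the calling side (sidecar) and the serving side (driver).
func (m *Metrics) Wrap(next handler) handler {
	return func(method string) error {
		m.calls[method]++
		return next(method)
	}
}

// Calls returns how many calls were seen for method.
func (m *Metrics) Calls(method string) int {
	return m.calls[method]
}

func main() {
	const method = "/csi.v1.Controller/CreateVolume"

	driver := NewMetrics("csi_driver")   // hypothetical driver-side subsystem
	sidecar := NewMetrics("csi_sidecar") // subsystem currently hard-coded in the library

	server := driver.Wrap(func(string) error { return nil })
	client := sidecar.Wrap(server) // the "transport" is a direct function call here

	_ = client(method)
	fmt.Println(sidecar.CountName(), sidecar.Calls(method))
	fmt.Println(driver.CountName(), driver.Calls(method))
}
```

In grpc-go the corresponding hook points would be a unary client interceptor in the sidecar and a unary server interceptor in the driver.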
I could imagine that correlating those different counts may be useful, for example to detect when calls have problems at the transport level and don't reach the CSI driver.
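A minimal sketch of such a correlation, assuming both the sidecar and the driver export a per-method call count: any method for which the sidecar counted more calls than the driver points at calls that were lost before reaching the driver. The function name `unreached` and the numbers in `main` are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// unreached returns, per method, how many more calls the client side
// counted than the server side. Methods without a gap are omitted.
func unreached(client, server map[string]int) map[string]int {
	gaps := map[string]int{}
	for method, sent := range client {
		if diff := sent - server[method]; diff > 0 {
			gaps[method] = diff
		}
	}
	return gaps
}

func main() {
	// Made-up sample counts, as they might be read from both endpoints.
	sidecar := map[string]int{
		"/csi.v1.Controller/CreateVolume": 10,
		"/csi.v1.Controller/DeleteVolume": 4,
	}
	driver := map[string]int{
		"/csi.v1.Controller/CreateVolume": 7,
		"/csi.v1.Controller/DeleteVolume": 4,
	}

	gaps := unreached(sidecar, driver)

	// Sort the method names so the output order is stable.
	methods := make([]string, 0, len(gaps))
	for m := range gaps {
		methods = append(methods, m)
	}
	sort.Strings(methods)
	for _, m := range methods {
		fmt.Printf("%s: %d call(s) did not reach the driver\n", m, gaps[m])
	}
}
```

In practice the two counters are scraped at slightly different times and reset when a container restarts, so a small or short-lived gap is not necessarily a transport problem.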
After having read through the config documentation I believe I understand enough of it to replace or extend the example configuration such that it scrapes each sidecar container individually.
But then the problem remains that admins will have to add that to their Prometheus configuration. I don't see an easy way to do that when deploying through helm. If I understand it right, one can replace the entire default config, but not add to it.
> If I understand it right, one can replace the entire default config, but not add to it.
That turned out to be wrong. There is some limited support for extending the default configuration.
I found a solution with an additional, generic scrape config and filed https://github.com/helm/charts/issues/22899 to figure out whether that is something that should be supported by the Helm chart out-of-the-box.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
/help
@msau42: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.