Add metrics documentation

Open saad-ali opened this issue 5 years ago • 13 comments

I need to add documentation to https://kubernetes-csi.github.io/docs/sidecar-containers.html

Background:

A new CSI Metrics Library was added to csi-lib-utils and is part of the v0.7.0 release. This library can be used to automatically generate Prometheus metrics for all CSI operations, including total count, error count, and call latency. This library was integrated into the following CSI sidecar containers:

  • https://github.com/kubernetes-csi/external-provisioner/pull/388
  • https://github.com/kubernetes-csi/external-attacher/pull/201
  • https://github.com/kubernetes-csi/external-snapshotter/pull/227
  • https://github.com/kubernetes-csi/external-resizer/pull/67

New flags “--metrics-address” and “--metrics-path” are now part of all four of those sidecars. Driver deployments should set those flags to ensure that the metrics are emitted.
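To make the three values above concrete, here is a minimal, stdlib-only Go sketch of what recording total count, error count, and latency per CSI method could look like. It is not the csi-lib-utils implementation: the `recorder` type, its methods, and the `demo_csi_*` metric names are hypothetical and chosen only for illustration. Only the two flag names come from this issue; the defaults shown are the sketch's own.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"log"
	"net/http"
	"sort"
	"strings"
	"sync"
	"time"
)

// opStats holds the three values mentioned above for one CSI method.
type opStats struct {
	total, failed int
	seconds       float64
}

// recorder is a hypothetical stand-in for a metrics manager.
type recorder struct {
	mu  sync.Mutex
	ops map[string]*opStats
}

func newRecorder() *recorder { return &recorder{ops: map[string]*opStats{}} }

// observe records one finished call.
func (r *recorder) observe(method string, seconds float64, failed bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	s, ok := r.ops[method]
	if !ok {
		s = &opStats{}
		r.ops[method] = s
	}
	s.total++
	if failed {
		s.failed++
	}
	s.seconds += seconds
}

// timed wraps a call and records its outcome, similar in spirit to
// what a gRPC client interceptor would do around each CSI call.
func (r *recorder) timed(method string, call func() error) error {
	start := time.Now()
	err := call()
	r.observe(method, time.Since(start).Seconds(), err != nil)
	return err
}

// render emits the counters as "name{label} value" text lines.
// The demo_csi_* names are made up for this sketch.
func (r *recorder) render() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	methods := make([]string, 0, len(r.ops))
	for m := range r.ops {
		methods = append(methods, m)
	}
	sort.Strings(methods) // deterministic output order
	var b strings.Builder
	for _, m := range methods {
		s := r.ops[m]
		fmt.Fprintf(&b, "demo_csi_calls_total{method=%q} %d\n", m, s.total)
		fmt.Fprintf(&b, "demo_csi_call_errors_total{method=%q} %d\n", m, s.failed)
		fmt.Fprintf(&b, "demo_csi_call_seconds_sum{method=%q} %g\n", m, s.seconds)
	}
	return b.String()
}

func main() {
	// Flag names are from this issue; the defaults are assumptions of the sketch.
	addr := flag.String("metrics-address", "", "TCP address to serve metrics on; empty prints once and exits")
	path := flag.String("metrics-path", "/metrics", "HTTP path of the metrics endpoint")
	flag.Parse()

	rec := newRecorder()
	rec.timed("CreateVolume", func() error { return nil })
	rec.timed("CreateVolume", func() error { return errors.New("simulated failure") })

	if *addr == "" {
		fmt.Print(rec.render())
		return
	}
	http.HandleFunc(*path, func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, rec.render())
	})
	log.Fatal(http.ListenAndServe(*addr, nil))
}
```

Run without flags, the sketch prints the counters once and exits; with an address set, it serves them over HTTP at the given path, which is the behavior a scraper would depend on.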

saad-ali avatar Jun 19 '20 00:06 saad-ali

It would be good to have a short example of how those metrics can be used. Not sure whether that belongs in that documentation (which is probably more reference-oriented) or in a blog post.

pohly avatar Jun 19 '20 06:06 pohly

For a full example, integration with Prometheus and a Grafana dashboard would be useful. While investigating this, I found: https://github.com/helm/charts/tree/master/stable/prometheus#scraping-pod-metrics-via-annotations

But that only works for a single metrics endpoint per pod. When running external-provisioner, external-attacher, external-snapshotter and external-resizer all in the same StatefulSet, and thus in the same pod, it won't be that easy, right?

pohly avatar Jun 19 '20 06:06 pohly

See https://github.com/prometheus/prometheus/issues/3756

pohly avatar Jun 19 '20 07:06 pohly

CSI calls issued by kubelet are not exported yet?

pohly avatar Jun 19 '20 11:06 pohly

Would it make sense for CSI drivers to export the same function count metric?

The code in https://github.com/saad-ali/csi-lib-utils/blob/e9a22428988a90ba8d833b5e235fcd22d16cd5fa/metrics/metrics.go currently doesn't support that:

  • only has an interceptor for the gRPC client, but not the server
  • hard-codes "csi_sidecar" as subsystem

The subsystem string then appears in metric names like csi_sidecar_operations_seconds_count.

I could imagine that correlating those different counts may be useful, for example to detect when calls have problems at the transport level and don't reach the CSI driver.
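As a rough sketch of that correlation idea: if the subsystem were configurable, the sidecar and the driver could export the same metric under different prefixes, and a per-method comparison would show calls that were counted by the client but never by the driver. Everything below is hypothetical, including the `csi_driver` subsystem, the helper names, and the sample counts; only `csi_sidecar_operations_seconds_count` is a name from this thread.

```go
package main

import (
	"fmt"
	"sort"
)

// metricName joins a subsystem and a base name with "_", which is how
// the example name in this thread appears to be composed.
func metricName(subsystem, base string) string {
	return subsystem + "_" + base
}

// missingCalls reports, per method, how many calls the client side
// counted that the server side did not. Methods with no gap are omitted.
func missingCalls(client, server map[string]int) map[string]int {
	diff := map[string]int{}
	for method, sent := range client {
		if lost := sent - server[method]; lost > 0 {
			diff[method] = lost
		}
	}
	return diff
}

func main() {
	// Sample counts, invented for illustration.
	client := map[string]int{"CreateVolume": 10, "DeleteVolume": 4}
	server := map[string]int{"CreateVolume": 8, "DeleteVolume": 4}

	fmt.Println("client metric:", metricName("csi_sidecar", "operations_seconds_count"))
	// "csi_driver" is a hypothetical subsystem for the server side.
	fmt.Println("server metric:", metricName("csi_driver", "operations_seconds_count"))

	diff := missingCalls(client, server)
	methods := make([]string, 0, len(diff))
	for m := range diff {
		methods = append(methods, m)
	}
	sort.Strings(methods)
	for _, m := range methods {
		fmt.Printf("%s: %d call(s) did not reach the driver\n", m, diff[m])
	}
}
```

In practice such a comparison would more likely be written as a query in the monitoring system than in Go; the code only spells out the arithmetic.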

pohly avatar Jun 19 '20 14:06 pohly

After having read through the config documentation, I believe I understand enough of it to replace or extend the example configuration so that it scrapes each sidecar container individually.

But then the problem remains that admins will have to add that to their Prometheus configuration. I don't see an easy way to do that when deploying through Helm. If I understand it right, one can replace the entire default config, but not add to it.
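To illustrate what "scraping each sidecar container individually" means, here is a small Go sketch that expands one pod into one scrape target per container that exposes a metrics port, instead of a single target per pod. The types, the function name, the port numbers, and the pod IP are all hypothetical; this models the idea only and is not Prometheus configuration.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// sidecar describes one container in the pod.
type sidecar struct {
	name        string
	metricsPort int // 0 means the container exposes no metrics port
}

// scrapeTarget is one endpoint a scraper would poll.
type scrapeTarget struct {
	container string
	url       string
}

// targetsForPod returns one target per container with a metrics port,
// so several sidecars in the same pod are scraped separately.
func targetsForPod(podIP, metricsPath string, containers []sidecar) []scrapeTarget {
	var targets []scrapeTarget
	for _, c := range containers {
		if c.metricsPort == 0 {
			continue
		}
		hostPort := net.JoinHostPort(podIP, strconv.Itoa(c.metricsPort))
		targets = append(targets, scrapeTarget{
			container: c.name,
			url:       "http://" + hostPort + metricsPath,
		})
	}
	return targets
}

func main() {
	// Port numbers and pod IP are invented for illustration.
	pod := []sidecar{
		{name: "external-provisioner", metricsPort: 8081},
		{name: "external-attacher", metricsPort: 8082},
		{name: "external-snapshotter", metricsPort: 8083},
		{name: "external-resizer", metricsPort: 8084},
		{name: "csi-driver", metricsPort: 0},
	}
	for _, t := range targetsForPod("10.0.0.5", "/metrics", pod) {
		fmt.Printf("%-22s %s\n", t.container, t.url)
	}
}
```

Keeping the container name with each target matters because otherwise the same metric name from four sidecars in one pod could not be told apart.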

pohly avatar Jun 19 '20 18:06 pohly

“If I understand it right, one can replace the entire default config, but not add to it.”

That turned out to be wrong. There is some limited support for extending the default configuration.

I found a solution with an additional, generic scrape config and filed https://github.com/helm/charts/issues/22899 to figure out whether that is something that should be supported by the Helm chart out-of-the-box.

pohly avatar Jun 22 '20 10:06 pohly

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Sep 20 '20 11:09 fejta-bot

/remove-lifecycle stale

pohly avatar Sep 21 '20 09:09 pohly

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot avatar Dec 20 '20 10:12 fejta-bot

/remove-lifecycle stale /lifecycle frozen

pohly avatar Dec 20 '20 16:12 pohly

/help

msau42 avatar Aug 05 '22 22:08 msau42

@msau42: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 05 '22 22:08 k8s-ci-robot