sig-storage-lib-external-provisioner
Add dataSource label to PVC provisioning metrics
Follow-up to https://github.com/kubernetes-csi/external-provisioner/pull/792
This PR adds a `source` label to the existing metrics so that the data source of the PVC being provisioned can be determined.
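The gist of the change, as a minimal Go sketch rather than the actual diff: the metric vectors gain a `source` label next to the existing `class` label, and its value is derived from the claim's `spec.dataSource` (empty for a plain PVC).

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	v1 "k8s.io/api/core/v1"
)

// Sketch only: the provision-duration histogram with the new "source" label
// declared alongside the existing "class" label.
var persistentVolumeClaimProvisionDurationSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Subsystem: "controller",
		Name:      "persistentvolumeclaim_provision_duration_seconds",
		Help:      "Latency in seconds to provision persistent volumes. Failed provisioning attempts are ignored. Broken down by storage class name and source of the claim.",
	},
	[]string{"class", "source"},
)

// claimSource derives the value of the new label: empty for a plain PVC,
// otherwise the dataSource kind, e.g. "VolumeSnapshot".
func claimSource(claim *v1.PersistentVolumeClaim) string {
	if claim.Spec.DataSource == nil {
		return ""
	}
	return claim.Spec.DataSource.Kind
}
```

A successful provision would then be recorded with something like `persistentVolumeClaimProvisionDurationSeconds.WithLabelValues(class, claimSource(claim)).Observe(duration.Seconds())`.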
Testing
- Deploy the hostpath driver with an external-provisioner built from these changes, and deploy an nginx pod from which to query the metrics:
```
% kubectl get pod -o wide
NAME                   READY   STATUS    RESTARTS   AGE     IP            NODE              NOMINATED NODE   READINESS GATES
csi-hostpath-socat-0   1/1     Running   0          2d23h   10.244.1.6    csi-prow-worker   <none>           <none>
csi-hostpathplugin-0   8/8     Running   0          7m25s   10.244.1.31   csi-prow-worker   <none>           <none>
nginx                  1/1     Running   0          2d1h    10.244.1.18   csi-prow-worker   <none>           <none>
```
- Create a PVC, then a VolumeSnapshot of that PVC.
```
% kubectl get pvc -A
NAMESPACE   NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
default     csi-pvc   Bound    pvc-7351461a-8bf4-4042-9a2e-228cf5d06624   1Gi        RWO            csi-hostpath-sc   7m21s

% kubectl get volumesnapshot
NAME                READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS            SNAPSHOTCONTENT                                    CREATIONTIME   AGE
new-snapshot-demo   true         csi-pvc                             1Gi           csi-hostpath-snapclass   snapcontent-9e8b8619-eafd-42c7-93ed-0dcf09443a4d   7m37s          7m37s
```
- Create a PVC from the above VolumeSnapshot; this one provisions successfully (see the client-go sketch after the output for the shape of such a claim):
```
% kubectl get pvc -A
NAMESPACE   NAME           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
default     csi-pvc        Bound    pvc-7351461a-8bf4-4042-9a2e-228cf5d06624   1Gi        RWO            csi-hostpath-sc   8m14s
default     hpvc-restore   Bound    pvc-a52a8b54-6aed-4b5c-b9a3-24ec0d848cdd   1Gi        RWO            csi-hostpath-sc   7m54s
```
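For reference, a hypothetical client-go sketch of a claim like `hpvc-restore` above (the test applied the equivalent YAML manifest): `spec.dataSource` names the VolumeSnapshot, and its `Kind` is what surfaces in the new `source` metric label.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// restoreClaim builds a PVC restored from the snapshot created earlier.
func restoreClaim() *v1.PersistentVolumeClaim {
	sc := "csi-hostpath-sc"
	apiGroup := "snapshot.storage.k8s.io"
	return &v1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: "hpvc-restore", Namespace: "default"},
		Spec: v1.PersistentVolumeClaimSpec{
			StorageClassName: &sc,
			AccessModes:      []v1.PersistentVolumeAccessMode{v1.ReadWriteOnce},
			// DataSource.Kind ("VolumeSnapshot") becomes the metric's source label.
			DataSource: &v1.TypedLocalObjectReference{
				APIGroup: &apiGroup,
				Kind:     "VolumeSnapshot",
				Name:     "new-snapshot-demo",
			},
			Resources: v1.ResourceRequirements{
				Requests: v1.ResourceList{v1.ResourceStorage: resource.MustParse("1Gi")},
			},
		},
	}
}

func main() {
	fmt.Println(restoreClaim().Spec.DataSource.Kind) // prints "VolumeSnapshot"
}
```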
- Create another PVC from the above VolumeSnapshot, this one set up to fail provisioning; it stays Pending:
```
% kubectl get pvc -A
NAMESPACE   NAME            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
default     csi-pvc         Bound     pvc-7351461a-8bf4-4042-9a2e-228cf5d06624   1Gi        RWO            csi-hostpath-sc   8m14s
default     hpvc-restor-e   Pending                                                                        csi-hostpath-sc   6m50s
default     hpvc-restore    Bound     pvc-a52a8b54-6aed-4b5c-b9a3-24ec0d848cdd   1Gi        RWO            csi-hostpath-sc   7m54s
```
- Exec into the nginx pod and query the metrics:
```
% kubectl exec -it nginx -- bash
root@nginx:/# curl 10.244.1.31:8080/metrics
# HELP controller_persistentvolumeclaim_provision_duration_seconds Latency in seconds to provision persistent volumes. Failed provisioning attempts are ignored. Broken down by storage class name and source of the claim.
# TYPE controller_persistentvolumeclaim_provision_duration_seconds histogram
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.005"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.01"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.025"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.05"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.1"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.25"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="1"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="2.5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="10"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="+Inf"} 1
controller_persistentvolumeclaim_provision_duration_seconds_sum{class="csi-hostpath-sc",source=""} 0.033238542
controller_persistentvolumeclaim_provision_duration_seconds_count{class="csi-hostpath-sc",source=""} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.005"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.01"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.025"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.05"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.1"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.25"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="1"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="2.5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="10"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="+Inf"} 1
controller_persistentvolumeclaim_provision_duration_seconds_sum{class="csi-hostpath-sc",source="VolumeSnapshot"} 0.022875916
controller_persistentvolumeclaim_provision_duration_seconds_count{class="csi-hostpath-sc",source="VolumeSnapshot"} 1
# HELP controller_persistentvolumeclaim_provision_failed_total Total number of persistent volume provision failed attempts. Broken down by storage class name and source of the claim.
# TYPE controller_persistentvolumeclaim_provision_failed_total counter
controller_persistentvolumeclaim_provision_failed_total{class="csi-hostpath-sc",source="VolumeSnapshot"} 4
# HELP controller_persistentvolumeclaim_provision_total Total number of persistent volumes provisioned succesfully. Broken down by storage class name and source of the claim.
# TYPE controller_persistentvolumeclaim_provision_total counter
controller_persistentvolumeclaim_provision_total{class="csi-hostpath-sc",source=""} 1
controller_persistentvolumeclaim_provision_total{class="csi-hostpath-sc",source="VolumeSnapshot"} 1
...
```
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: RaunakShah
Once this PR has been reviewed and has the lgtm label, please assign jsafrane for approval by writing /assign @jsafrane in a comment. For more information see: The Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/assign @gnufied
Is it OK to add a new label to an existing metric?

> Is it OK to add a new label to an existing metric?
I was unsure too, but until https://github.com/kubernetes-csi/external-provisioner/pull/792 there seem to have been no users of the metrics exposed in this package.
Let's check with sig-instrumentation whether this will break existing metrics.
A Kubernetes blog post says that:
> Stable metrics can be guaranteed to not change, [...] no labels can be added or removed from this metric
It uses registration from github.com/prometheus, which does not allow setting a stability level, so I think it's a stable metric: https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/5fd51cfade8c066f42fc33300c3fe700b0f67cae/controller/metrics/metrics.go#L74-L80
@jsafrane, from the blog:
> When a StabilityLevel is not explicitly set, metrics default to “Alpha” stability

Having said that, there's also a note on deprecating metrics in case a label needs to be added:
> In order to add or remove a label from an existing stable metric, one would have to introduce a new metric and deprecate the stable one
Do we want to consider deprecating the existing metrics and introducing new ones? The bonus is that we could move to the k8s.io/component-base/metrics package, which allows us to set the StabilityLevel and follow the Kubernetes metrics stability framework.
Hi @logicalhan, can you please comment on this? Are these PVC/PV-related metrics alpha or stable?
Yes, deprecating the old metric and introducing a new one is IMO the right way forward.
The stability guarantees that we have for Kubernetes only apply to k/k, since the static analysis tools that make sure metrics don't change are only run there.
You could reuse that same framework here, but that would mean updating all your metrics to be created via the component-base/metrics package and running the stability guarantee checks.
Going back to this PR: since you are not reusing the stability framework here, I would expect all the metrics to be considered Alpha, i.e. without any guarantees. Also, adding a label to a metric isn't a breaking change, so as long as you are mindful of the additional cardinality it brings, this should be pretty safe and shouldn't require going through a deprecation cycle.
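One way to stay mindful of that cardinality (a sketch, not part of this PR): map the dataSource kind through a small allowlist so arbitrary kinds can't mint unbounded label values.

```go
package metrics

// boundedSource caps the label's cardinality: known dataSource kinds pass
// through, anything else collapses into a single catch-all value.
// Hypothetical helper; this PR passes the kind through unmodified.
func boundedSource(kind string) string {
	switch kind {
	case "", "VolumeSnapshot", "PersistentVolumeClaim":
		return kind
	default:
		return "Other"
	}
}
```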
@dgrisonnet thank you for the info!
@jsafrane are we okay moving ahead with this PR then? We could also move to the component-base/metrics package here to introduce the stability guarantees (see the sketch below).
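For reference, a sketch of what that migration could look like, assuming k8s.io/component-base/metrics and its legacyregistry; none of this is in the current PR:

```go
package metrics

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// The same histogram declared through component-base/metrics, which lets us
// state a StabilityLevel explicitly (ALPHA here, matching the default).
var provisionDuration = metrics.NewHistogramVec(
	&metrics.HistogramOpts{
		Subsystem:      "controller",
		Name:           "persistentvolumeclaim_provision_duration_seconds",
		Help:           "Latency in seconds to provision persistent volumes.",
		StabilityLevel: metrics.ALPHA,
	},
	[]string{"class", "source"},
)

func init() {
	legacyregistry.MustRegister(provisionDuration)
}
```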
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jsafrane, RaunakShah
- ~~OWNERS~~ [jsafrane]