sig-storage-lib-external-provisioner icon indicating copy to clipboard operation
sig-storage-lib-external-provisioner copied to clipboard

Add dataSource label to pvcs created

Open RaunakShah opened this issue 3 years ago • 3 comments

Follow up to https://github.com/kubernetes-csi/external-provisioner/pull/792

This PR adds a label to the existing metrics to determine the data source of the PVC being monitored

Testing

  1. Deploy host path driver with external-provisioner using these changes. Deploy an nginx pod to query metrics:
% kubectl get pod -o wide       
NAME                   READY   STATUS    RESTARTS   AGE     IP            NODE              NOMINATED NODE   READINESS GATES
csi-hostpath-socat-0   1/1     Running   0          2d23h   10.244.1.6    csi-prow-worker   <none>           <none>
csi-hostpathplugin-0   8/8     Running   0          7m25s   10.244.1.31   csi-prow-worker   <none>           <none>
nginx                  1/1     Running   0          2d1h    10.244.1.18   csi-prow-worker   <none>           <none>
  1. Create a PVC and VolumeSnapshot of the PVC.
% kubectl get pvc
NAMESPACE   NAME            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
default     csi-pvc         Bound     pvc-7351461a-8bf4-4042-9a2e-228cf5d06624   1Gi        RWO            csi-hostpath-sc   7m21s
% kubectl get volumesnapshot
NAME                READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS            SNAPSHOTCONTENT                                    CREATIONTIME   AGE
new-snapshot-demo   true         csi-pvc                             1Gi           csi-hostpath-snapclass   snapcontent-9e8b8619-eafd-42c7-93ed-0dcf09443a4d   7m37s          7m37s
  1. Create a PVC from above VolumeSnapshot that will succeed.
% kubectl get pvc -A        
NAMESPACE   NAME            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
default     csi-pvc         Bound     pvc-7351461a-8bf4-4042-9a2e-228cf5d06624   1Gi        RWO            csi-hostpath-sc   8m14s
default     hpvc-restore    Bound     pvc-a52a8b54-6aed-4b5c-b9a3-24ec0d848cdd   1Gi        RWO            csi-hostpath-sc   7m54s
  1. Create a PVC from above VolumeSnapshot that will fail.
% kubectl get pvc -A        
NAMESPACE   NAME            STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
default     csi-pvc         Bound     pvc-7351461a-8bf4-4042-9a2e-228cf5d06624   1Gi        RWO            csi-hostpath-sc   8m14s
default     hpvc-restor-e   Pending                                                                        csi-hostpath-sc   6m50s
default     hpvc-restore    Bound     pvc-a52a8b54-6aed-4b5c-b9a3-24ec0d848cdd   1Gi        RWO            csi-hostpath-sc   7m54s
  1. Exec into nginx pod and query metrics
% kubectl exec -it nginx -- bash
root@nginx:/# curl 10.244.1.31:8080/metrics
# HELP controller_persistentvolumeclaim_provision_duration_seconds Latency in seconds to provision persistent volumes. Failed provisioning attempts are ignored. Broken down by storage class name and source of the claim.
# TYPE controller_persistentvolumeclaim_provision_duration_seconds histogram
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.005"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.01"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.025"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.05"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.1"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.25"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="0.5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="1"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="2.5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="10"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="",le="+Inf"} 1
controller_persistentvolumeclaim_provision_duration_seconds_sum{class="csi-hostpath-sc",source=""} 0.033238542
controller_persistentvolumeclaim_provision_duration_seconds_count{class="csi-hostpath-sc",source=""} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.005"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.01"} 0
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.025"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.05"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.1"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.25"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="0.5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="1"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="2.5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="5"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="10"} 1
controller_persistentvolumeclaim_provision_duration_seconds_bucket{class="csi-hostpath-sc",source="VolumeSnapshot",le="+Inf"} 1
controller_persistentvolumeclaim_provision_duration_seconds_sum{class="csi-hostpath-sc",source="VolumeSnapshot"} 0.022875916
controller_persistentvolumeclaim_provision_duration_seconds_count{class="csi-hostpath-sc",source="VolumeSnapshot"} 1
# HELP controller_persistentvolumeclaim_provision_failed_total Total number of persistent volume provision failed attempts. Broken down by storage class name and source of the claim.
# TYPE controller_persistentvolumeclaim_provision_failed_total counter
controller_persistentvolumeclaim_provision_failed_total{class="csi-hostpath-sc",source="VolumeSnapshot"} 4
# HELP controller_persistentvolumeclaim_provision_total Total number of persistent volumes provisioned succesfully. Broken down by storage class name and source of the claim.
# TYPE controller_persistentvolumeclaim_provision_total counter
controller_persistentvolumeclaim_provision_total{class="csi-hostpath-sc",source=""} 1
controller_persistentvolumeclaim_provision_total{class="csi-hostpath-sc",source="VolumeSnapshot"} 1
...

RaunakShah avatar Sep 30 '22 06:09 RaunakShah

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: RaunakShah Once this PR has been reviewed and has the lgtm label, please assign jsafrane for approval by writing /assign @jsafrane in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Sep 30 '22 06:09 k8s-ci-robot

/assign @gnufied Is it OK to add a new label to an existing metric?

jsafrane avatar Sep 30 '22 13:09 jsafrane

Is it OK to add a new label to an existing metric?

I was unsure too but until https://github.com/kubernetes-csi/external-provisioner/pull/792 it seems like there was no user of the metrics exposed in this package

RaunakShah avatar Oct 03 '22 06:10 RaunakShah

Let's check with sig-instrumentation whether this will break existing metric.

xing-yang avatar Nov 30 '22 18:11 xing-yang

a Kubernetes blog says that:

Stable metrics can be guaranteed to not change, [...] no labels can be added or removed from this metric

It uses registration from github.com/prometheus, which does not allow setting of stability level, so I think it's a stable metric: https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/5fd51cfade8c066f42fc33300c3fe700b0f67cae/controller/metrics/metrics.go#L74-L80

jsafrane avatar Dec 01 '22 12:12 jsafrane

@jsafrane from the blog -

When a StabilityLevel is not explicitly set, metrics default to “Alpha” stability

Having said that, there's also a note on deprecating metrics in case a label needs to be added:

In order to add or remove a label from an existing stable metric, one would have to introduce a new metric and deprecate the stable one

Do we want to consider deprecating the existing metrics and introducing new ones? The bonus is that we can potentially move to the kubeMetrics package which allows us to set the StabilityLevel and follow the Kubernetes metrics stability framework.

RaunakShah avatar Dec 05 '22 08:12 RaunakShah

Hi @logicalhan Can you please comment on this? Are these PVC/PV related metrics alpha or stable?

xing-yang avatar Dec 05 '22 17:12 xing-yang

Yes, deprecating the old metric + introducing a new one is IMO the right way forward.

jsafrane avatar Dec 06 '22 12:12 jsafrane

The stability guarantees that we have for Kubernetes only apply to k/k since the static analysis tools that make sure metrics don't change are only run there.

You could reuse that same framework here, but that would mean updating all your metrics to be created by component-base/metrics package and run the stability guarantee checks.

Going back to this PR, since you are not reusing the stability framework here, I would expect all the metrics to be considered Alpha so without any guarantees. Plus adding a label to a metric isn't a breaking change, so as long as you are mindful of the additional cardinality that it brings, it should be pretty safe to do and shouldn't require to go through a deprecation cycle.

dgrisonnet avatar Dec 16 '22 14:12 dgrisonnet

@dgrisonnet thank you for the info! @jsafrane are we okay moving ahead with this PR then? We can also move to the components-base/metrics package here to introduce the stability guarantees

RaunakShah avatar Dec 23 '22 05:12 RaunakShah

/lgtm /approve

jsafrane avatar Jan 02 '23 16:01 jsafrane

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jsafrane, RaunakShah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jan 02 '23 16:01 k8s-ci-robot