
ReplicationSource cacheCapacity space available metric

Open reefland opened this issue 11 months ago • 2 comments

Describe the feature you'd like to have. I was trying to have my existing PrometheusRules alert me if the volsync cache PVCs are not sized appropriately.

Then I noticed that the PVCs created by volsync are not visible within Prometheus. Perhaps this is just how volsync uses PVCs, and kubelet can't gather metrics on PVCs that are not actively mounted.

What is the value to the end user? (why is it a priority?) The docs state: "This volume contains cached metadata from the backup repository. It must be large enough to hold the non-pruned repository metadata."

  • I do not know how much space is being used by Restic metadata or how that changes over time
  • I would like to bump up the cache size before the volume fills up and volsync backups are impacted

How will we know we have a good solution? (acceptance criteria) I'm going to assume that volsync does not normally mount cache PVCs (and thus kubelet can't report on them). If this is true, would it be possible for volsync to emit its own metric with the cache capacity when a trigger.schedule event happens? Perhaps percent free? Something like volsync_cache_capacity_available.

Maybe "-1" if unknown (no event triggered yet), otherwise a number between 0 and 100 as the percentage of capacity left.

Then I can have an alert like:

- alert: VolSyncCacheVolumeCapacityLow
  annotations:
    summary: >-
      {{ $labels.obj_namespace }}/{{ $labels.obj_name }} cache volume space is almost full.
      Increase size of cacheCapacity value.
    description: >-
      {{ $labels.obj_namespace }}/{{ $labels.obj_name }} cache volume space is < 15%.
      VALUE = {{ $value }}
  expr: |
    volsync_cache_capacity_available > -1 and volsync_cache_capacity_available < 15
  for: 15m
  labels:
    severity: critical

reefland avatar Mar 07 '24 18:03 reefland

The VolSync controller doesn't mount the restic cache PVC itself. It is, however, mounted in the mover pod from the job that runs during a sync. Can you see stats while the mover job is running?

As such, I'm not sure we want to try to capture this usage data and have it sent back to the controller to emit as events.

Depending on your CSI driver, maybe it's possible to get some stats via volume health monitoring? https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1432-volume-health-monitor#kubelet-metrics-changes

I've never looked into this myself, but it looks like VolumeUsage could potentially be reported.

tesshuflower avatar Mar 07 '24 21:03 tesshuflower

The kubelet_volume_stats_* series of metrics contains the data I want, such as used_bytes or capacity_bytes, but none of the PVCs created by volsync are listed. Perhaps the mover pods have the cache volume mounted so briefly that the mount never coincides with kubelet fetching the data?

kube_persistentvolume_capacity_bytes does include PVCs created by volsync, but only the total capacity of the volume is available. The kube_persistentvolume_* series of metrics does not contain any usage information.
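
For comparison, if the kubelet_volume_stats_* series did appear for the cache PVC while a mover pod has it mounted, a rule along these lines could compute percent free without any new volsync metric. This is only a sketch: the PVC name pattern `volsync-.*-cache` is an assumption about how the cache PVCs are named, and there is deliberately no `for:` clause because the mover pod may only be running for a short time.

```yaml
- alert: VolSyncCacheVolumeCapacityLow
  annotations:
    summary: >-
      {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} cache volume space is almost full.
    description: >-
      {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} has less than 15% free.
      VALUE = {{ $value }}
  # Percent free = available bytes / capacity bytes, as reported by kubelet.
  # The PVC name regex is an assumption; adjust it to match the actual cache PVC names.
  expr: |
    100 * kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"volsync-.*-cache"}
      / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"volsync-.*-cache"}
      < 15
  labels:
    severity: critical
```

Whether this ever fires depends on kubelet actually collecting stats during the short window in which the mover pod is running, which is exactly the part that does not seem to be happening here.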

I was unable to locate anything about "VolumeUsage" other than above.

reefland avatar Mar 08 '24 00:03 reefland