
[Feature Request] Volume Autoscaling (Resize EBS Disk Size)

Open andrii29 opened this issue 9 months ago • 7 comments

Is your feature request related to a problem? Please describe.
AWS EBS volumes support increasing disk size, but today this has to be done manually; the process is not automated. When the final storage requirements of an application are uncertain, it's preferable to allocate less storage initially and scale up as needed.

Describe the solution you'd like in detail
There are existing Kubelet metrics that can be used to calculate disk usage:

(kubelet_volume_stats_capacity_bytes - kubelet_volume_stats_available_bytes) / kubelet_volume_stats_capacity_bytes

These metrics can be retrieved from the kube-prometheus-stack by providing a Prometheus URL, or directly from the Kubelet via the ebs-csi-node DaemonSet. Based on these metrics, storage autoscaling can be implemented: for example, when storage usage exceeds 75% (a configurable threshold), the CSI driver would automatically increase the disk size by 20%.
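For illustration only, here is a rough sketch of what such a check could look like today if done outside the driver; the Prometheus URL, namespace, and PVC name are placeholders, and the 75%/20% numbers are just the configurable defaults proposed above.

# Placeholder Prometheus endpoint and PVC selector
PROM_URL="http://prometheus.monitoring.svc:9090"
SELECTOR='namespace="default",persistentvolumeclaim="data-my-app-0"'
QUERY="(kubelet_volume_stats_capacity_bytes{$SELECTOR} - kubelet_volume_stats_available_bytes{$SELECTOR}) / kubelet_volume_stats_capacity_bytes{$SELECTOR}"
# Instant query against the Prometheus HTTP API, extract the usage ratio
USAGE=$(curl -s --get "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" | jq -r '.data.result[0].value[1]')
# 0.75 is the configurable threshold from the proposal
if awk -v u="$USAGE" 'BEGIN { exit !(u > 0.75) }'; then
  echo "usage above 75%, grow the volume by 20%"
fi

A small controller or CronJob could run a loop like this per PVC and patch the storage request whenever the threshold is crossed.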

Describe alternatives you've considered
Create an alert for PV disk usage and manually update the PVC.
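Roughly, that manual step is just a PVC patch (name, namespace, and size below are placeholders), which the EBS CSI driver then turns into a volume expansion and an online filesystem resize, provided the StorageClass has allowVolumeExpansion: true:

# Placeholder PVC name/namespace/size; requires allowVolumeExpansion: true
# on the StorageClass so the driver can expand the volume online.
kubectl patch pvc data-my-app-0 -n default --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"120Gi"}}}}'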

Additional context
In dynamic cloud environments, supporting storage autoscaling would help optimize resource allocation and reduce costs by avoiding unnecessary over-provisioning.

andrii29 commented Feb 21 '25

/retitle [Feature Request] Volume Autoscaling (Resize EBS Disk Size)
/kind feature

torredil commented Feb 21 '25

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented May 22 '25

/remove-lifecycle stale

andrii29 commented May 23 '25

I came across this while looking up similar use cases: https://repost.aws/questions/QUf-VrsN50QkCTSiy97NHARg/need-help-automating-ebs-volume-scaling-at-80-capacity. The approach there lets you alarm on volume utilization and then automates scaling using Elastic Volumes (EV). Would this solve your use case? Does it present any challenges?

Can you share more about your use case and scaling pattern?
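For reference, the Elastic Volumes resize that the linked post automates boils down to a single API call (volume ID and target size below are placeholders); note that it acts on the EBS volume directly, so the PVC/PV objects and the filesystem still have to be updated separately:

# Placeholder volume ID and size (GiB); this modifies the EBS volume only.
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 150
# Track the modification until it reaches the optimizing/completed state.
aws ec2 describe-volumes-modifications --volume-ids vol-0123456789abcdef0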

ksliu58 commented Jun 05 '25

While this can be used to resize volumes, it requires multiple additional services and permissions, as well as installing the CloudWatch agent (since disk usage metrics are not available in default monitoring). Meanwhile, the EBS CSI driver already runs on every node and is capable of resizing disks and filesystems on its own.

Regarding scaling: we deployed an application that requires a persistent volume, but we don't know the exact volume size in advance. So, we provision a smaller volume and resize it multiple times until the required storage size is known.

andrii29 commented Jun 06 '25

It makes sense to have everything in one place. The metric you listed in the original post is vended by the EBS CSI driver, and you should be able to scrape it via a Prometheus endpoint to set up alarms without specifically requiring CloudWatch.

For scaling, will you know the estimated maximum volume size in advance? For automated scaling, I imagine you will still need an upper bound to avoid incurring cost from runaway scripts consuming excessive storage by mistake. What type of application are you deploying?

ksliu58 commented Jun 10 '25

We have configured an alert on this metric, which triggers when used disk space exceeds 80% of total capacity. However, resizing still requires manually increasing the PVC storage request.

Currently, we are using PersistentVolumes for Prometheus (metrics storage) and VictoriaLogs (logs storage). As incoming traffic to the newly created cluster grows, so does the demand for larger disks.

At the moment, we do not have an upper bound. The same metric could also be used to detect underutilized volumes with excessive free space.

It would be nice if the upper bound and disk increase percentage could be configured via annotations, and the EBS CSI driver could use these settings to apply automatic storage scaling.
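Purely as an illustration — these annotation keys do not exist in the driver today and are just hypothetical names for what such per-PVC settings could look like:

# Hypothetical annotation keys, named here only for illustration:
kubectl annotate pvc prometheus-server -n monitoring \
  ebs.csi.aws.com/autoscaler-threshold="80%" \
  ebs.csi.aws.com/autoscaler-increase="20%" \
  ebs.csi.aws.com/autoscaler-max-size="500Gi"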

andrii29 commented Jun 10 '25

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented Sep 08 '25

/remove-lifecycle stale

andrii29 commented Sep 09 '25