Support for updating disks
**Is your feature request related to a problem? Please describe.**
When you want to update e.g. the IOPS or MBps settings of a disk, BOSH creates a new disk with the new properties and copies the data from the old disk to the new one. This can take a long time when multiple TiB of data are stored on the disk. Most cloud providers support updating an existing disk in place, so it may not be necessary to migrate the data to a completely new disk.
One example, as stated above, is changing the IOPS and throughput values of a disk (supported by e.g. AWS and Azure): https://github.com/cloudfoundry/bosh-aws-cpi-release/issues/137
Another example is migrating an Azure Standard HDD disk to Premium, which is directly supported by Azure, or changing from GP2 to GP3 on AWS.
**Describe the solution you'd like**
Create a new `update_disk_iaas_specific` method in this block: https://github.com/cloudfoundry/bosh/blob/be3df0993c10d0ae2de05a650df9a52ec08c05f0/src/bosh-director/lib/bosh/director/disk_manager.rb#L35
`update_disk_iaas_specific` would try to call `update_disk` in the cloud-specific CPI (e.g. the BOSH Azure CPI). If `update_disk` exists, the CPI would check whether the update is possible and raise `Bosh::Clouds::NotSupported` if the change is not possible without a migration.
In addition, a new configuration parameter similar to `enable_cpi_resize_disk` would enable this new behaviour; the default would be disabled.
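To make the shape concrete, here is a rough, hypothetical sketch of what such director-side logic could look like. The method name, the `enable_cpi_update_disk` flag, and the helper calls are assumptions for illustration, not the actual implementation:

```ruby
# Hypothetical sketch only; update_disk_iaas_specific,
# Config.enable_cpi_update_disk, and the helpers are assumptions.
def update_disk_iaas_specific(disk, new_size, new_cloud_properties)
  raise Bosh::Clouds::NotSupported, 'CPI disk update disabled' unless
    Config.enable_cpi_update_disk

  cloud = cloud_factory.get(disk.cpi)
  # The CPI is expected to raise Bosh::Clouds::NotSupported when the
  # requested change cannot be applied without migrating the data.
  cloud.update_disk(disk.disk_cid, new_size, new_cloud_properties)

  # On success the disk keeps its CID; only the stored metadata changes.
  disk.update(size: new_size, cloud_properties: new_cloud_properties)
end
```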
**Describe alternatives you've considered**
An alternative is to manually update the disks to the new parameters and somehow modify the BOSH state in the database, but that is very error-prone.
**Additional context**
I am happy to create a draft PR for this feature in the BOSH director and to implement it in the BOSH Azure CPI.
Sorry for the delayed response.
Yeah, I can't think of any other ways of making this work. It's unfortunately a change with several touch points, although I haven't looked at them too closely yet.
- Director will need the new functionality, and will have to figure out if it's something the CPI can do (see the sketch after this list). Hopefully it's mostly just copy/paste from the resize_disk logic though.
- Agent might need to be involved in the process, possibly unmounting the disk before the changes can happen. But I'd expect that functionality to already exist unless it's somehow tightly coupled to "resize_disk".
- Then of course any CPIs that want to support it will need to be updated.
At least none of the pieces seem like they'll need major changes to support this.
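Following the resize_disk precedent, the "can the CPI do it" check could be as simple as call-and-rescue with a fallback to the existing migration path. A speculative sketch; `migrate_to_new_disk` is a made-up placeholder for the current create-and-copy flow:

```ruby
# Speculative sketch: try the in-place update first, and fall back to
# the classic create-new-disk-and-migrate flow if the CPI cannot do it.
begin
  cloud.update_disk(disk.disk_cid, new_size, new_cloud_properties)
rescue Bosh::Clouds::NotImplemented, Bosh::Clouds::NotSupported
  migrate_to_new_disk(disk, new_size, new_cloud_properties) # hypothetical helper
end
```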
@jpalermo I started to work on this a few weeks ago and am planning to test it in the next couple of weeks. The Azure feature itself is not yet available, so there is no urgency at the moment. But if you could take a look at whether what I have done so far looks good, that would be highly appreciated :)
https://github.com/cloudfoundry/bosh/compare/main...MSSedusch:bosh:update_disk?expand=1
@MSSedusch did you manage to test the feature?
Not yet - we are also waiting for a new feature on the Microsoft side; at least for Pv1 -> Pv2 we need that feature.
I have picked up the work @MSSedusch started and completed the implementation of the new `update_disk` method for the director. This method has been added to the bosh-cpi-ruby gem and implemented in the bosh-azure-cpi (a rough sketch of its shape follows the list below). There are several pull requests associated with this work item awaiting review:
- https://github.com/cloudfoundry/bosh/pull/2555
- https://github.com/cloudfoundry/bosh-cpi-ruby/pull/11
- https://github.com/cloudfoundry/bosh-azure-cpi-release/pull/697
- https://github.com/cloudfoundry/docs-bosh/pull/850
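For reference, the new CPI method has roughly the following shape. This is a simplified, hypothetical sketch rather than the actual bosh-azure-cpi code; `in_place_update_possible?` and `apply_disk_update` are placeholder helpers standing in for the CPI's validation and update logic:

```ruby
# Simplified sketch of the new CPI method; the real implementation in
# bosh-azure-cpi is more involved.
def update_disk(disk_cid, new_size, cloud_properties)
  unless in_place_update_possible?(disk_cid, new_size, cloud_properties)
    # Tells the director to fall back to disk migration.
    raise Bosh::Clouds::NotSupported,
          "Cannot update disk '#{disk_cid}' in place"
  end
  # Apply the new size/SKU/IOPS/MBps settings to the existing disk.
  apply_disk_update(disk_cid, new_size, cloud_properties)
end
```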
I managed to conduct a few manual tests on a BOSH Director deployed on Azure, built from a development release incorporating my changes. I tested using a zookeeper deployment, modifying disk size and properties, and found that the upgrade scenarios were successful with the IaaS-native disk update feature. Importantly, in all these scenarios there was no `create_disk` call; instead, the existing disk was updated.
- Scenario 1: Change disk type from `Standard_LRS` to `Premium_LRS`.
  `request.body`: `PATCH {"sku":{"name":"Premium_LRS"}}`
- Scenario 2: Change disk type from `Premium_LRS` to `PremiumV2_LRS` and set cloud properties `iops: 4000`, `mbps: 200`.
  `request.body`: `PATCH {"properties":{"diskIOPSReadWrite":4000,"diskMBpsReadWrite":200},"sku":{"name":"PremiumV2_LRS"}}`
- Scenario 3: With disk type `PremiumV2_LRS`, change the cloud properties to `iops: 5000`, `mbps: 300` and the disk size to 12 GiB.
  `request.body`: `PATCH {"properties":{"diskSizeGB":12,"diskIOPSReadWrite":5000,"diskMBpsReadWrite":300},"sku":{"name":"PremiumV2_LRS"}}`
I will be out of office for the next 10 days, so responses to any review might be delayed. I look forward to your feedback and will address it as soon as possible.
Looks like this has been done in the meantime?
Not yet. The PR https://github.com/cloudfoundry/bosh/pull/2555 has been reverted with https://github.com/cloudfoundry/bosh/pull/2570 because of issues described in https://github.com/cloudfoundry/bosh/pull/2555#issuecomment-2377993796. @jpalermo, @a-hassanin, @s4heid did something happen here in the meantime to bring the reverted change back?
@jpalermo you wanted to check whether the tests only needed a quick fix; did you manage to have a look?
I picked up the work again and will provide a fix for the failing integration tests soon.
@jpalermo @a-hassanin here's the fix for the integration tests: https://github.com/cloudfoundry/bosh/pull/2581
Apologies for any inconvenience caused. Setting up my workstation for the tests took much longer than anticipated.
@s4heid - no worries. The setup is such a pain that we mostly rely on CI. I wish there was better tooling to get started.
I'll take a look at the new PR this week.
I just had a chance to test the latest bosh release, which incorporates the updates, and can confirm that the native disk update works as anticipated on Azure. By activating the `director.enable_cpi_update_disk` property, bosh successfully reattached the existing disk (maintaining the same CID) to the VM without creating a new one.
**`enable_cpi_update_disk: true`**

```
$ bosh -e bosh task 20 --debug | bosh-ext debug-task -
Director Version  Task ID  Lines  Unknown Lines
280.1.10          20       1246   0

1 details

Started at                    Ended at                      Duration      Name
Fri Nov 22 15:10:19 UTC 2024  Fri Nov 22 15:10:19 UTC 2024  19µs          0x16f44
Fri Nov 22 15:10:20 UTC 2024  Fri Nov 22 15:12:37 UTC 2024  2m17.320049s  -
~                             Fri Nov 22 15:12:37 UTC 2024  2m17.318559s  task:20
Fri Nov 22 15:10:29 UTC 2024  Fri Nov 22 15:10:29 UTC 2024  28.338ms      binding agent state for (zookeeper/54cdcdc2-8494-4ad3-9283-034a5047766b (0)
~                             Fri Nov 22 15:10:29 UTC 2024  25.203ms      binding agent state for (zookeeper/76b1b14e-cd1f-47f5-8687-3de160ae6ceb (1)
Fri Nov 22 15:10:31 UTC 2024  Fri Nov 22 15:11:36 UTC 2024  1m5.036647s   canary_update(zookeeper/54cdcdc2-8494-4ad3-9283-034a5047766b (0))
Fri Nov 22 15:10:50 UTC 2024  Fri Nov 22 15:12:20 UTC 2024  1m30.015167s  task:20-checkpoint
Fri Nov 22 15:11:36 UTC 2024  Fri Nov 22 15:12:37 UTC 2024  1m1.215318s   instance_update(zookeeper/76b1b14e-cd1f-47f5-8687-3de160ae6ceb (1))
```
No `create_disk` calls in the CPI log:

```
$ bosh -e bosh task 20 --cpi | grep create_disk
```
**`enable_cpi_update_disk: false`**

```
$ bosh -e bosh task 28 --debug | bosh-ext debug-task -
Director Version  Task ID  Lines  Unknown Lines
280.1.10          28       1292   0

1 details

Started at                    Ended at                      Duration      Name
Fri Nov 22 16:39:25 UTC 2024  Fri Nov 22 16:39:25 UTC 2024  18µs          0x4aec
~                             Fri Nov 22 16:41:47 UTC 2024  2m21.423272s  -
~                             Fri Nov 22 16:41:47 UTC 2024  2m21.421777s  task:28
Fri Nov 22 16:39:34 UTC 2024  Fri Nov 22 16:39:34 UTC 2024  15.959ms      binding agent state for (zookeeper/76b1b14e-cd1f-47f5-8687-3de160ae6ceb (1)
~                             Fri Nov 22 16:39:34 UTC 2024  13.54ms       binding agent state for (zookeeper/54cdcdc2-8494-4ad3-9283-034a5047766b (0)
Fri Nov 22 16:39:36 UTC 2024  Fri Nov 22 16:40:48 UTC 2024  1m11.43286s   canary_update(zookeeper/54cdcdc2-8494-4ad3-9283-034a5047766b (0))
Fri Nov 22 16:39:55 UTC 2024  Fri Nov 22 16:41:25 UTC 2024  1m30.01514s   task:28-checkpoint
Fri Nov 22 16:40:48 UTC 2024  Fri Nov 22 16:41:46 UTC 2024  58.793101s    instance_update(zookeeper/76b1b14e-cd1f-47f5-8687-3de160ae6ceb (1))
```
`create_disk` was executed twice for the two instances in the test deployment:

```
$ bosh -e bosh task 28 --cpi | grep create_disk
I, [2024-11-22T16:39:43.168050 #23949] INFO -- [req_id cpi-557985]: Starting create_disk...
I, [2024-11-22T16:39:44.766623 #23949 #2400] INFO -- [req_id cpi-557985]: create_disk(caching:None;disk_name:bosh-disk-data-3e7a5c9d-cdda-4239-8015-97da81ec3448;resource_group_name:rg-sh-iu, eastus, 32, Standard_LRS, 1)
I, [2024-11-22T16:39:47.263708 #23949] INFO -- [req_id cpi-557985]: Finished create_disk in 4.1 seconds
I, [2024-11-22T16:40:53.855216 #24038] INFO -- [req_id cpi-780706]: Starting create_disk...
I, [2024-11-22T16:40:55.463145 #24038 #2400] INFO -- [req_id cpi-780706]: create_disk(caching:None;disk_name:bosh-disk-data-5ce81810-a397-4f17-b647-3cc40cef4349;resource_group_name:rg-sh-iu, eastus, 32, Standard_LRS, 2)
I, [2024-11-22T16:40:58.120556 #24038] INFO -- [req_id cpi-780706]: Finished create_disk in 4.27 seconds
```
Looking at the numbers, you might wonder why the deploy times are so similar. This is because the disk creation itself only took about 4 seconds; the time-consuming part, data transfer, wasn't exercised in my basic testing, as the disks in my test environment were essentially empty.
It's important to mention for Azure users that while the disk remains accessible during the update, performance may temporarily decrease due to the ongoing background process.
I believe we can consider this issue resolved, @MSSedusch.
Thanks @s4heid for the contribution!