
Support for updating disks

Open MSSedusch opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe.
When you want to update, e.g., the IOPS or MBps settings of a disk, BOSH creates a new disk with the new properties and copies the data from the old disk to the new one. This can take a long time when multiple TiB of data are stored on the disk. Most cloud providers support updating an existing disk in place, so it might not be necessary to migrate the data to a completely new disk.

One example, as stated above, is changing the IOPS and throughput values of a disk (supported by e.g. AWS and Azure): https://github.com/cloudfoundry/bosh-aws-cpi-release/issues/137

Another example is migrating an Azure Standard HDD disk to Premium, which Azure supports directly, or changing from gp2 to gp3 on AWS.

Describe the solution you'd like
Create a new update_disk_iaas_specific method in this block: https://github.com/cloudfoundry/bosh/blob/be3df0993c10d0ae2de05a650df9a52ec08c05f0/src/bosh-director/lib/bosh/director/disk_manager.rb#L35

update_disk_iaas_specific would try to call update_disk in the cloud-specific CPI (e.g. the BOSH Azure CPI). If update_disk exists, the CPI would check whether the update is possible and raise Bosh::Clouds::NotSupported if the change is not possible without a migration.

In addition, a new configuration parameter similar to enable_cpi_resize_disk would enable this new behaviour; the default would be disabled.
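The proposed control flow can be sketched roughly as follows. This is an illustrative sketch only: apart from update_disk and Bosh::Clouds::NotSupported, the class and method names are assumptions, not the actual BOSH director internals.

```ruby
# Sketch of the proposed director-side flow (illustrative names, not
# the real bosh-director DiskManager implementation).
module Bosh
  module Clouds
    class NotSupported < StandardError; end
  end
end

class DiskManagerSketch
  def initialize(cloud, enable_cpi_update_disk: false)
    @cloud = cloud
    @enable_cpi_update_disk = enable_cpi_update_disk
  end

  # Try an in-place update via the CPI; fall back to the classic
  # create-new-disk-and-migrate path when that is not possible.
  def update_disk(disk_cid, size, cloud_properties)
    raise Bosh::Clouds::NotSupported unless @enable_cpi_update_disk
    raise Bosh::Clouds::NotSupported unless @cloud.respond_to?(:update_disk)

    @cloud.update_disk(disk_cid, size, cloud_properties)
    :updated_in_place
  rescue Bosh::Clouds::NotSupported
    migrate_to_new_disk(disk_cid, size, cloud_properties)
  end

  private

  def migrate_to_new_disk(_disk_cid, _size, _cloud_properties)
    # Classic path: create_disk, attach, copy data, detach, delete_disk.
    :migrated
  end
end
```

The key design point is that the CPI, not the director, decides whether a given change is possible in place; the director only needs the fallback.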

Describe alternatives you've considered
An alternative is to manually update the disks to the new parameters and somehow modify the BOSH state in the database, but that is very error-prone.

Additional context

I am happy to create a draft PR for this feature in the BOSH director and to implement it in the BOSH Azure CPI.

MSSedusch avatar Feb 07 '24 07:02 MSSedusch

Sorry for the delayed response.

Yeah, I can't think of any other ways of making this work. It's unfortunately a change with several touch points, although I haven't looked at them too closely yet.

  • Director will need the new functionality, and will have to figure out if it's something the CPI can do. Hopefully it's mostly just copy/paste from the resize disk though.
  • Agent might need to be involved in the process, possibly unmounting the disk before the changes can happen. But I'd expect that functionality to already exist unless it's somehow tightly coupled to "resize_disk"
  • Then of course any CPIs that want to support it will need to be updated.

At least none of the pieces seem like they'll need major changes to support this.
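To illustrate the third touch point, a CPI opting in could expose update_disk and signal unsupported transitions back to the director. The names and the SKU support matrix below are assumptions for the sketch, not the real bosh-azure-cpi code.

```ruby
# Hypothetical CPI-side contract: implement update_disk and raise
# Bosh::Clouds::NotSupported when the change cannot be applied in place,
# so the director can fall back to the migration path.
module Bosh
  module Clouds
    class NotSupported < StandardError; end
  end
end

class ExampleAzureCpi
  # Disk type changes assumed updatable in place for this sketch; the
  # real support matrix is defined by the IaaS, not by this table.
  IN_PLACE_SKU_CHANGES = {
    'Standard_LRS' => %w[Premium_LRS],
    'Premium_LRS'  => %w[PremiumV2_LRS],
  }.freeze

  def update_disk(disk_cid, new_sku, current_sku)
    allowed = IN_PLACE_SKU_CHANGES.fetch(current_sku, [])
    unless new_sku == current_sku || allowed.include?(new_sku)
      raise Bosh::Clouds::NotSupported,
            "cannot update #{disk_cid} from #{current_sku} to #{new_sku} in place"
    end
    # ...issue the update request against the IaaS disk resource here...
    :updated
  end
end
```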

jpalermo avatar Mar 18 '24 02:03 jpalermo

@jpalermo I started to work on this a few weeks ago, planning to test it in the next couple of weeks. The Azure feature itself is not yet available so there is no urgency at the moment. But if you could have a look if what I have done so far looks good, that would be highly appreciated :)

https://github.com/cloudfoundry/bosh/compare/main...MSSedusch:bosh:update_disk?expand=1

MSSedusch avatar Mar 18 '24 08:03 MSSedusch

@MSSedusch did you manage to test the feature?

stefanator12 avatar May 15 '24 10:05 stefanator12

Not yet - we are also waiting for a new feature on the Microsoft side; at least for Pv1 -> Pv2 we need that feature.

MSSedusch avatar May 15 '24 11:05 MSSedusch

I have picked up the work @MSSedusch started and completed the implementation of the new update_disk method for the director. This method has been added to the bosh-cpi-ruby gem and implemented in the bosh-azure-cpi. There are several pull requests associated with this work item awaiting review:

  • https://github.com/cloudfoundry/bosh/pull/2555
  • https://github.com/cloudfoundry/bosh-cpi-ruby/pull/11
  • https://github.com/cloudfoundry/bosh-azure-cpi-release/pull/697
  • https://github.com/cloudfoundry/docs-bosh/pull/850

I managed to conduct a few manual tests on a BOSH Director deployed on Azure, built from a development release incorporating my changes. I tested using a zookeeper deployment, modifying disk size and properties, and found that the upgrade scenarios were successful with the IaaS-native disk update feature. Importantly, in all these scenarios there was no create_disk call; instead, the existing disk was updated.

  • Scenario 1: Change Disk Type from Standard_LRS to Premium_LRS

    request.body:

    PATCH {"sku":{"name":"Premium_LRS"}}
    
  • Scenario 2: Change Disk Type from Premium_LRS to PremiumV2_LRS and set cloud properties: iops: 4000, mbps: 200

    request.body:

    PATCH {"properties":{"diskIOPSReadWrite":4000,"diskMBpsReadWrite":200},"sku":{"name":"PremiumV2_LRS"}}
    
  • Scenario 3: With Disk Type PremiumV2_LRS change the cloud properties to iops: 5000, mbps: 300 and disk size to 12 GiB.

    request.body:

    PATCH {"properties":{"diskSizeGB":12,"diskIOPSReadWrite":5000,"diskMBpsReadWrite":300},"sku":{"name":"PremiumV2_LRS"}}
    
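The PATCH bodies above follow directly from the requested SKU, size, and performance settings. A hypothetical helper that assembles them (illustrative only, not the actual bosh-azure-cpi code) could look like:

```ruby
require 'json'

# Build the Azure disk-update PATCH body from the requested settings.
# Keys follow the bodies observed in the scenarios above.
def azure_disk_patch_body(sku: nil, size_gb: nil, iops: nil, mbps: nil)
  properties = {}
  properties['diskSizeGB'] = size_gb if size_gb
  properties['diskIOPSReadWrite'] = iops if iops
  properties['diskMBpsReadWrite'] = mbps if mbps

  body = {}
  body['properties'] = properties unless properties.empty?
  body['sku'] = { 'name' => sku } if sku
  body
end

# Scenario 1: Standard_LRS -> Premium_LRS
puts JSON.generate(azure_disk_patch_body(sku: 'Premium_LRS'))
# {"sku":{"name":"Premium_LRS"}}

# Scenario 3: resize plus performance change on PremiumV2_LRS
puts JSON.generate(
  azure_disk_patch_body(sku: 'PremiumV2_LRS', size_gb: 12, iops: 5000, mbps: 300)
)
```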

I will be out of office for the next 10 days. During this period, responses to any review might be delayed. However, I eagerly anticipate your feedback and will address it as soon as possible.

s4heid avatar Aug 27 '24 14:08 s4heid

Looks like this has been done in the meantime?

jsievers avatar Oct 29 '24 13:10 jsievers

Not yet. The PR https://github.com/cloudfoundry/bosh/pull/2555 has been reverted with https://github.com/cloudfoundry/bosh/pull/2570 because of the issues described in https://github.com/cloudfoundry/bosh/pull/2555#issuecomment-2377993796. @jpalermo, @a-hassanin, @s4heid, has anything happened here in the meantime to bring the reverted change back?

beyhan avatar Oct 29 '24 13:10 beyhan

> Not yet. The PR #2555 has been reverted with #2570 because of issues described in #2555 (comment). @jpalermo, @a-hassanin, @s4heid did something happened here in the meantime to bring the reverted change back?

@jpalermo wanted to check if the tests only needed a quick fix, did you manage to have a look?

a-hassanin avatar Oct 31 '24 07:10 a-hassanin

I picked up the work again and will provide a fix for the failing integration tests soon.

s4heid avatar Oct 31 '24 13:10 s4heid

@jpalermo @a-hassanin here's the fix for the integration tests: https://github.com/cloudfoundry/bosh/pull/2581

Apologies for any inconvenience caused. Setting up my workstation for the tests took much longer than anticipated.

s4heid avatar Oct 31 '24 17:10 s4heid

@s4heid - no worries. The setup is such a pain that we mostly rely on CI. I wish there was better tooling to get started.

I'll take a look at the new PR this week.

aramprice avatar Nov 05 '24 00:11 aramprice

I just had a chance to test the latest bosh release which incorporates the updates, and can confirm that the native disk update works as anticipated on Azure. With the director.enable_cpi_update_disk property activated, BOSH successfully reattached the existing disk (maintaining the same CID) to the VM without creating a new one.
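For reference, a minimal ops-file sketch for enabling the property; this assumes a bosh-deployment-style director manifest, and the exact path depends on your manifest layout:

```yaml
# Sketch: enable IaaS-native disk updates on the director.
- type: replace
  path: /instance_groups/name=bosh/jobs/name=director/properties/director/enable_cpi_update_disk?
  value: true
```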

enable_cpi_update_disk: true
$ bosh -e bosh task 20 --debug | bosh-ext debug-task -
Director Version  Task ID  Lines  Unknown Lines  
280.1.10          20       1246   0  

1 details

Started at                    Ended at                      Duration      Name  
Fri Nov 22 15:10:19 UTC 2024  Fri Nov 22 15:10:19 UTC 2024  19µs         0x16f44  
Fri Nov 22 15:10:20 UTC 2024  Fri Nov 22 15:12:37 UTC 2024  2m17.320049s  -  
~                             Fri Nov 22 15:12:37 UTC 2024  2m17.318559s  task:20  
Fri Nov 22 15:10:29 UTC 2024  Fri Nov 22 15:10:29 UTC 2024  28.338ms      binding agent state for (zookeeper/54cdcdc2-8494-4ad3-9283-034a5047766b (0)  
~                             Fri Nov 22 15:10:29 UTC 2024  25.203ms      binding agent state for (zookeeper/76b1b14e-cd1f-47f5-8687-3de160ae6ceb (1)  
Fri Nov 22 15:10:31 UTC 2024  Fri Nov 22 15:11:36 UTC 2024  1m5.036647s   canary_update(zookeeper/54cdcdc2-8494-4ad3-9283-034a5047766b (0))  
Fri Nov 22 15:10:50 UTC 2024  Fri Nov 22 15:12:20 UTC 2024  1m30.015167s  task:20-checkpoint  
Fri Nov 22 15:11:36 UTC 2024  Fri Nov 22 15:12:37 UTC 2024  1m1.215318s   instance_update(zookeeper/76b1b14e-cd1f-47f5-8687-3de160ae6ceb (1))

No create_disk calls in the CPI log:

$ bosh -e bosh task 20 --cpi | grep create_disk
enable_cpi_update_disk: false
$ bosh -e bosh task 28 --debug | bosh-ext debug-task -
Director Version  Task ID  Lines  Unknown Lines  
280.1.10          28       1292   0  

1 details

Started at                    Ended at                      Duration      Name  
Fri Nov 22 16:39:25 UTC 2024  Fri Nov 22 16:39:25 UTC 2024  18µs         0x4aec  
~                             Fri Nov 22 16:41:47 UTC 2024  2m21.423272s  -  
~                             Fri Nov 22 16:41:47 UTC 2024  2m21.421777s  task:28  
Fri Nov 22 16:39:34 UTC 2024  Fri Nov 22 16:39:34 UTC 2024  15.959ms      binding agent state for (zookeeper/76b1b14e-cd1f-47f5-8687-3de160ae6ceb (1)  
~                             Fri Nov 22 16:39:34 UTC 2024  13.54ms       binding agent state for (zookeeper/54cdcdc2-8494-4ad3-9283-034a5047766b (0)  
Fri Nov 22 16:39:36 UTC 2024  Fri Nov 22 16:40:48 UTC 2024  1m11.43286s   canary_update(zookeeper/54cdcdc2-8494-4ad3-9283-034a5047766b (0))  
Fri Nov 22 16:39:55 UTC 2024  Fri Nov 22 16:41:25 UTC 2024  1m30.01514s   task:28-checkpoint  
Fri Nov 22 16:40:48 UTC 2024  Fri Nov 22 16:41:46 UTC 2024  58.793101s    instance_update(zookeeper/76b1b14e-cd1f-47f5-8687-3de160ae6ceb (1))

create_disk was executed twice for the two instances in the test deployment:

$ bosh -e bosh task 28 --cpi | grep create_disk
I, [2024-11-22T16:39:43.168050 #23949]  INFO -- [req_id cpi-557985]: Starting create_disk...
I, [2024-11-22T16:39:44.766623 #23949 #2400] INFO -- [req_id cpi-557985]: create_disk(caching:None;disk_name:bosh-disk-data-3e7a5c9d-cdda-4239-8015-97da81ec3448;resource_group_name:rg-sh-iu, eastus, 32, Standard_LRS, 1)
I, [2024-11-22T16:39:47.263708 #23949]  INFO -- [req_id cpi-557985]: Finished create_disk in 4.1 seconds
I, [2024-11-22T16:40:53.855216 #24038]  INFO -- [req_id cpi-780706]: Starting create_disk...
I, [2024-11-22T16:40:55.463145 #24038 #2400] INFO -- [req_id cpi-780706]: create_disk(caching:None;disk_name:bosh-disk-data-5ce81810-a397-4f17-b647-3cc40cef4349;resource_group_name:rg-sh-iu, eastus, 32, Standard_LRS, 2)
I, [2024-11-22T16:40:58.120556 #24038]  INFO -- [req_id cpi-780706]: Finished create_disk in 4.27 seconds

Looking at the numbers, you might wonder why the deploy times are so similar. This is because disk creation only took about 4 seconds; the more time-consuming part, data transfer, wasn't exercised by my basic testing, as the disks in my test environment were essentially empty.

It's important to mention for Azure users that while the disk remains accessible during the update, performance may temporarily decrease due to the ongoing background process.

I believe we can consider this issue resolved, @MSSedusch.

s4heid avatar Nov 22 '24 17:11 s4heid

thanks @s4heid for the contribution!

MSSedusch avatar Nov 25 '24 07:11 MSSedusch