
Separate "teardown" from "cleanup" phases

Open bpkroth opened this issue 2 years ago • 1 comments

In the AzureVMService right now we have a teardown phase that calls vm_deprovision (soon to be deprovision_host).

However, it doesn't actually deprovision the host. Rather, it currently calls deallocate: https://github.com/microsoft/MLOS/blob/83497e2a04aa3c367c2c2579a80ca7019f2a379f/mlos_bench/mlos_bench/services/remote/azure/azure_services.py#L74-L81 which simply releases the VM's node binding (so compute charges stop accruing), but it doesn't actually remove the VM's disks, NIC, IPs, etc. (which have their own costs).

I think there's value in having a "teardown" mode that simply shuts down and deallocates the VM without removing it entirely (as an aside, shutting down the VM from inside the guest OS doesn't deallocate the VM).

Additionally, some non-VM related resources could use "teardown" phases to manage cleaning up after a workload run, so the --no-teardown option isn't super great for that either.

Moreover, as mentioned in #468, there are other needs for more complete cleanup scripts to make sure we aren't overspending when we're done with experiment resources.

I think an approach we could take is to separate the "teardown" and "cleanup" phases more completely, with the latter potentially only invoked manually once an experiment is deemed to be over, whereas we may want to "deallocate" more frequently.

"cleanup" should probably only handle certain things like VMs and disks, but not shared storage, shared DB, shared VNets, etc.
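To make the proposed split concrete, here is a rough Python sketch. It is not the mlos_bench API: `FakeHostService`, `HostEnv`, `deallocate_host`, and the `shared` flag are all hypothetical names used only for illustration, and the fake service just records calls rather than talking to Azure. The idea is that `teardown()` only deallocates, while `cleanup()` deprovisions and skips anything marked as shared.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeHostService:
    """Hypothetical stand-in for a VM service; it only records the calls made."""

    calls: List[str] = field(default_factory=list)

    def deallocate_host(self) -> None:
        # Assumed to stop compute billing while keeping disks, NIC, IPs, etc.
        self.calls.append("deallocate_host")

    def deprovision_host(self) -> None:
        # Assumed to remove the VM and its dedicated resources entirely.
        self.calls.append("deprovision_host")


@dataclass
class HostEnv:
    """Hypothetical environment with separate teardown and cleanup phases."""

    service: FakeHostService
    shared: bool = False  # shared resources are never removed by cleanup()

    def teardown(self) -> None:
        # Cheap phase that may run after every trial: deallocate only.
        self.service.deallocate_host()

    def cleanup(self) -> None:
        # Destructive phase, invoked manually once the experiment is over.
        if self.shared:
            return
        self.service.deprovision_host()


if __name__ == "__main__":
    svc = FakeHostService()
    env = HostEnv(svc)
    env.teardown()
    env.cleanup()
    print(svc.calls)
```

With something like this, an option such as --no-teardown would only need to control the first phase, and the second could live behind a separate, explicit command.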

bpkroth avatar Aug 25 '23 22:08 bpkroth

See also #404

bpkroth avatar Oct 11 '23 16:10 bpkroth