skypilot
[TPU VM] Attaching & Mounting Persistent Disk
Issue Reference: #2778
When launching a TPU VM with sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200, the resulting VM is still initialized with the default disk size of 100 GB. Because the boot disk of a TPU VM is not resizable, users have to attach a persistent disk to expand their local disk capacity.
tpu_vm.yaml:
resources:
  accelerators: tpu-v2-8
  accelerator_args:
    runtime_version: tpu-vm-base
Solution
We currently use the Cloud TPU API to manage TPU VMs (e.g. create_instance, set_labels, and delete_instance). However, this API lacks functionality for disk attachment, so this PR uses the GCP CLI to attach a persistent disk to TPU VMs (Documentation).
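The CLI-based approach described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the helper name build_disk_commands, the TPU name, and the zone are hypothetical, and the gcloud subcommands are assumed to match the attach-disk workflow in the Cloud TPU documentation (the attach command is assumed to live under the alpha track). The helper only builds the command lines; it does not run them.

```python
import shlex
from typing import List


def build_disk_commands(tpu_name: str, disk_name: str, zone: str,
                        size_gb: int) -> List[List[str]]:
    """Build gcloud invocations that create a disk and attach it to a TPU VM.

    Hypothetical helper: the commands are returned as argument lists so they
    can be inspected, logged, or passed to subprocess.run by the caller.
    """
    # Create a standalone persistent disk in the same zone as the TPU VM.
    create = [
        'gcloud', 'compute', 'disks', 'create', disk_name,
        f'--size={size_gb}GB', f'--zone={zone}',
    ]
    # Attach it to the TPU VM in read-write mode (assumed alpha-track command).
    attach = [
        'gcloud', 'alpha', 'compute', 'tpus', 'tpu-vm', 'attach-disk',
        tpu_name, f'--zone={zone}', f'--disk={disk_name}',
        '--mode=read-write',
    ]
    return [create, attach]


if __name__ == '__main__':
    # Placeholder TPU name and zone; the disk name follows the pattern seen
    # in the Test 2 error message.
    for cmd in build_disk_commands('mycluster-d9a3',
                                   'mycluster-d9a3-tpu-extra-disk',
                                   'us-central1-b', 100):
        print(shlex.join(cmd))
```

Returning argument lists instead of a single shell string avoids quoting issues if the caller later executes them with subprocess.run.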
Test 1:
- Launch the TPU VM with a specified disk size, then stop it:
  sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200
  sky stop mycluster
- Restart the TPU VM with the same disk size:
  sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200
- Verified that an extra 100 GB disk has been created and attached to the TPU VM.
- Ensured that the disk is mounted under /mnt/disks/persist.
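The mount check in Test 1 can be repeated from a Python shell on the TPU VM. The snippet below is a small sketch using only the standard library; the helper name is_mounted is hypothetical, and the default path is the one reported in Test 1.

```python
import os

# Mount path reported in Test 1 of this description.
MOUNT_PATH = '/mnt/disks/persist'


def is_mounted(path: str = MOUNT_PATH) -> bool:
    """Return True if `path` exists and is a mount point."""
    return os.path.ismount(path)


if __name__ == '__main__':
    print(f'{MOUNT_PATH} mounted: {is_mounted()}')
```

A False result before the cluster has been restarted would be consistent with the note that disk attachment only takes effect on restart.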
Test 2:
- Relaunch the TPU VM multiple times.
- Received error:
  Disk creation failed: The resource 'projects/project_name/zones/zone_name/disks/mycluster-d9a3-tpu-extra-disk' already exists
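This description does not say whether a relaunch should reuse the existing disk or surface the error. If a caller wanted to tell this case apart from other disk-creation failures, one option is to match on the message text, as in the sketch below. The helper name is hypothetical, and matching on error strings is an assumption made for illustration, not something the PR is known to do.

```python
def is_disk_already_exists_error(message: str) -> bool:
    """Heuristically detect a 'disk already exists' failure from its text.

    Hypothetical helper: it relies on the wording seen in Test 2 and may miss
    differently worded errors.
    """
    return '/disks/' in message and 'already exists' in message


if __name__ == '__main__':
    sample = ("Disk creation failed: The resource "
              "'projects/project_name/zones/zone_name/disks/"
              "mycluster-d9a3-tpu-extra-disk' already exists")
    print(is_disk_already_exists_error(sample))
```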
Test 3:
- Launch the TPU VM with a disk size smaller than 100 GB, then stop it:
  sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 80
  sky stop mycluster
- Restart the TPU VM with the same disk size:
  sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 80
- Verified that no extra disk has been created.
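Tests 1 and 3 together suggest a simple sizing rule: the extra disk covers whatever the request exceeds the 100 GB boot disk by, and no disk is created otherwise. The sketch below encodes that reading. The function name is hypothetical, and the behavior at exactly 100 GB (no extra disk) is an assumption, since neither test covers it.

```python
# Boot disk size of a TPU VM as stated in this description.
DEFAULT_BOOT_DISK_GB = 100


def extra_disk_size_gb(requested_gb: int,
                       boot_gb: int = DEFAULT_BOOT_DISK_GB) -> int:
    """Return the size of the extra persistent disk, or 0 if none is needed."""
    # Only the portion beyond the fixed boot disk needs a separate disk.
    return max(0, requested_gb - boot_gb)


if __name__ == '__main__':
    print(extra_disk_size_gb(200))  # Test 1 request
    print(extra_disk_size_gb(80))   # Test 3 request
```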
Test 4:
pytest tests/test_smoke.py --tpu
Note:
- Disk attachment only takes effect when the cluster is restarted.