[TPU VM] Attaching & Mounting Persistent Disk

Open jackyk02 opened this issue 9 months ago • 0 comments

Issue Reference: #2778

When launching a TPU VM with sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200, the resulting VM is still initialized with a disk size of 100 GB (default size). Users have to add a persistent disk to expand their local disk capacity as the boot disk of TPU VMs is not resizable.

tpu_vm.yaml:

resources:
   accelerators: tpu-v2-8
   accelerator_args:
      runtime_version: tpu-vm-base

Solution:

We currently use the Cloud TPU API to manage TPU VMs (e.g. create_instance, set_labels, and delete_instance). However, this API lacks functionality for disk attachment, so this PR uses the GCP CLI to attach a persistent disk to TPU VMs (Documentation).
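A rough sketch of the CLI route described above, not the PR's actual code. It only builds the two gcloud command lines (create a persistent disk, then attach it to the TPU VM) and prints them without running anything. The helper names, the zone, and the TPU name are hypothetical placeholders. The attach-disk subcommand is shown under the alpha track as an assumption based on the Cloud TPU documentation and may differ in your gcloud version.

```python
import shlex
from typing import List


def disk_create_cmd(disk_name: str, zone: str, size_gb: int) -> List[str]:
    """Build (but do not run) a gcloud command that creates a persistent disk."""
    return [
        "gcloud", "compute", "disks", "create", disk_name,
        f"--size={size_gb}GB",
        f"--zone={zone}",
    ]


def tpu_attach_disk_cmd(tpu_name: str, zone: str, disk_name: str) -> List[str]:
    """Build (but do not run) a gcloud command that attaches a disk to a TPU VM.

    Assumption: the subcommand lives under `gcloud alpha compute tpus tpu-vm`.
    """
    return [
        "gcloud", "alpha", "compute", "tpus", "tpu-vm", "attach-disk", tpu_name,
        f"--zone={zone}",
        f"--disk={disk_name}",
        "--mode=read-write",
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    zone = "us-central1-b"
    disk = "mycluster-d9a3-tpu-extra-disk"
    tpu = "mycluster-d9a3"
    for cmd in (disk_create_cmd(disk, zone, 100), tpu_attach_disk_cmd(tpu, zone, disk)):
        print(shlex.join(cmd))
```

Returning argument lists rather than shell strings keeps the commands safe to pass to subprocess.run without shell quoting concerns.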

Test 1:

  1. Launch the TPU VM with a specified disk size, then stop it: sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200, followed by sky stop mycluster

  2. Restart the TPU VM with the same disk size: sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 200

  3. Verified that an extra 100 GB disk has been created and attached to the TPU VM

  4. Ensured that the disk is mounted under the path /mnt/disks/persist
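The mount check in step 4 could be scripted along these lines. This is a minimal sketch, assuming a Linux host where /proc/mounts lists one mount per line with the mount point in the second field. The function name is hypothetical and the device /dev/sdb in the sample text is an assumption, not something stated in this PR.

```python
def is_mounted(mount_point: str, mounts_text: str) -> bool:
    """Return True if mount_point is a mount target in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 0 is the device, field 1 is the mount point.
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


if __name__ == "__main__":
    # Sample text standing in for the contents of /proc/mounts (device name assumed).
    sample = (
        "/dev/sda1 / ext4 rw,relatime 0 0\n"
        "/dev/sdb /mnt/disks/persist ext4 rw,relatime,discard 0 0\n"
    )
    print(is_mounted("/mnt/disks/persist", sample))  # True
```

On the TPU VM itself, the second argument would be the text read from /proc/mounts.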

Test 2:

  1. Relaunch the TPU VM multiple times
  2. Received error: Disk creation failed: The resource 'projects/project_name/zones/zone_name/disks/mycluster-d9a3-tpu-extra-disk' already exists
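One possible way to handle the failure seen in Test 2 is to treat an "already exists" error as a signal to reuse the disk instead of failing the relaunch. The sketch below is an illustration only, not what this PR does. The function name and the three result labels are hypothetical, and matching on the error text is an assumption about how the CLI reports this case.

```python
def classify_disk_create_result(returncode: int, stderr: str) -> str:
    """Map the outcome of a disk-creation command to a next action.

    Returns "created" on success, "reuse" if the disk already exists,
    and "error" for any other failure.
    """
    if returncode == 0:
        return "created"
    if "already exists" in stderr:
        # The disk from a previous launch is still there; attach it as-is.
        return "reuse"
    return "error"


if __name__ == "__main__":
    message = (
        "Disk creation failed: The resource "
        "'projects/project_name/zones/zone_name/disks/"
        "mycluster-d9a3-tpu-extra-disk' already exists"
    )
    print(classify_disk_create_result(1, message))  # reuse
```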

Test 3:

  1. Launch the TPU VM with a disk size smaller than 100 GB, then stop it: sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 80, followed by sky stop mycluster

  2. Restart the TPU VM with the same disk size: sky launch tpu_vm.yaml -c mycluster --cloud gcp --disk-size 80

  3. Verified that no extra disk has been created
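The sizing behaviour observed in Tests 1 and 3 (200 GB requested gives a 100 GB extra disk, 80 GB requested gives none) is consistent with the rule sketched below. The function and constant names are hypothetical, and treating a request of exactly 100 GB as "no extra disk" is an assumption that the tests above do not cover.

```python
from typing import Optional

# Default TPU VM boot disk size reported in this issue.
DEFAULT_BOOT_DISK_GB = 100


def extra_disk_size_gb(requested_gb: int,
                       boot_gb: int = DEFAULT_BOOT_DISK_GB) -> Optional[int]:
    """Return the size of the extra persistent disk, or None if none is needed."""
    extra = requested_gb - boot_gb
    return extra if extra > 0 else None


if __name__ == "__main__":
    print(extra_disk_size_gb(200))  # 100
    print(extra_disk_size_gb(80))   # None
```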

Test 4: pytest tests/test_smoke.py --tpu

Note:

  1. Disk attachment only takes effect when the cluster is restarted.

jackyk02, Apr 29 '24 03:04