skypilot
Initializing Azure instances is very slow
It takes me 14 min to spin up a cluster with 2 CPU nodes.
The most time-consuming part is installing pip packages, especially azure-cli. This may be addressed by releasing images with azure-cli pre-installed.
+1. I had this slow initialization issue too. I might be missing something, but why does azure-cli need to be installed on the remote VM?
Because ray-autoscaler is using it. For GCP and AWS, their CLIs are already installed.
Oh, I see. So it's used on the head node to provision resources for the worker nodes? Is that correct?
hmmm, it is mostly used by ray autoscaler for monitoring
I tried revisiting this issue briefly. For a cpunode:
- Launching using the Azure web console: about 1.5 min from clicking "create" to being able to SSH in. Same VM image and region; the only difference is that I used an existing resource group.
- Launching using sky launch (which means ray autoscaler, which means the Azure Python SDK): super slow, ~4-5 min from create to SSH; total ~9 min (after installing the runtime). Every step is slower than the console.
I hacked the template by using the same resource group per region -- no speedup.
So the root cause seems to be Azure's python SDK being much slower than their console. We can take a deeper look.
Typical output
- ~4-5min from create to SSH
- ~4min to install runtime
I 08-25 08:58:46 cloud_vm_ray_backend.py:892] To view detailed progress: tail -n100 -f /Users/zongheng/sky_logs/sky-2022-08-25-08-58-44-664631/provision.log
I 08-25 08:58:46 cloud_vm_ray_backend.py:1096] Launching on Azure eastus ()
I 08-25 09:01:53 cloud_vm_ray_backend.py:1131] Retrying head node provisioning due to head fetching timeout.
I 08-25 09:03:40 log_utils.py:45] Head node is up.
I 08-25 09:07:41 cloud_vm_ray_backend.py:984] Successfully provisioned or found existing VM.
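For reference, the two estimates above can be recovered from the log timestamps. This is only a rough sketch: it assumes the year 2022 (taken from the log directory name), and it reads "Head node is up" as the point where SSH works and the final line as the end of runtime installation. `parse_ts` and `elapsed_seconds` are hypothetical helper names, not part of SkyPilot.

```python
import re
from datetime import datetime

# Three lines copied from the provisioning output above.
LOG_LINES = [
    "I 08-25 08:58:46 cloud_vm_ray_backend.py:1096] Launching on Azure eastus ()",
    "I 08-25 09:03:40 log_utils.py:45] Head node is up.",
    "I 08-25 09:07:41 cloud_vm_ray_backend.py:984] Successfully provisioned or found existing VM.",
]

# Matches the "I MM-DD HH:MM:SS " prefix of each log line.
TS_PATTERN = re.compile(r"^I (\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ")


def parse_ts(line, year=2022):
    """Extract the timestamp from one log line (the year is an assumption)."""
    match = TS_PATTERN.match(line)
    if match is None:
        raise ValueError(f"no timestamp found in: {line!r}")
    return datetime.strptime(f"{year}-{match.group(1)}", "%Y-%m-%d %H:%M:%S")


def elapsed_seconds(start_line, end_line):
    """Wall-clock seconds between two log lines."""
    return (parse_ts(end_line) - parse_ts(start_line)).total_seconds()


create_to_ssh = elapsed_seconds(LOG_LINES[0], LOG_LINES[1])
runtime_install = elapsed_seconds(LOG_LINES[1], LOG_LINES[2])

print(f"create -> head node up: {create_to_ssh:.0f} s")
print(f"head node up -> provisioned: {runtime_install:.0f} s")
print(f"total: {create_to_ssh + runtime_install:.0f} s")
```

On these three lines it gives 294 s from launch to head node up and 241 s for the rest, about 9 min in total, which matches the rough numbers in the bullets.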
Might be good to verify this hypothesis by using their pure python SDK (without ray autoscaler) to provision a VM and measure time. Here's an example.
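One way to set up that measurement is a small timing harness that runs each provisioning step and records its wall-clock time. This is just a sketch: `time_steps` is a hypothetical helper, and the step functions below are placeholders that sleep instead of talking to Azure. In a real run each step would wrap an SDK call (for example, creating the VM and waiting on the returned poller), which is not shown here.

```python
import time
from typing import Callable, Dict


def time_steps(steps: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each step in order and return its wall-clock duration in seconds."""
    timings = {}
    for name, step in steps.items():
        start = time.perf_counter()
        step()
        timings[name] = time.perf_counter() - start
    return timings


def fake_create_vm():
    # Placeholder (assumption): replace with the real SDK call that creates
    # the VM and blocks until the operation completes.
    time.sleep(0.02)


def fake_wait_for_ssh():
    # Placeholder (assumption): replace with a loop that retries an SSH
    # connection to the new VM until it succeeds.
    time.sleep(0.01)


if __name__ == "__main__":
    results = time_steps({
        "create_vm": fake_create_vm,
        "wait_for_ssh": fake_wait_for_ssh,
    })
    for name, seconds in results.items():
        print(f"{name}: {seconds:.2f} s")
```

Running the same steps once through the SDK directly and once through sky launch, in the same region and with the same image, would show whether the gap comes from the SDK itself or from the autoscaler on top of it.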
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
We should also keep this one open until we are satisfied with the speed on Azure.
This issue was closed because it has been stalled for 10 days with no activity.
Can this be re-opened? Still very slow today. For reference, a simple vllm setup takes 18 mins.
Related https://github.com/skypilot-org/skypilot/issues/3695
This issue should be mitigated by #3704. Closing for now.