Initializing Azure instances is very slow

Open suquark opened this issue 3 years ago • 6 comments

It takes me 14 min to spin up a cluster with 2 CPU nodes.

The most time-consuming part is installing pip packages, especially azure-cli. This could be addressed by releasing images with azure-cli pre-installed.

suquark avatar Feb 15 '22 03:02 suquark

+1. I hit this slow initialization issue too. I might be missing something, but why does azure-cli need to be installed on the remote VM?

infwinston avatar Feb 15 '22 05:02 infwinston

Because ray-autoscaler is using it. For GCP and AWS, their CLIs are already installed.

suquark avatar Feb 15 '22 07:02 suquark

Oh, I see. So it's used on the head node to provision resources for the worker nodes? Is that correct?

infwinston avatar Feb 15 '22 21:02 infwinston

Hmm, it is mostly used by the Ray autoscaler for monitoring.

suquark avatar Feb 16 '22 01:02 suquark

I tried revisiting this issue briefly. For a cpunode:

  • Launching using the Azure web console: about 1.5 min from clicking "create" to being able to SSH in. Same VM image and region; the only difference is that it used an existing resource group.
  • Launching using sky launch (which means the Ray autoscaler, which means the Azure Python SDK): super slow, ~4-5 min from create to SSH; ~9 min total once the runtime is installed. Every step is slower than with the console.

I hacked the template by using the same resource group per region -- no speedup.

So the root cause seems to be Azure's Python SDK being much slower than their console. We can take a deeper look.

Typical output:

  • ~4-5 min from create to SSH
  • ~4 min to install the runtime

I 08-25 08:58:46 cloud_vm_ray_backend.py:892] To view detailed progress: tail -n100 -f /Users/zongheng/sky_logs/sky-2022-08-25-08-58-44-664631/provision.log
I 08-25 08:58:46 cloud_vm_ray_backend.py:1096] Launching on Azure eastus ()
I 08-25 09:01:53 cloud_vm_ray_backend.py:1131] Retrying head node provisioning due to head fetching timeout.
I 08-25 09:03:40 log_utils.py:45] Head node is up.
I 08-25 09:07:41 cloud_vm_ray_backend.py:984] Successfully provisioned or found existing VM.

concretevitamin avatar Aug 25 '22 16:08 concretevitamin
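
The per-step figures quoted above can be cross-checked against the log timestamps. The following is a minimal sketch, not part of SkyPilot: the helper names `parse_timestamp` and `elapsed_seconds` are hypothetical, the log prefix is assumed to be a single-letter level followed by `MM-DD HH:MM:SS`, and the year 2022 is an assumption taken from the log directory name, since the log lines themselves carry no year.

```python
import re
from datetime import datetime

# Assumed prefix format: "<level letter> MM-DD HH:MM:SS ", e.g. "I 08-25 08:58:46 ".
LOG_PREFIX = re.compile(r"^\w (\d{2}-\d{2} \d{2}:\d{2}:\d{2}) ")


def parse_timestamp(line, year=2022):
    """Return the datetime at the start of a log line, or None if absent.

    The year is not in the log line; 2022 is an assumption based on the
    log directory name in the quoted output.
    """
    match = LOG_PREFIX.match(line)
    if match is None:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %m-%d %H:%M:%S")


def elapsed_seconds(start_line, end_line, year=2022):
    """Seconds between the timestamps of two log lines."""
    start = parse_timestamp(start_line, year)
    end = parse_timestamp(end_line, year)
    if start is None or end is None:
        raise ValueError("line does not start with a recognizable timestamp")
    return (end - start).total_seconds()


launch = "I 08-25 08:58:46 cloud_vm_ray_backend.py:1096] Launching on Azure eastus ()"
head_up = "I 08-25 09:03:40 log_utils.py:45] Head node is up."
done = "I 08-25 09:07:41 cloud_vm_ray_backend.py:984] Successfully provisioned or found existing VM."

print(elapsed_seconds(launch, head_up))  # 294.0 (4 min 54 s from launch to head node up)
print(elapsed_seconds(head_up, done))    # 241.0 (4 min 1 s from head node up to done)
```

This agrees with the "~4-5 min from create to SSH" and "~4 min to install runtime" summary, for a total of just under 9 min.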

> So the root cause seems to be Azure's Python SDK being much slower than their console. We can take a deeper look.

It might be good to verify this hypothesis by provisioning a VM with their pure Python SDK (without the Ray autoscaler) and measuring the time. Here's an example.

romilbhardwaj avatar Aug 25 '22 17:08 romilbhardwaj
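
One way to run the measurement suggested above is a small timing harness wrapped around each provisioning step. This is only a sketch under stated assumptions: `timed` and `fake_provision` are hypothetical names, and `fake_provision` is a stand-in that sleeps instead of calling Azure. In a real experiment the body would presumably be replaced by a blocking SDK call, such as waiting on the poller returned by the compute client's `virtual_machines.begin_create_or_update(...)`; that substitution is an assumption and is not exercised here.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def fake_provision():
    # Hypothetical stand-in for a blocking provisioning call. It only sleeps,
    # so the harness can be run without cloud credentials.
    time.sleep(0.05)


results = {}
with timed("create_vm", results):
    fake_provision()

for label, seconds in results.items():
    print(f"{label}: {seconds:.2f}s")
```

Timing each stage separately (resource group, network, VM creation, first successful SSH) would show whether the SDK calls themselves account for the gap versus the web console.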

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar May 12 '23 21:05 github-actions[bot]

We should also keep this one open until we are satisfied with the speed on Azure.

infwinston avatar May 13 '23 00:05 infwinston

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Sep 10 '23 02:09 github-actions[bot]

This issue was closed because it has been stalled for 10 days with no activity.

github-actions[bot] avatar Sep 21 '23 01:09 github-actions[bot]

Can this be re-opened? It is still very slow today. For reference, a simple vLLM setup takes 18 min.

WesleyYue avatar Jun 05 '24 17:06 WesleyYue

Related: https://github.com/skypilot-org/skypilot/issues/3695

WesleyYue avatar Jul 05 '24 15:07 WesleyYue

This issue should be mitigated by #3704. Closing for now.

Michaelvll avatar Jul 15 '24 20:07 Michaelvll