[Provisioner] Multi-node provisioning for large clusters fails on GCP

Open romilbhardwaj opened this issue 3 years ago • 3 comments

Provisioning a multi-node cluster with `num_nodes` around 50 fails on GCP. The same spec works on AWS, and it works on GCP with 2 nodes.

Example YAML:

resources:
  cloud: gcp

num_nodes: 50

run: |
  echo $HOSTNAME
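
The cluster was launched with something like the following (command reconstructed from the log below; the exact invocation is not shown here):

sky launch -c tut provision_tutorial_docker.yaml

Output: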
Task from YAML spec: provision_tutorial_docker.yaml
I 10-16 16:21:49 optimizer.py:605] == Optimizer ==
I 10-16 16:21:49 optimizer.py:628] Estimated cost: $23.7 / hour
I 10-16 16:21:49 optimizer.py:628] 
I 10-16 16:21:49 optimizer.py:685] Considered resources (50 nodes):
I 10-16 16:21:49 optimizer.py:713] -------------------------------------------------------------------
I 10-16 16:21:49 optimizer.py:713]  CLOUD   INSTANCE       vCPUs   ACCELERATORS   COST ($)   CHOSEN   
I 10-16 16:21:49 optimizer.py:713] -------------------------------------------------------------------
I 10-16 16:21:49 optimizer.py:713]  GCP     n1-highmem-8   8       -              23.66         ✔     
I 10-16 16:21:49 optimizer.py:713] -------------------------------------------------------------------
I 10-16 16:21:49 optimizer.py:713] 
Launching a new cluster 'tut'. Proceed? [Y/n]: Y
I 10-16 16:21:56 cloud_vm_ray_backend.py:2705] Creating a new cluster: "tut" [50x GCP(n1-highmem-8)].
I 10-16 16:21:56 cloud_vm_ray_backend.py:2705] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 10-16 16:21:56 cloud_vm_ray_backend.py:903] To view detailed progress: tail -n100 -f /home/romilb/sky_logs/sky-2022-10-16-16-21-48-994226/provision.log
I 10-16 16:22:00 cloud_vm_ray_backend.py:1153] Launching on GCP us-central1 (us-central1-a)
I 10-16 16:23:08 log_utils.py:45] Head node is up.
I 10-16 16:24:14 cloud_vm_ray_backend.py:1260] Successfully provisioned or found existing head VM. Waiting for workers.
E 10-16 16:25:32 backend_utils.py:914] Timed out: waited for more than 60 seconds for new workers to be provisioned, but no progress.
E 10-16 16:25:32 cloud_vm_ray_backend.py:1031] *** Failed provisioning the cluster. ***
E 10-16 16:25:32 cloud_vm_ray_backend.py:1033] *** Terminating the failed cluster. ***
I 10-16 16:38:37 cloud_vm_ray_backend.py:1153] Launching on GCP us-west1 (us-west1-a)
I 10-16 16:39:49 log_utils.py:45] Head node is up.
I 10-16 16:40:51 cloud_vm_ray_backend.py:1260] Successfully provisioned or found existing head VM. Waiting for workers.
E 10-16 16:42:09 backend_utils.py:914] Timed out: waited for more than 60 seconds for new workers to be provisioned, but no progress.
E 10-16 16:42:09 cloud_vm_ray_backend.py:1031] *** Failed provisioning the cluster. ***
E 10-16 16:42:09 cloud_vm_ray_backend.py:1033] *** Terminating the failed cluster. ***

I can see the workers ready in the Google Cloud console, but the terminal shows 0 of 49 workers ready. `ray status` also shows it is still waiting for workers (see the provisioning logs).

Might this be caused by the timeout here being too short? https://github.com/skypilot-org/skypilot/blob/2d4fee3a30ad0a7c8a5a7a7e0f5ff62f332a9708/sky/backends/cloud_vm_ray_backend.py#L63-L65
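
For reference, the kind of progress check being discussed works roughly like this (a minimal sketch, not the actual SkyPilot implementation; the constant name is taken from the linked lines, the 60-second value from the error message above, and get_ready_worker_count is a hypothetical helper):

import time

_NODES_LAUNCHING_PROGRESS_TIMEOUT = 60  # seconds; value assumed from the error message

def wait_for_workers(get_ready_worker_count, expected_workers):
    # Fail over if the ready-worker count makes no progress within the timeout.
    last_count = get_ready_worker_count()
    last_progress_time = time.time()
    while last_count < expected_workers:
        time.sleep(10)
        count = get_ready_worker_count()
        if count > last_count:
            last_count, last_progress_time = count, time.time()
        elif time.time() - last_progress_time > _NODES_LAUNCHING_PROGRESS_TIMEOUT:
            raise TimeoutError(
                'Timed out: waited for more than '
                f'{_NODES_LAUNCHING_PROGRESS_TIMEOUT} seconds for new workers '
                'to be provisioned, but no progress.')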

romilbhardwaj avatar Oct 17 '22 00:10 romilbhardwaj

Great finding @romilbhardwaj! I just tried it out with an infinite `_NODES_LAUNCHING_PROGRESS_TIMEOUT`, and it seems that `ray status` takes about 6 minutes to report that the first 10 workers have been successfully added to the Ray cluster.

It is a bit tricky to increase the timeout, as that also increases the overhead of failing over to another region. For example, suppose we increase `_NODES_LAUNCHING_PROGRESS_TIMEOUT` to 10 minutes and a user tries to launch 5 nodes. If each worker node becomes available just before the timeout expires, but the last one fails due to a quota issue (we currently have no way to distinguish a quota issue from plain resource unavailability for multi-node launches), then it takes head_provision_time + 10 * 4 > 40 minutes before our system starts to fail over to another region. That means it can take more than 10 * 40 min > 6 hours to go through the US regions in GCP.
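
As a quick sanity check on that worst case (a back-of-the-envelope sketch; the head provisioning time and the number of US regions are assumptions):

# Worst-case failover time if the per-worker progress timeout is raised to 10 minutes.
head_provision_time_min = 4    # assumed; varies in practice
timeout_min = 10               # proposed _NODES_LAUNCHING_PROGRESS_TIMEOUT
num_workers = 4                # 5 nodes = 1 head + 4 workers
num_us_regions = 10            # rough count of GCP US regions

per_region_min = head_provision_time_min + timeout_min * num_workers  # 44 min
total_min = per_region_min * num_us_regions                           # 440 min, ~7.3 hours
print(per_region_min, total_min)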

Michaelvll avatar Oct 17 '22 04:10 Michaelvll

Good point! I'm tempted to suggest using the Google Cloud Python API to check worker status (how many instances have been provisioned), since those calls are fairly fast, and making an early failover decision based on that instead of waiting for Ray to report the ready worker count. However, that might be a little brittle since we would be side-stepping the Ray autoscaler...
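
A rough sketch of that direct check, using the Google Cloud Python client (project and zone are placeholders, and the label filter follows the Ray autoscaler's GCP naming conventions, which is an assumption here):

import googleapiclient.discovery

def count_running_workers(project, zone, cluster_name):
    # Count RUNNING worker instances for this cluster via the Compute Engine API.
    # Note: RUNNING only means the VM is up, not that Ray has started on it.
    compute = googleapiclient.discovery.build('compute', 'v1')
    request = compute.instances().list(
        project=project,
        zone=zone,
        filter=(f'labels.ray-cluster-name={cluster_name} AND '
                'labels.ray-node-type=worker AND status=RUNNING'))
    count = 0
    while request is not None:
        response = request.execute()
        count += len(response.get('items', []))
        request = compute.instances().list_next(request, response)
    return count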

romilbhardwaj avatar Oct 17 '22 06:10 romilbhardwaj

Another problem found while trying out this issue: our `sky down cluster_name` for the 50-node cluster does not work well, producing the following error.

E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Failed to terminate large-scale-50. If you want to ignore this error and remove the cluster from the status table, use `sky down --purge`.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] **** STDOUT ****
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:36,235  INFO commands.py:348 -- Checking External environment settings
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:40,638  INFO commands.py:403 -- Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:40,650  VWARN commands.py:317 -- Loaded cached provider configuration from /var/folders/bx/cnr_jmsj4n79c9p0q4v68xvw0000gp/T/ray-config-8b1e38e306017a2bfb0f13baf6f80f8b9d6c358b
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:40,651  WARN commands.py:323 -- If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:43,175  INFO command_runner.py:394 -- Fetched IP: 34.172.138.58
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:43,177  VINFO command_runner.py:552 -- Running `ray stop`
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:43,177  VVINFO command_runner.py:554 -- Full command is `ssh -tt -i ~/.ssh/sky-key -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_638d9623ae/12b6a76eef/%C -o ControlPersist=10s -o ConnectTimeout=120s [email protected] bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (ray stop)'`
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Stopped all 7 Ray processes.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] **** STDERR ****
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:35,625  INFO util.py:335 -- setting max workers for head node type to 0
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:40,650  INFO util.py:335 -- setting max workers for head node type to 0
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Warning: Permanently added '34.172.138.58' (ED25519) to the list of known hosts.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Shared connection to 34.172.138.58 closed.
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:50:52,246  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-head-4cb2c524-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:51:08,761  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-00ff3de2-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:52:12,209  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-028f54f7-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:53:16,490  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-0ddbb16e-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:54:21,075  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-0eeba840-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:55:25,889  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-1ba846d1-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:56:29,988  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-3927d51f-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:57:34,198  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-3eb35fc8-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:58:38,301  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-4d9e523b-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 21:59:41,991  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-4edccf2f-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 2022-10-16 22:00:45,790  INFO node_provider.py:162 -- NodeProvider: ray-large-scale-50-worker-5b4b9871-compute: Terminating node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] Traceback (most recent call last):
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/bin/ray", line 8, in <module>
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     sys.exit(main())
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/scripts/scripts.py", line 2339, in main
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return cli()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return self.main(*args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1053, in main
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     rv = self.invoke(ctx)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return _process_result(sub_ctx.command.invoke(sub_ctx))
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return ctx.invoke(self.callback, **ctx.params)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 754, in invoke
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return __callback(*args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/autoscaler/_private/cli_logger.py", line 852, in wrapper
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return f(*args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1223, in down
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     teardown_cluster(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/autoscaler/_private/commands.py", line 510, in teardown_cluster
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     provider.terminate_nodes(A)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/autoscaler/node_provider.py", line 163, in terminate_nodes
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     self.terminate_node(node_id)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node_provider.py", line 54, in method_with_retries
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return method(self, *args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node_provider.py", line 257, in terminate_node
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     result = resource.delete_instance(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node.py", line 598, in delete_instance
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     self.resource.instances()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return wrapped(*args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/http.py", line 923, in execute
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     resp, content = _retry_request(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/http.py", line 222, in _retry_request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     raise exception
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/http.py", line 191, in _retry_request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     resp, content = http.request(uri, method, *args, **kwargs)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/google_auth_httplib2.py", line 218, in request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     response, content = self.http.request(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/httplib2/__init__.py", line 1725, in request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     (response, content) = self._request(
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/httplib2/__init__.py", line 1441, in _request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     (response, content) = self._conn_request(conn, request_uri, method, body, headers)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/httplib2/__init__.py", line 1393, in _conn_request
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     response = conn.getresponse()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/http/client.py", line 1348, in getresponse
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     response.begin()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/http/client.py", line 316, in begin
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     version, status, reason = self._read_status()
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/http/client.py", line 277, in _read_status
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/socket.py", line 669, in readinto
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return self._sock.recv_into(b)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/ssl.py", line 1241, in recv_into
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return self.read(nbytes, buffer)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]   File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/ssl.py", line 1099, in read
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545]     return self._sslobj.read(len, buffer)
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] socket.timeout: The read operation timed out
E 10-16 22:01:45 cloud_vm_ray_backend.py:2545] 

Michaelvll avatar Oct 17 '22 16:10 Michaelvll

I hit this error today trying to launch a cluster with 16 nodes. My YAML config is set up to download a Docker image from a private GCR registry. Downloading the image takes a couple of minutes, although I'm not sure whether that is the cause of the timeout.

Terminating the cluster also takes longer than it should, since it stops/deletes the GCP instances one by one when it could do so in parallel (resulting in extra cost for the idle instances).
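
A minimal sketch of what parallel termination could look like with the Compute Engine API (not how the current node provider is implemented; project, zone, and the instance names are placeholders):

from concurrent.futures import ThreadPoolExecutor

import googleapiclient.discovery

def terminate_instances_in_parallel(project, zone, instance_names, max_workers=16):
    # Submit all delete requests concurrently instead of one by one; each delete
    # returns a zonal operation that GCP completes asynchronously.
    compute = googleapiclient.discovery.build('compute', 'v1')

    def _delete(name):
        return compute.instances().delete(
            project=project, zone=zone, instance=name).execute()

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(_delete, instance_names))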

juancamilog avatar Mar 31 '23 22:03 juancamilog

Thanks for the report @juancamilog. cc @suquark: according to the discussion above, it seems like having a GCP implementation of the new provisioner interface would solve this problem. Do you see an easy way to fix it in the current provisioner?

concretevitamin avatar Apr 01 '23 04:04 concretevitamin

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Jul 31 '23 02:07 github-actions[bot]

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] avatar Nov 29 '23 02:11 github-actions[bot]

This is being solved by #2681

Michaelvll avatar Nov 29 '23 02:11 Michaelvll