skypilot
Failure in provisioning GCP E2/T2A instances
I tested the provisioning of the GCP VMs which will be added in the new GCP catalog (see their specs in https://cloud.google.com/compute/docs/machine-types):
- M1
- g1-small (0.5 vCPU)
- f1-micro (0.2 vCPU)
- N2D
- C2
- C2D
- T2D
- T2A (ARM-based VM)
- E2
Among these, I failed to get on-demand E2, on-demand/spot T2A, and on-demand/spot f1-micro instances. Here are the error messages:
- On-demand E2 (note that I could get a spot E2 instance though)
I 07-24 21:17:10 cloud_vm_ray_backend.py:1053] Launching on GCP us-west1 (us-west1-a)
I 07-24 21:17:19 cloud_vm_ray_backend.py:513] Got googleapiclient.errors.HttpError: <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/intercloud-320520/zones/us-west1-a/instances?alt=json returned "e2 instances do not support onHostMaintenance=TERMINATE unless they are preemptible.". Details: "[{'message': 'e2 instances do not support onHostMaintenance=TERMINATE unless they are preemptible.', 'domain': 'global', 'reason': 'badRequest'}]">
- On-demand/spot T2A (same errors)
Traceback (most recent call last):
File "/Users/woosuk/miniforge3/envs/sky/bin/sky", line 33, in <module>
sys.exit(load_entry_point('sky', 'console_scripts', 'sky')())
File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 108, in _record
return f(*args, **kwargs)
File "/Users/woosuk/workspace/sky-proj/sky/sky/cli.py", line 776, in invoke
return super().invoke(ctx)
File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 129, in _record
return f(*args, **kwargs)
File "/Users/woosuk/workspace/sky-proj/sky/sky/cli.py", line 1903, in cpunode
_create_and_ssh_into_node(
File "/Users/woosuk/workspace/sky-proj/sky/sky/cli.py", line 528, in _create_and_ssh_into_node
_launch_with_confirm(
File "/Users/woosuk/workspace/sky-proj/sky/sky/cli.py", line 464, in _launch_with_confirm
sky.launch(dag,
File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 129, in _record
return f(*args, **kwargs)
File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 129, in _record
return f(*args, **kwargs)
File "/Users/woosuk/workspace/sky-proj/sky/sky/execution.py", line 212, in launch
_execute(
File "/Users/woosuk/workspace/sky-proj/sky/sky/execution.py", line 139, in _execute
handle = backend.provision(task,
File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 129, in _record
return f(*args, **kwargs)
File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 108, in _record
return f(*args, **kwargs)
File "/Users/woosuk/workspace/sky-proj/sky/sky/backends/backend.py", line 49, in provision
return self._provision(task, to_provision, dryrun, stream_logs,
File "/Users/woosuk/workspace/sky-proj/sky/sky/backends/cloud_vm_ray_backend.py", line 1498, in _provision
config_dict = provisioner.provision_with_retries(
File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 129, in _record
return f(*args, **kwargs)
File "/Users/woosuk/workspace/sky-proj/sky/sky/backends/cloud_vm_ray_backend.py", line 1190, in provision_with_retries
config_dict = self._retry_region_zones(
File "/Users/woosuk/workspace/sky-proj/sky/sky/backends/cloud_vm_ray_backend.py", line 959, in _retry_region_zones
self._update_blocklist_on_error(to_provision.cloud, region,
File "/Users/woosuk/workspace/sky-proj/sky/sky/backends/cloud_vm_ray_backend.py", line 648, in _update_blocklist_on_error
return self._update_blocklist_on_gcp_error(region, zones, stdout,
File "/Users/woosuk/workspace/sky-proj/sky/sky/backends/cloud_vm_ray_backend.py", line 470, in _update_blocklist_on_gcp_error
exception_dict = ast.literal_eval(exception_str)
File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/ast.py", line 59, in literal_eval
node_or_string = parse(node_or_string, mode='eval')
File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/ast.py", line 47, in parse
return compile(source, filename, mode, flags,
File "<unknown>", line 1
wait_ready timeout exceeded.
^
SyntaxError: invalid syntax
- On-demand/spot f1-micro (no response after the head node is up)
I 07-24 21:52:03 cloud_vm_ray_backend.py:1053] Launching on GCP us-west1 (us-west1-a)
I 07-24 21:53:10 log_utils.py:45] Head node is up.
We may have to test and document the unsupported machine types.
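For the T2A case, the crash itself comes from `ast.literal_eval` being handed a free-form message ("wait_ready timeout exceeded.") instead of a dict literal. A minimal sketch of a more defensive parse is below. The helper name `parse_error_dict` is hypothetical and is not the actual code in `cloud_vm_ray_backend.py`; it only illustrates falling back gracefully when the string is not a Python literal.

```python
import ast
from typing import Optional


def parse_error_dict(exception_str: str) -> Optional[dict]:
    """Return the dict encoded in exception_str, or None if it is not one.

    Hypothetical helper: the name and return convention are assumptions,
    not part of the SkyPilot codebase.
    """
    try:
        value = ast.literal_eval(exception_str)
    except (ValueError, SyntaxError):
        # Free-form messages such as "wait_ready timeout exceeded." are not
        # Python literals, so literal_eval raises instead of returning.
        return None
    # Reject valid literals that are not dicts (lists, numbers, strings).
    return value if isinstance(value, dict) else None
```

A caller could then treat `None` as "unrecognized error" and surface the raw message rather than failing with a `SyntaxError`.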
Nice catch @WoosukKwon! Wdyt about leaving the unsupported VM types out of the catalog for now? This way users can get a nicer "VM type not found/supported" error.
In the future we can also consider adding a CLI that shows all supported VM types by looking into the catalogs.
@concretevitamin Sounds good. Then I think we can keep E2 and only remove t2a and f1-micro VMs. WDYT?
Sounds good.
@concretevitamin Added the filter in PR #1004.
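For readers following along, here is a rough sketch of what such a catalog filter could look like. It is not the code from the PR: the names `UNSUPPORTED_PREFIXES`, `UNSUPPORTED_NAMES`, and `filter_supported` are hypothetical, and the excluded types are simply the ones agreed on in this thread (T2A and f1-micro).

```python
from typing import Iterable, List

# Assumption: exclusions taken from the discussion above, not from the PR.
UNSUPPORTED_PREFIXES = ('t2a-',)   # ARM-based T2A family
UNSUPPORTED_NAMES = ('f1-micro',)  # shared-core type that hung after launch


def is_unsupported(instance_type: str) -> bool:
    """Return True if the instance type should be left out of the catalog."""
    name = instance_type.lower()
    return name in UNSUPPORTED_NAMES or name.startswith(UNSUPPORTED_PREFIXES)


def filter_supported(instance_types: Iterable[str]) -> List[str]:
    """Keep only instance types not listed as unsupported, preserving order."""
    return [t for t in instance_types if not is_unsupported(t)]
```

With a filter like this applied when the catalog is generated, requesting an excluded type would hit the ordinary "VM type not found/supported" path instead of failing during provisioning.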