Failure in provisioning GCP E2/T2A instances

Open WoosukKwon opened this issue 2 years ago • 4 comments

I tested provisioning the GCP VM types that will be added to the new GCP catalog (see their specs at https://cloud.google.com/compute/docs/machine-types):

  • M1
  • g1-small (0.5 vCPU)
  • f1-micro (0.2 vCPU)
  • N2D
  • C2
  • C2D
  • T2D
  • T2A (ARM-based VM)
  • E2

Among these, I failed to get on-demand E2, on-demand/spot T2A, and on-demand/spot f1-micro instances. Here are the error messages:

  • On-demand E2 (note that I could get a spot E2 instance though)
I 07-24 21:17:10 cloud_vm_ray_backend.py:1053] Launching on GCP us-west1 (us-west1-a)
I 07-24 21:17:19 cloud_vm_ray_backend.py:513] Got googleapiclient.errors.HttpError: <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/intercloud-320520/zones/us-west1-a/instances?alt=json returned "e2 instances do not support onHostMaintenance=TERMINATE unless they are preemptible.". Details: "[{'message': 'e2 instances do not support onHostMaintenance=TERMINATE unless they are preemptible.', 'domain': 'global', 'reason': 'badRequest'}]">
  • On-demand/spot T2A (same errors)
Traceback (most recent call last):
  File "/Users/woosuk/miniforge3/envs/sky/bin/sky", line 33, in <module>
    sys.exit(load_entry_point('sky', 'console_scripts', 'sky')())
  File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 108, in _record
    return f(*args, **kwargs)
  File "/Users/woosuk/workspace/sky-proj/sky/sky/cli.py", line 776, in invoke
    return super().invoke(ctx)
  File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 129, in _record
    return f(*args, **kwargs)
  File "/Users/woosuk/workspace/sky-proj/sky/sky/cli.py", line 1903, in cpunode
    _create_and_ssh_into_node(
  File "/Users/woosuk/workspace/sky-proj/sky/sky/cli.py", line 528, in _create_and_ssh_into_node
    _launch_with_confirm(
  File "/Users/woosuk/workspace/sky-proj/sky/sky/cli.py", line 464, in _launch_with_confirm
    sky.launch(dag,
  File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 129, in _record
    return f(*args, **kwargs)
  File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 129, in _record
    return f(*args, **kwargs)
  File "/Users/woosuk/workspace/sky-proj/sky/sky/execution.py", line 212, in launch
    _execute(
  File "/Users/woosuk/workspace/sky-proj/sky/sky/execution.py", line 139, in _execute
    handle = backend.provision(task,
  File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 129, in _record
    return f(*args, **kwargs)
  File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 108, in _record
    return f(*args, **kwargs)
  File "/Users/woosuk/workspace/sky-proj/sky/sky/backends/backend.py", line 49, in provision
    return self._provision(task, to_provision, dryrun, stream_logs,
  File "/Users/woosuk/workspace/sky-proj/sky/sky/backends/cloud_vm_ray_backend.py", line 1498, in _provision
    config_dict = provisioner.provision_with_retries(
  File "/Users/woosuk/workspace/sky-proj/sky/sky/utils/common_utils.py", line 129, in _record
    return f(*args, **kwargs)
  File "/Users/woosuk/workspace/sky-proj/sky/sky/backends/cloud_vm_ray_backend.py", line 1190, in provision_with_retries
    config_dict = self._retry_region_zones(
  File "/Users/woosuk/workspace/sky-proj/sky/sky/backends/cloud_vm_ray_backend.py", line 959, in _retry_region_zones
    self._update_blocklist_on_error(to_provision.cloud, region,
  File "/Users/woosuk/workspace/sky-proj/sky/sky/backends/cloud_vm_ray_backend.py", line 648, in _update_blocklist_on_error
    return self._update_blocklist_on_gcp_error(region, zones, stdout,
  File "/Users/woosuk/workspace/sky-proj/sky/sky/backends/cloud_vm_ray_backend.py", line 470, in _update_blocklist_on_gcp_error
    exception_dict = ast.literal_eval(exception_str)
  File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/ast.py", line 59, in literal_eval
    node_or_string = parse(node_or_string, mode='eval')
  File "/Users/woosuk/miniforge3/envs/sky/lib/python3.8/ast.py", line 47, in parse
    return compile(source, filename, mode, flags,
  File "<unknown>", line 1
    wait_ready timeout exceeded.
               ^
SyntaxError: invalid syntax
  • On-demand/spot f1-micro (no response after the head node is up)
I 07-24 21:52:03 cloud_vm_ray_backend.py:1053] Launching on GCP us-west1 (us-west1-a)
I 07-24 21:53:10 log_utils.py:45] Head node is up.
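
For the E2 case, the API message says that non-preemptible E2 instances reject onHostMaintenance=TERMINATE. A minimal sketch of choosing the scheduling block per machine family follows. `scheduling_for` is a hypothetical helper, not SkyPilot code. The TERMINATE default for other families is an assumption that mirrors what the failing request appears to have sent, and GPU-attached VMs are not handled here.

```python
from typing import Dict, Union


def scheduling_for(instance_type: str,
                   use_spot: bool) -> Dict[str, Union[str, bool]]:
    """Build a Compute Engine `scheduling` block (hypothetical helper)."""
    # 'e2-standard-4' -> 'e2'
    family = instance_type.split('-')[0].lower()
    if use_spot:
        # Preemptible VMs cannot live-migrate, so TERMINATE is required.
        return {
            'preemptible': True,
            'onHostMaintenance': 'TERMINATE',
            'automaticRestart': False,
        }
    if family == 'e2':
        # Per the error above, on-demand E2 does not accept TERMINATE.
        return {'preemptible': False, 'onHostMaintenance': 'MIGRATE'}
    # Assumption: keep TERMINATE as the default for every other family.
    return {'preemptible': False, 'onHostMaintenance': 'TERMINATE'}
```

With this sketch, `scheduling_for('e2-standard-4', use_spot=False)` selects MIGRATE, while the spot variant keeps TERMINATE, which matches the observation that a spot E2 instance could be provisioned.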

We may have to test and document the unsupported machine types.
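
Separately, the T2A traceback shows `ast.literal_eval` raising SyntaxError because the captured string ("wait_ready timeout exceeded.") is not a Python literal. One possible guard is sketched below. `parse_error_dict` is a hypothetical name, and how the caller should treat a `None` result (for example, blocking the zone and retrying elsewhere) is left open.

```python
import ast
from typing import Any, Dict, Optional


def parse_error_dict(exception_str: str) -> Optional[Dict[str, Any]]:
    """Parse a dict-like error string; return None if it is not one."""
    try:
        parsed = ast.literal_eval(exception_str)
    except (ValueError, SyntaxError):
        # Free-form messages such as 'wait_ready timeout exceeded.' land here.
        return None
    # literal_eval also accepts lists, numbers, etc.; only dicts are useful.
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` instead of raising lets the error handler fall back to a generic path rather than crashing the whole launch.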

WoosukKwon • Jul 25 '22 04:07

Nice catch @WoosukKwon! Wdyt about leaving the unsupported VM types out of the catalog for now? This way users can get a nicer "VM type not found/supported" error.

In the future we can also consider adding a CLI that shows all supported VM types by looking into the catalogs.

concretevitamin • Jul 25 '22 04:07

@concretevitamin Sounds good. Then I think we can keep E2 and only remove t2a and f1-micro VMs. WDYT?
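
A filter of this kind could be as small as the sketch below. It is only an illustration, not the code from any PR. `UNSUPPORTED_PREFIXES` and `filter_supported` are hypothetical names, and the sketch assumes the catalog exposes instance type names as plain strings such as 't2a-standard-1'.

```python
from typing import Iterable, List

# Families that failed to provision in the tests above (E2 is kept).
UNSUPPORTED_PREFIXES = ('t2a-', 'f1-micro')


def filter_supported(instance_types: Iterable[str]) -> List[str]:
    """Drop instance types whose name starts with an unsupported prefix."""
    return [
        name for name in instance_types
        if not name.lower().startswith(UNSUPPORTED_PREFIXES)
    ]
```

Because the filtered types never reach the catalog, a request for them would fail at lookup time with a "not found" style error instead of partway through provisioning.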

WoosukKwon • Jul 25 '22 05:07

> @concretevitamin Sounds good. Then I think we can keep E2 and only remove t2a and f1-micro VMs. WDYT?

Sounds good.

concretevitamin • Jul 25 '22 05:07

@concretevitamin Added the filter in PR #1004.

WoosukKwon • Jul 25 '22 06:07