
Better handling of instance capacity errors

Open sarahwooders opened this issue 1 year ago • 1 comments

Can we retry instance requests with a smaller number of instances if the request fails? Currently I get this error when trying to run a transfer:

Error running provision_gateway_instance: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 4): We currently do not have
sufficient m5.8xlarge capacity in the Availability Zone you requested (us-east-1f). Our system will be working on provisioning additional capacity. You can currently get m5.8xlarge capacity by not
specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1d.
Traceback (most recent call last):
  File "/Users/sarahwooders/repos/skyplane/env/bin/skyplane", line 33, in <module>
    sys.exit(load_entry_point('skyplane', 'console_scripts', 'skyplane')())
  File "/Users/sarahwooders/repos/skyplane/env/lib/python3.7/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/Users/sarahwooders/repos/skyplane/env/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/sarahwooders/repos/skyplane/env/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/sarahwooders/repos/skyplane/env/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/sarahwooders/repos/skyplane/env/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/sarahwooders/repos/skyplane/env/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/sarahwooders/repos/skyplane/env/lib/python3.7/site-packages/typer/main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/sarahwooders/repos/skyplane/skyplane/cli/cli.py", line 209, in cp
    args=args,
  File "/Users/sarahwooders/repos/skyplane/skyplane/cli/cli_impl/cp_replicate.py", line 292, in launch_replication_job
    reuse_gateways, use_bbr=use_bbr, use_compression=use_compression, use_e2ee=use_e2ee, use_socket_tls=use_socket_tls
  File "/Users/sarahwooders/repos/skyplane/skyplane/replicate/replicator_client.py", line 212, in provision_gateways
    desc="Provisioning gateway instances",
  File "/Users/sarahwooders/repos/skyplane/skyplane/utils/fn.py", line 57, in do_parallel
    args, result = future.result()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/sarahwooders/repos/skyplane/skyplane/utils/fn.py", line 46, in wrapped_fn
    raise e
  File "/Users/sarahwooders/repos/skyplane/skyplane/utils/fn.py", line 43, in wrapped_fn
    return args, func(args)
  File "/Users/sarahwooders/repos/skyplane/skyplane/replicate/replicator_client.py", line 193, in provision_gateway_instance
    server = self.aws.provision_instance(subregion, self.aws_instance_class)
  File "/Users/sarahwooders/repos/skyplane/skyplane/compute/aws/aws_cloud_provider.py", line 414, in provision_instance
    raise e
  File "/Users/sarahwooders/repos/skyplane/skyplane/compute/aws/aws_cloud_provider.py", line 410, in provision_instance
    instance = start_instance(subnets[current_subnet_id].id)
  File "/Users/sarahwooders/repos/skyplane/skyplane/compute/aws/aws_cloud_provider.py", line 401, in start_instance
    InstanceInitiatedShutdownBehavior="terminate",
  File "/Users/sarahwooders/repos/skyplane/env/lib/python3.7/site-packages/boto3/resources/factory.py", line 580, in do_action
    response = action(self, *args, **kwargs)
  File "/Users/sarahwooders/repos/skyplane/env/lib/python3.7/site-packages/boto3/resources/action.py", line 88, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/Users/sarahwooders/repos/skyplane/env/lib/python3.7/site-packages/botocore/client.py", line 508, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/sarahwooders/repos/skyplane/env/lib/python3.7/site-packages/botocore/client.py", line 911, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 4): We currently do not have sufficient m5.8xlarge capacity in the Availability Zone you requested (us-east-1f). Our system will be working on provisioning additional capacity. You can currently get m5.8xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1d.

This was from running a multipart transfer: skyplane cp s3://sarah-skylark-us-west-1/big_file.txt/big_file.txt s3://sarah-skylark-us-east-1/big_file.txt --multipart
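
To make the request concrete, here is a rough sketch of what "proceed with fewer instances" could look like. It is not Skyplane's code: `provision_best_effort`, `provision_one`, and `min_required` are hypothetical names, and the only thing taken from the traceback is that gateways are provisioned one at a time in parallel. The idea is to attempt every gateway, collect the ones that come up, and only fail if fewer than a minimum number succeed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple, TypeVar

A = TypeVar("A")  # what identifies one provisioning request (e.g. a region name)
R = TypeVar("R")  # whatever a successful provisioning call returns


def provision_best_effort(
    provision_one: Callable[[A], R],
    requests: Sequence[A],
    min_required: int,
) -> Tuple[List[R], List[Tuple[A, Exception]]]:
    """Try every request in parallel and tolerate some failures (hypothetical helper).

    Returns (successes, failures) in submission order. Raises RuntimeError only
    if fewer than `min_required` requests succeeded.
    """
    successes: List[R] = []
    failures: List[Tuple[A, Exception]] = []
    with ThreadPoolExecutor(max_workers=max(1, len(requests))) as pool:
        futures = [(req, pool.submit(provision_one, req)) for req in requests]
        for req, fut in futures:
            try:
                successes.append(fut.result())
            except Exception as exc:  # record the failure instead of aborting the job
                failures.append((req, exc))
    if len(successes) < min_required:
        # A real implementation would also need to tear down the instances
        # that did start before giving up.
        cause = failures[-1][1] if failures else None
        raise RuntimeError(
            f"only {len(successes)} of {len(requests)} instances provisioned; "
            f"need at least {min_required}"
        ) from cause
    return successes, failures
```

With something like this, the transfer could continue on the gateways that did start and log the capacity errors for the rest, rather than failing on the first exception.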

sarahwooders avatar Sep 07 '22 00:09 sarahwooders

This is strange, since we provision a subnet in all regions so that EC2 has maximum flexibility in choosing an AZ to schedule to. There is already code to retry provisioning in another AZ: https://github.com/skyplane-project/skyplane/blob/18e3a6a013c024bbc57676d6f1b5bf79ce709144/skyplane/compute/aws/aws_cloud_provider.py#L422-L437
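
For readers who do not want to open the link, the general shape of an AZ fallback looks roughly like the sketch below. This is an illustration and not the linked Skyplane code: `launch_with_subnet_fallback`, `error_code`, and the `start_instance` callable are hypothetical names. The one thing it relies on from botocore is that a `ClientError` carries the AWS error code at `exc.response["Error"]["Code"]`.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")

# Error codes treated as "try a different AZ" (assumption: capacity errors only).
CAPACITY_ERROR_CODES = {"InsufficientInstanceCapacity"}


def error_code(exc: Exception) -> str:
    """Return the AWS error code of a botocore-style exception, or "" if absent."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code", "")


def launch_with_subnet_fallback(
    start_instance: Callable[[str], T],
    subnet_ids: Sequence[str],
) -> T:
    """Call start_instance(subnet_id) for each subnet until one succeeds.

    Only capacity errors trigger a fallback; any other exception is re-raised
    immediately. If every subnet reports a capacity error, the last one is raised.
    """
    last_error: Optional[Exception] = None
    for subnet_id in subnet_ids:
        try:
            return start_instance(subnet_id)
        except Exception as exc:
            if error_code(exc) not in CAPACITY_ERROR_CODES:
                raise
            last_error = exc  # this AZ is full, move on to the next subnet
    if last_error is None:
        raise ValueError("subnet_ids must not be empty")
    raise last_error
```

Since each subnet lives in exactly one AZ, walking the subnet list is equivalent to walking the AZs. The error in the report would then surface only if every AZ in the list were out of capacity.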

parasj avatar Oct 04 '22 21:10 parasj

Should be resolved by #772

sarahwooders avatar Mar 14 '23 19:03 sarahwooders