
[autoscaler] Improve experience when EC2 does not have capacity for worker nodes

Open jennakwon06 opened this issue 4 years ago • 2 comments

Hello -

After I spun up the cluster with ray up my_cluster.yaml, the Ray cluster wasn't handling my workload well. I ran ray monitor my_cluster.yaml and found that the logs were flooded with the messages below:

==> /tmp/ray/session_latest/logs/monitor.err <==
ssh: connect to host 10.0.80.64 port 22: Connection timed out

==> /tmp/ray/session_latest/logs/monitor.log <==
2021-02-10 05:24:28,068 INFO node_launcher.py:78 -- NodeLauncher1: Got 5 nodes to launch.
2021-02-10 05:24:28,186 INFO node_launcher.py:78 -- NodeLauncher1: Launching 5 nodes, type ray-legacy-worker-node-type.

==> /tmp/ray/session_latest/logs/monitor.out <==
2021-02-10 05:24:27,659 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient r5n.24xlarge capacity in the Availability Zone you requested (us-west-2b). Our system will be working on provisioning additional capacity. You can currently get r5n.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2a, us-west-2c., retrying.
2021-02-10 05:24:27,683 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (RequestLimitExceeded) when calling the RunInstances operation (reached max retries: 0): Request limit exceeded., retrying.
2021-02-10 05:24:27,715 INFO node_provider.py:378 -- Launched 5 nodes [subnet_id=subnet-0180e9267b994bf97]
2021-02-10 05:24:27,715 INFO node_provider.py:397 -- Launched instance i-03fb4297fc5f3f1cd [state=pending, info=pending]
2021-02-10 05:24:27,789 INFO updater.py:273 -- SSH still not available (SSH command failed.), retrying in 5 seconds.

==> /tmp/ray/session_latest/logs/monitor.log <==
2021-02-10 05:24:28,538 ERROR node_launcher.py:72 -- Launch failed
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 370, in _create_node
    created = self.ec2_fail_fast.create_instances(**conf)
  File "/usr/local/lib/python3.7/site-packages/boto3/resources/factory.py", line 520, in do_action
    response = action(self, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 676, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (r5n.24xlarge) is not supported in your requested Availability Zone (us-west-2d). Please retry your request by not specifying an Availability Zone or choosing us-west-2a, us-west-2b, us-west-2c.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/node_launcher.py", line 70, in run
    self._launch_node(config, count, node_type)
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/node_launcher.py", line 60, in _launch_node
    self.provider.create_node(node_config, node_tags, count)
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 311, in create_node
    created_nodes_dict = self._create_node(node_config, tags, count)
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 403, in _create_node
    "Failed to launch instances. Max attempts exceeded.")
  File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 585, in abort
    raise exc_cls("Exiting due to cli_logger.abort()")
click.exceptions.ClickException: Exiting due to cli_logger.abort()
2021-02-10 05:24:28,538 INFO node_launcher.py:78 -- NodeLauncher0: Got 2 nodes to launch.
2021-02-10 05:24:28,774 INFO node_launcher.py:78 -- NodeLauncher0: Launching 2 nodes, type ray-legacy-worker-node-type.

==> /tmp/ray/session_latest/logs/monitor.out <==
2021-02-10 05:24:28,538 PANIC node_provider.py:403 -- Failed to launch instances. Max attempts exceeded.
2021-02-10 05:24:29,000 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient r5n.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get r5n.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c., retrying.
2021-02-10 05:24:29,023 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (RequestLimitExceeded) when calling the RunInstances operation (reached max retries: 0): Request limit exceeded., retrying.

So it looks like the requested instance type isn't currently available from EC2, and Ray isn't able to spin up the desired worker nodes. This means I need to run ray down, modify my_cluster.yaml, then retry with a different instance type.
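
To see at a glance which EC2 error codes dominate a log like the one above, the lines can be tallied by code. This is only a minimal sketch and not part of Ray: count_error_codes is a hypothetical helper, and it assumes the message format "An error occurred (Code) when calling the Operation operation" that appears in these logs.

```python
import re
from collections import Counter

# Matches the error text seen in the monitor logs, e.g.
# "An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation"
_ERROR_RE = re.compile(r"An error occurred \((\w+)\) when calling the (\w+) operation")


def count_error_codes(lines):
    """Count how often each error code appears in an iterable of log lines.

    Hypothetical helper for illustration; lines without a matching
    error message are ignored.
    """
    counts = Counter()
    for line in lines:
        match = _ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Run over the excerpt above, this would pick up InsufficientInstanceCapacity, RequestLimitExceeded, and Unsupported, which makes it easier to tell a capacity shortage apart from API throttling or an unsupported Availability Zone.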

I was wondering if we can improve this experience. Perhaps check EC2 instance capacity for at least the minimum number of workers before telling the user that the cluster is launched? Or perhaps let the user specify a list of instance types that they're OK with?
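
As a rough illustration of a pre-launch check, the sketch below asks EC2 which Availability Zones offer a given instance type, using the DescribeInstanceTypeOfferings API through a boto3 client. It is an assumption about how such a check could look, not existing Ray behavior: zones_offering is a hypothetical helper, and the call only reports where the type is offered, not whether capacity is available right now. It would catch the Unsupported error for us-west-2d, but not InsufficientInstanceCapacity.

```python
def zones_offering(ec2_client, instance_type):
    """Return the sorted Availability Zones that offer `instance_type`.

    Hypothetical helper. `ec2_client` is assumed to be a boto3 EC2 client,
    e.g. boto3.client("ec2", region_name="us-west-2").
    """
    zones = set()
    kwargs = {
        "LocationType": "availability-zone",
        "Filters": [{"Name": "instance-type", "Values": [instance_type]}],
    }
    while True:
        page = ec2_client.describe_instance_type_offerings(**kwargs)
        for offering in page.get("InstanceTypeOfferings", []):
            zones.add(offering["Location"])
        # Follow pagination until EC2 stops returning a NextToken.
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return sorted(zones)
```

Comparing the returned zones against the zones of the subnets configured in my_cluster.yaml before launching would at least flag zones where the instance type can never start.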

jennakwon06 avatar Feb 10 '21 05:02 jennakwon06

An example showing how to specify multiple candidate instance types can be seen at: https://github.com/amzn/amazon-ray/blob/main/python/ray/autoscaler/aws/example-multi-node-type.yaml

Out of curiosity, did you specify any quantity for min_workers in your autoscaler config? I think the existing behavior is probably OK in the event that Ray is unable to spin your cluster up to the desired max_workers capacity, but I would agree that it's misleading to tell a user that their cluster has been successfully launched after running ray up if we haven't yet started the provisioning process for the requested count of min_workers.

pdames avatar Feb 10 '21 07:02 pdames

Yes, my min_workers was 34 and max_workers was 34 (I had set it like that for debugging purposes). I was perplexed when I couldn't see any workers on the EC2 console!

jennakwon06 avatar Feb 11 '21 21:02 jennakwon06