amazon-ray
[autoscaler] Improve experience when EC2 does not have capacity for worker nodes
Hello -
After I spun up the cluster with ray up my_cluster.yaml, my workload wasn't being handled well by the Ray cluster. I ran ray monitor my_cluster.yaml and found that the logs were flooded with the messages below:
==> /tmp/ray/session_latest/logs/monitor.err <==
ssh: connect to host 10.0.80.64 port 22: Connection timed out
==> /tmp/ray/session_latest/logs/monitor.log <==
2021-02-10 05:24:28,068 INFO node_launcher.py:78 -- NodeLauncher1: Got 5 nodes to launch.
2021-02-10 05:24:28,186 INFO node_launcher.py:78 -- NodeLauncher1: Launching 5 nodes, type ray-legacy-worker-node-type.
==> /tmp/ray/session_latest/logs/monitor.out <==
2021-02-10 05:24:27,659 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient r5n.24xlarge capacity in the Availability Zone you requested (us-west-2b). Our system will be working on provisioning additional capacity. You can currently get r5n.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2a, us-west-2c., retrying.
2021-02-10 05:24:27,683 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (RequestLimitExceeded) when calling the RunInstances operation (reached max retries: 0): Request limit exceeded., retrying.
2021-02-10 05:24:27,715 INFO node_provider.py:378 -- Launched 5 nodes [subnet_id=subnet-0180e9267b994bf97]
2021-02-10 05:24:27,715 INFO node_provider.py:397 -- Launched instance i-03fb4297fc5f3f1cd [state=pending, info=pending]
2021-02-10 05:24:27,789 INFO updater.py:273 -- SSH still not available (SSH command failed.), retrying in 5 seconds.
==> /tmp/ray/session_latest/logs/monitor.log <==
2021-02-10 05:24:28,538 ERROR node_launcher.py:72 -- Launch failed
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 370, in _create_node
created = self.ec2_fail_fast.create_instances(**conf)
File "/usr/local/lib/python3.7/site-packages/boto3/resources/factory.py", line 520, in do_action
response = action(self, *args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/boto3/resources/action.py", line 83, in __call__
response = getattr(parent.meta.client, operation_name)(*args, **params)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 676, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (r5n.24xlarge) is not supported in your requested Availability Zone (us-west-2d). Please retry your request by not specifying an Availability Zone or choosing us-west-2a, us-west-2b, us-west-2c.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/node_launcher.py", line 70, in run
self._launch_node(config, count, node_type)
File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/node_launcher.py", line 60, in _launch_node
self.provider.create_node(node_config, node_tags, count)
File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 311, in create_node
created_nodes_dict = self._create_node(node_config, tags, count)
File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/aws/node_provider.py", line 403, in _create_node
"Failed to launch instances. Max attempts exceeded.")
File "/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 585, in abort
raise exc_cls("Exiting due to cli_logger.abort()")
click.exceptions.ClickException: Exiting due to cli_logger.abort()
2021-02-10 05:24:28,538 INFO node_launcher.py:78 -- NodeLauncher0: Got 2 nodes to launch.
2021-02-10 05:24:28,774 INFO node_launcher.py:78 -- NodeLauncher0: Launching 2 nodes, type ray-legacy-worker-node-type.
==> /tmp/ray/session_latest/logs/monitor.out <==
2021-02-10 05:24:28,538 PANIC node_provider.py:403 -- Failed to launch instances. Max attempts exceeded.
2021-02-10 05:24:29,000 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient r5n.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get r5n.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c., retrying.
2021-02-10 05:24:29,023 INFO node_provider.py:408 -- create_instances: Attempt failed with An error occurred (RequestLimitExceeded) when calling the RunInstances operation (reached max retries: 0): Request limit exceeded., retrying.
So it looks like EC2 doesn't have capacity for the requested instance type, and Ray isn't able to spin up the desired worker nodes. This means I need to run ray down, modify my_cluster.yaml, then retry with a different instance type.
I was wondering if we can improve this experience. Perhaps check the EC2 instance capacity for at least the minimum number of workers before telling the user that the cluster is launched? Or perhaps let the user specify a list of instance types that they're OK with?
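To make the second suggestion concrete, here is a minimal sketch of falling back across candidate instance types and Availability Zones. It is not how the Ray autoscaler works today: LaunchError, launch_with_fallback, and the launch callable are hypothetical names invented for illustration, and only the error codes (InsufficientInstanceCapacity, Unsupported, RequestLimitExceeded) come from the logs above.

```python
from itertools import product


class LaunchError(Exception):
    """Hypothetical stand-in for a provider error that carries an EC2 error code."""

    def __init__(self, code, message=""):
        super().__init__(f"{code}: {message}")
        self.code = code


# Error codes seen in the logs that suggest trying a different type or zone.
FALLBACK_CODES = {"InsufficientInstanceCapacity", "Unsupported"}


def launch_with_fallback(launch, instance_types, zones, count):
    """Try each (instance type, zone) pair in order until one launch succeeds.

    `launch` is a hypothetical callable(instance_type, zone, count) that
    returns a list of instance IDs or raises LaunchError.
    """
    failures = []
    # product() is type-major: every zone is tried for the preferred type
    # before moving on to the next candidate type.
    for instance_type, zone in product(instance_types, zones):
        try:
            ids = launch(instance_type, zone, count)
        except LaunchError as err:
            if err.code not in FALLBACK_CODES:
                raise  # e.g. RequestLimitExceeded: back off instead of switching
            failures.append((instance_type, zone, err.code))
            continue
        return instance_type, zone, ids
    raise RuntimeError(f"No capacity for any candidate: {failures}")
```

A caller would list its preferred type first, for example ["r5n.24xlarge", "r5.24xlarge"], so that the fallback type is only used once every zone has rejected the preferred one.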
An example showing how to specify multiple candidate instance types can be seen at: https://github.com/amzn/amazon-ray/blob/main/python/ray/autoscaler/aws/example-multi-node-type.yaml
Out of curiosity, did you specify any quantity for min_workers in your autoscaler config? I think the existing behavior is probably OK in the event that Ray is unable to scale your cluster up to the desired max_workers capacity, but I would agree that it's misleading to tell a user that their cluster has been successfully launched after running ray up if we haven't yet started the provisioning process for the requested count of min_workers.
Yes, my min_workers was 34, and max_workers was also 34 (I had set it that way for debugging purposes). I was perplexed when I couldn't see any workers in the EC2 console!
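One way to picture the check discussed above is a small polling loop that only reports success once at least min_workers nodes are running. This is a sketch under assumptions, not existing ray up behavior: wait_for_min_workers and the count_running callable are hypothetical, and the clock and sleep parameters exist only so the loop can be exercised without real waiting.

```python
import time


def wait_for_min_workers(count_running, min_workers, timeout_s=600.0,
                         poll_s=5.0, clock=time.monotonic, sleep=time.sleep):
    """Return True once count_running() >= min_workers, False on timeout.

    `count_running` is a hypothetical callable that returns the number of
    worker nodes currently running.
    """
    deadline = clock() + timeout_s
    while True:
        if count_running() >= min_workers:
            return True
        if clock() >= deadline:
            return False  # caller can warn instead of claiming success
        sleep(poll_s)
```

With min_workers set to 34 as in this report, a False result would let the CLI say that the head node is up but the workers are not, rather than reporting that the cluster launched.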