
Incorrect SKY_NODE_RANK on nodes

Open iojw opened this issue 2 years ago • 5 comments

I would like to run a command only on the head node when the SkyPilot cluster launches. However, the SKY_NODE_RANK environment variable does not have the expected value of 0 on the head node on AWS.

Here is an example config that reproduces this:

resources:
  accelerators: T4:1
  cloud: aws
  disk_size: 1024

workdir: .

num_nodes: 4

run: |
  echo $SKY_NODE_RANK

After launching this cluster and matching each printed rank to the node's IP address, you can observe that SKY_NODE_RANK is printed as 0 on a worker node rather than on the head node.

Use case: I would like to run distributed training using ludwig and ray. Since ludwig uses ray for the distributed training, the ludwig train command has to be executed on the ray head node.
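For illustration, the head-only check this use case relies on could look roughly like the sketch below. It assumes SKY_NODE_RANK is set to the string "0" on the head node, which is the intended behavior rather than what this issue observes; `is_head_node` is a hypothetical helper, not part of SkyPilot.

```python
import os
from typing import Mapping, Optional


def is_head_node(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when SKY_NODE_RANK identifies this node as rank 0.

    Assumption: the head node is the one with SKY_NODE_RANK == "0".
    A missing variable is treated as "not the head node".
    """
    env = os.environ if environ is None else environ
    return env.get("SKY_NODE_RANK") == "0"


if __name__ == "__main__":
    if is_head_node():
        print("head node: start the training command here")
    else:
        print("worker node: nothing to launch")
```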

iojw avatar Oct 06 '22 17:10 iojw

Thanks @iojw - do you have time to look into this? May be a good chance to dig into the backend.

concretevitamin avatar Oct 06 '22 18:10 concretevitamin

@concretevitamin Yep! I can take this. Just to confirm, what's the expected value of SKY_NODE_RANK for all nodes? Is it based on some sort of ordering, or is it simply non-zero for non-head nodes?

iojw avatar Oct 06 '22 20:10 iojw

It should be

head: 0
worker1: 1
worker2: 2
...

etc., in a stable fashion. The stability part may be tricky -- @Michaelvll can comment more on the current treatment -- e.g., we can sort by IPs, but the IPs may change after stopping and restarting.

concretevitamin avatar Oct 06 '22 20:10 concretevitamin

Yes, I think keeping the order the same as mentioned above is a good idea. If the user requests a subset of the nodes, we can still keep the same order, i.e.:

head: 0
worker2: 1

or

worker1: 0
worker2: 1

Our current code path for reserving the nodes is in https://github.com/skypilot-org/skypilot/blob/f9b530038f2c607b0062f2d203d95e5da878dad9/sky/backends/cloud_vm_ray_backend.py#L246 The Ray placement group randomly allocates idle resources to the bundle. I think what we can do in the local code is get the list of nodes (which should already be done by backend_utils.get_node_ips when submitting the task), pass that list of IPs to the Ray code generation, and add some logic that maps the IPs obtained by the following code to the correct rank according to the list: https://github.com/skypilot-org/skypilot/blob/f9b530038f2c607b0062f2d203d95e5da878dad9/sky/backends/cloud_vm_ray_backend.py#L269-L280
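A minimal sketch of that mapping logic, under stated assumptions: `assign_ranks` and its parameter names are hypothetical (not SkyPilot functions), the cluster list is ordered with the head node first, and both lists use the same kind of IP. It also covers the subset case, since ranks follow the cluster order among only the allocated nodes.

```python
from typing import Dict, List


def assign_ranks(cluster_ips: List[str], allocated_ips: List[str]) -> Dict[str, int]:
    """Map each allocated IP to a rank that follows the cluster's node order.

    cluster_ips: all node IPs in a stable order, head node first (assumption).
    allocated_ips: IPs returned by the placement group, in arbitrary order.
    """
    position = {ip: index for index, ip in enumerate(cluster_ips)}
    unknown = [ip for ip in allocated_ips if ip not in position]
    if unknown:
        raise ValueError(f"IPs not found in the cluster list: {unknown}")
    # Sort the allocated IPs by their position in the cluster list, then
    # number them consecutively so ranks are dense even for a subset.
    ordered = sorted(allocated_ips, key=position.__getitem__)
    return {ip: rank for rank, ip in enumerate(ordered)}


# Example IPs are made up for illustration.
ranks = assign_ranks(
    cluster_ips=["10.0.0.1", "10.0.0.2", "10.0.0.3"],  # head is 10.0.0.1
    allocated_ips=["10.0.0.3", "10.0.0.1"],            # placement group order
)
```

Here the head node gets rank 0 and the remaining worker gets rank 1, regardless of the order in which the placement group listed them.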

Michaelvll avatar Oct 06 '22 20:10 Michaelvll

It is possible that, because the application is using Ray, it is messing up the order of SKY_NODE_RANK. Perhaps try & replicate it on an empty cluster first?

michaelzhiluo avatar Oct 07 '22 18:10 michaelzhiluo

@michaelzhiluo The simple config I listed in the post is capable of reproducing this issue!

I believe the issue is that we assign ranks based on the list of ips in the placement group returned by the ray.get() call in the following code, but the head node is not always the first node in this list. https://github.com/skypilot-org/skypilot/blob/f9b530038f2c607b0062f2d203d95e5da878dad9/sky/backends/cloud_vm_ray_backend.py#L269-L280

@Michaelvll backend_utils.get_node_ips returns the public ips of the nodes, but the generated ray code uses the private ips of the different nodes instead so we cannot map it directly. Do you have any idea on how we might be able to go from public -> private ips or the other way around?

iojw avatar Oct 19 '22 21:10 iojw

Some ideas to get public/private IPs:

  • See if clouds' CLIs provide such a tool. Maybe https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-addresses.html?
  • Log in to each node and run hostname -I
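As a rough sketch of the first idea, the function below builds a public-to-private IP map from a response shaped like the output of EC2's DescribeInstances call (`Reservations` containing `Instances` with `PublicIpAddress` and `PrivateIpAddress`). Note that this uses describe-instances rather than the describe-addresses page linked above, and that `public_to_private_ips` is a hypothetical helper. It takes the response as a plain dict, so how the response is fetched is left open.

```python
from typing import Any, Dict


def public_to_private_ips(response: Dict[str, Any]) -> Dict[str, str]:
    """Build a public IP -> private IP map from a DescribeInstances-style dict.

    Instances that lack either address (for example, stopped instances with
    no public IP) are skipped.
    """
    mapping: Dict[str, str] = {}
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            public_ip = instance.get("PublicIpAddress")
            private_ip = instance.get("PrivateIpAddress")
            if public_ip and private_ip:
                mapping[public_ip] = private_ip
    return mapping


# Made-up sample data in the assumed response shape.
sample_response = {
    "Reservations": [
        {"Instances": [
            {"PublicIpAddress": "3.0.0.1", "PrivateIpAddress": "172.31.0.1"},
            {"PrivateIpAddress": "172.31.0.2"},  # no public IP: skipped
        ]},
        {"Instances": [
            {"PublicIpAddress": "3.0.0.3", "PrivateIpAddress": "172.31.0.3"},
        ]},
    ]
}
ip_map = public_to_private_ips(sample_response)
```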

concretevitamin avatar Oct 21 '22 04:10 concretevitamin

Another way to get internal ips with our current get_node_ips is to create a temporary cluster yaml from the existing one, by adding use_internal_ips: true under the provider section. (reference)

That is to say, we can take the following ray_config, set ray_config['provider']['use_internal_ips'] = True, and write it to a temporary file tmp_cluster.yaml. https://github.com/skypilot-org/skypilot/blob/f8ae4a10fe45df7a8d9b8bbaa45443a6ddf49a45/sky/backends/backend_utils.py#L1135 With that new yaml file, if we call the following line, I believe it should return the internal ips. https://github.com/skypilot-org/skypilot/blob/f8ae4a10fe45df7a8d9b8bbaa45443a6ddf49a45/sky/backends/backend_utils.py#L1163-L1164

We can add an argument get_internal_ips: bool to the get_node_ips function if needed.
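The temporary-config step could be sketched as below. `write_internal_ip_config` is a hypothetical helper, and the config is assumed to be an already-loaded dict with a `provider` section. To stay dependency-free, the sketch serializes with `json`, on the assumption that the YAML loader reading the file accepts JSON syntax; the real change would more likely use a YAML library.

```python
import copy
import json
import tempfile
from typing import Any, Dict


def write_internal_ip_config(ray_config: Dict[str, Any]) -> str:
    """Write a copy of ray_config with use_internal_ips enabled; return its path.

    The original dict is left untouched so the existing cluster yaml is
    not affected.
    """
    config = copy.deepcopy(ray_config)
    config.setdefault("provider", {})["use_internal_ips"] = True
    # delete=False keeps the file on disk so another call can read it later;
    # the caller is responsible for removing it.
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        json.dump(config, f, indent=2)
    return f.name


# Made-up minimal config for illustration.
original = {"cluster_name": "demo", "provider": {"type": "aws"}}
tmp_path = write_internal_ip_config(original)
```

The returned path would then be passed wherever the cluster yaml path is currently used to look up node IPs.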

Michaelvll avatar Oct 21 '22 06:10 Michaelvll

Closed by #1291.

iojw avatar Nov 04 '22 09:11 iojw