skypilot
Incorrect SKY_NODE_RANK on nodes
I would like to run a command only on the head node when the SkyPilot cluster launches. However, the `SKY_NODE_RANK` environment variable does not have the expected value of 0 on the head node on AWS.
Here is an example config that reproduces this:
```yaml
resources:
  accelerators: T4:1
  cloud: aws
  disk_size: 1024

workdir: .

num_nodes: 4

run: |
  echo $SKY_NODE_RANK
```
Launching this cluster, you can observe (by checking the nodes' IP addresses) that `SKY_NODE_RANK` is printed as 0 on a worker node rather than on the head node.
Use case: I would like to run distributed training using Ludwig and Ray. Since Ludwig uses Ray for the distributed training, the `ludwig train` command has to be executed on the Ray head node.
Thanks @iojw - do you have time to look into this? May be a good chance to dig into the backend.
@concretevitamin Yep! I can take this. Just to confirm, what's the expected value of `SKY_NODE_RANK` for all nodes? Is it based on some sort of ordering, or is it simply non-zero for non-head nodes?
It should be
head: 0
worker1: 1
worker2: 2
...
etc. in a stable fashion. The stableness part may be tricky -- @Michaelvll can comment more on the current treatment -- e.g., we can sort by IPs, but after stopping + restarting the IPs may change.
Yes, I think keeping the order the same as mentioned above is a good idea. If the user requests a subset of the nodes, we can still keep the same order, i.e.:
head: 0
worker2: 1
or
worker1: 0
worker2: 1
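A minimal sketch of this stable-ordering idea (the function name and data shapes here are hypothetical, not SkyPilot's actual code): keep the full node list in its fixed order with the head first, and re-number only the nodes actually in use.

```python
# Hypothetical sketch of stable rank assignment: `cluster_ips` is the
# full node list in its fixed order (head first); `active_ips` is the
# subset of nodes actually reserved for the current task.
def assign_ranks(cluster_ips, active_ips):
    active = set(active_ips)
    # Preserve the original ordering, then re-number densely from 0.
    ordered = [ip for ip in cluster_ips if ip in active]
    return {ip: rank for rank, ip in enumerate(ordered)}

cluster = ['head', 'worker1', 'worker2']
print(assign_ranks(cluster, ['head', 'worker2']))     # {'head': 0, 'worker2': 1}
print(assign_ranks(cluster, ['worker1', 'worker2']))  # {'worker1': 0, 'worker2': 1}
```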
Our current codepath for reserving the nodes is in https://github.com/skypilot-org/skypilot/blob/f9b530038f2c607b0062f2d203d95e5da878dad9/sky/backends/cloud_vm_ray_backend.py#L246
Ray placement groups will randomly allocate idle resources to the bundles. I think what we can do in the local code is to get the list of node IPs (which should already be done by `backend_utils.get_node_ips` when submitting the task), pass that list of IPs to the Ray code generation, and add some logic to map the IPs obtained by the following code to the correct rank according to that list.
https://github.com/skypilot-org/skypilot/blob/f9b530038f2c607b0062f2d203d95e5da878dad9/sky/backends/cloud_vm_ray_backend.py#L269-L280
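To make that mapping step concrete, here is a hedged sketch (the identifiers are invented for illustration): translate whatever order the placement group hands back into ranks, using the authoritative ordered IP list passed down from the local code.

```python
# Hypothetical sketch: `ordered_ips` is the list passed down from the
# local code (head first); `pg_ips` is the same set of IPs in whatever
# order the Ray placement group returned them.
def rank_of_each_pg_node(ordered_ips, pg_ips):
    rank = {ip: r for r, ip in enumerate(ordered_ips)}
    return [rank[ip] for ip in pg_ips]

ordered = ['10.0.0.1', '10.0.0.2', '10.0.0.3']  # head, worker1, worker2
pg = ['10.0.0.3', '10.0.0.1', '10.0.0.2']       # arbitrary PG order
print(rank_of_each_pg_node(ordered, pg))  # [2, 0, 1]
```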
It is possible that, because the application is using Ray, it is messing up the order of `SKY_NODE_RANK`. Perhaps try to replicate it on an empty cluster first?
@michaelzhiluo The simple config I listed in the post is capable of reproducing this issue!
I believe the issue is that we assign ranks based on the list of IPs in the placement group returned by the `ray.get()` call in the following code, but the head node is not always the first node in this list.
https://github.com/skypilot-org/skypilot/blob/f9b530038f2c607b0062f2d203d95e5da878dad9/sky/backends/cloud_vm_ray_backend.py#L269-L280
@Michaelvll `backend_utils.get_node_ips` returns the public IPs of the nodes, but the generated Ray code uses the private IPs of the nodes instead, so we cannot map them directly. Do you have any ideas on how we might go from public -> private IPs, or the other way around?
Some ideas to get public/private IPs:
- See if clouds' CLIs provide such a tool. Maybe https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-addresses.html?
- Log in to each node and run `hostname -I`
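As a hedged illustration of the second idea (this is not SkyPilot code, and it assumes the first private-range address reported by `hostname -I` is the node's primary internal IP): collect the command's output from each node, keyed by the public IP used to SSH in, then recover a public -> private mapping.

```python
# Hypothetical sketch: map each public IP to the node's internal IP by
# parsing the `hostname -I` output gathered from that node.
import ipaddress

def public_to_private(hostname_i_outputs):
    mapping = {}
    for public_ip, output in hostname_i_outputs.items():
        for token in output.split():
            # Take the first RFC 1918 (private-range) address reported.
            if ipaddress.ip_address(token).is_private:
                mapping[public_ip] = token
                break
    return mapping

# Hypothetical node output: a VPC address plus a second interface.
print(public_to_private({'54.185.10.7': '172.31.5.10 172.17.0.1'}))
# {'54.185.10.7': '172.31.5.10'}
```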
Another way to get internal IPs with our current `get_node_ips` is to create a temporary cluster YAML from the existing one, by adding `use_internal_ips: true` under the `provider` section (reference).
That is to say, we can take the following `ray_config`, set `ray_config['provider']['use_internal_ips'] = True`, and write it to a tempfile `tmp_cluster.yaml`.
https://github.com/skypilot-org/skypilot/blob/f8ae4a10fe45df7a8d9b8bbaa45443a6ddf49a45/sky/backends/backend_utils.py#L1135
With that new YAML file, if we call the following line, I believe it should return the internal IPs.
https://github.com/skypilot-org/skypilot/blob/f8ae4a10fe45df7a8d9b8bbaa45443a6ddf49a45/sky/backends/backend_utils.py#L1163-L1164
We can add an argument `get_internal_ips: bool` to the `get_node_ips` function if needed.
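A minimal sketch of that config tweak (the helper name is made up; the tempfile step is only described in a comment, since how the temp file is managed in practice is an assumption):

```python
# Hypothetical sketch: copy the parsed ray cluster config and flip the
# provider flag, leaving the original config untouched. The resulting
# dict would then be dumped to a tempfile (e.g. tmp_cluster.yaml) and
# passed to the IP-query path in place of the original cluster yaml.
import copy

def with_internal_ips(ray_config):
    cfg = copy.deepcopy(ray_config)
    cfg.setdefault('provider', {})['use_internal_ips'] = True
    return cfg

original = {'cluster_name': 'sky-demo', 'provider': {'type': 'aws'}}
patched = with_internal_ips(original)
print(patched['provider'])  # {'type': 'aws', 'use_internal_ips': True}
print('use_internal_ips' in original['provider'])  # False
```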
Closed by #1291.