tmp directory path issue between Windows client and Linux Ray cluster head

Open Kuurusch opened this issue 1 year ago • 3 comments

What happened + What you expected to happen

When I start Ray on the client from Python with ray.init(address=REMOTE_ADDRESS, ignore_reinit_error=True), I get the following error:

2024-04-25 14:29:30,423	INFO worker.py:1567 -- Connecting to existing Ray cluster at address: <numbercruncher-IP>:6379...
2024-04-25 14:29:31,463	INFO node.py:1010 -- Can't find a `node_ip_address.json` file from /tmp/ray\session_2024-04-25_14-29-35_864002_909423. Have you started Ray instsance using `ray start` or `ray.init`?

It looks to me as if the problem is in the tmp path, which mixes Linux-style (/) and Windows-style (\) separators.

Versions / Dependencies

Both the client and the cluster head run Ray 2.10.0.

Reproduction script

I started the cluster head with:

ray start --head --port=6379

On the client, I then ran a Python script containing the following lines:

import ray

ray.init(address=REMOTE_ADDRESS, ignore_reinit_error=True)

Issue Severity

High: It blocks me from completing my task.

Kuurusch, Apr 27 '24 07:04

@Kuurusch Instead of Ray Client, can you use Ray job submission (https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html)? Ray Client is not the recommended way to run Ray applications.
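
For reference, a minimal sketch of what job submission from the client machine could look like with the Python SDK. It assumes the head node's dashboard is reachable from the client on the default port 8265; the helper names (dashboard_address, submit) and the script name my_script.py are placeholders, not anything from this issue.

```python
def dashboard_address(host: str, port: int = 8265) -> str:
    """Build the HTTP address of the Ray dashboard (assumed default port 8265)."""
    return f"http://{host}:{port}"


def submit(host: str, entrypoint: str = "python my_script.py") -> str:
    """Submit a job to the cluster and return its job ID.

    my_script.py is a hypothetical script in the current directory.
    """
    # Imported lazily so dashboard_address() works without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_address(host))
    # working_dir uploads the current directory so the entrypoint can find the script.
    return client.submit_job(
        entrypoint=entrypoint,
        runtime_env={"working_dir": "./"},
    )
```

The client machine only talks HTTP to the dashboard here, so it does not need to join the cluster as a node.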

jjyao, May 06 '24 21:05

Same issue here: I run a Ray cluster on vm1 and vm2, and connecting to that cluster from vm3 fails.

2024-05-22 16:19:21,977	INFO worker.py:1432 -- Using address 172.16.3.61:6379 set in the environment variable RAY_ADDRESS
2024-05-22 16:19:21,977	INFO worker.py:1567 -- Connecting to existing Ray cluster at address: 172.16.3.61:6379...
2024-05-22 16:19:22,982	INFO node.py:1010 -- Can't find a `node_ip_address.json` file from /tmp/ray/session_2024-05-22_15-18-28_773050_1419. Have you started Ray instsance using `ray start` or `ray.init`?

However, the same command succeeds when run on vm1 or vm2.

lixd, May 22 '24 08:05

I've created a PR, https://github.com/ray-project/ray/pull/45930, to fix the malformed path problem (/tmp/ray\session_2024-04-25_14-29-35_864002_909423). With this PR, a Windows worker node can successfully join a cluster with a Linux head and run jobs on it.
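
To illustrate how a mixed path like this can come about in general, here is a small sketch using only the standard library. It is not taken from Ray's code or from the PR; it only shows that joining a Linux-style directory with Windows path rules inserts a backslash, while POSIX rules keep forward slashes.

```python
import ntpath      # Windows path rules (what os.path resolves to on Windows)
import posixpath   # POSIX path rules (what os.path resolves to on Linux)

# Values copied from the log line in this issue.
base = "/tmp/ray"
session = "session_2024-04-25_14-29-35_864002_909423"

# Joining with Windows rules uses "\" as the separator,
# even though the base directory came from a Linux machine.
windows_joined = ntpath.join(base, session)

# Joining with POSIX rules keeps the path consistent for a Linux head node.
posix_joined = posixpath.join(base, session)

print(windows_joined)
print(posix_joined)
```

Any code that receives a path from a Linux head and extends it on a Windows client with os.path.join would produce the first form.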

As for the "Can't find a node_ip_address.json" problem, it is caused by incorrect usage in this issue. ray.init(address=REMOTE_ADDRESS, ignore_reinit_error=True) only works when the current machine has already joined the cluster. If it has not joined, node_ip_address.json does not exist, which results in this error.

If you do not want your Windows machine to join the Ray cluster as a node, use the Ray Client form of the address, i.e., ray://172.16.3.61:10001 instead of 172.16.3.61:6379. Note that port 10001 is the Ray Client server port, whereas 6379 is the port the head node was started with.
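
A minimal sketch of that change on the client side, assuming the Ray Client server on the head is listening on port 10001. The helper names ray_client_address and connect are made up for this example.

```python
def ray_client_address(host: str, port: int = 10001) -> str:
    """Build a Ray Client address such as ray://172.16.3.61:10001."""
    return f"ray://{host}:{port}"


def connect(host: str):
    """Connect to a remote cluster through Ray Client."""
    # Imported lazily so ray_client_address() works without Ray installed.
    import ray

    # The ray:// scheme tells ray.init to go through the Ray Client server
    # rather than expecting this machine to be a node of the cluster.
    return ray.init(address=ray_client_address(host))
```

Calling connect("172.16.3.61") from vm3 or from the Windows client would then use ray://172.16.3.61:10001 rather than the 6379 address from the original script.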

Vigilans, Jun 13 '24 10:06