tmp directory path issue between Windows client and Linux Ray cluster head
What happened + What you expected to happen
When I start Ray on the client in Python with ray.init(address=REMOTE_ADDRESS, ignore_reinit_error=True), I get the following error:
2024-04-25 14:29:30,423 INFO worker.py:1567 -- Connecting to existing Ray cluster at address: <numbercruncher-IP>:6379...
2024-04-25 14:29:31,463 INFO node.py:1010 -- Can't find a `node_ip_address.json` file from /tmp/ray\session_2024-04-25_14-29-35_864002_909423. Have you started Ray instsance using `ray start` or `ray.init`?
It looks to me like the temp directory path is malformed: it mixes Linux-style (/) and Windows-style (\) separators.
Versions / Dependencies
Client and Cluster Head have Ray 2.10.0
Reproduction script
I've started the cluster-head with:
ray start --head --port=6379
Then I ran a Python script with the following lines on the client:
import ray
ray.init(address=REMOTE_ADDRESS, ignore_reinit_error=True)
Issue Severity
High: It blocks me from completing my task.
@Kuurusch Instead of Ray Client, can you use Ray job submission (https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html)? Ray Client is not the recommended way to run Ray applications.
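For reference, a minimal sketch of what job submission could look like from the client machine, using the Ray Jobs Python SDK. It assumes the head node's dashboard is reachable from the client on the default port 8265 (which may require starting the head with --dashboard-host=0.0.0.0); the helper names, the entrypoint script name, and the working directory are placeholders, not something from this issue.

```python
def dashboard_url(head_ip: str, dashboard_port: int = 8265) -> str:
    """Build the HTTP address of the Ray dashboard, which serves the Jobs API.

    8265 is Ray's default dashboard port; adjust it if the head uses another one.
    """
    return f"http://{head_ip}:{dashboard_port}"


def submit(head_ip: str, entrypoint: str = "python my_script.py") -> str:
    """Submit `entrypoint` as a Ray job and return the job ID.

    `my_script.py` and the working directory are hypothetical placeholders.
    """
    # Imported lazily so the helper above can be used without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(dashboard_url(head_ip))
    return client.submit_job(
        entrypoint=entrypoint,
        # Upload the current directory so the cluster can see the script.
        runtime_env={"working_dir": "./"},
    )
```

With this approach the script runs on the cluster itself, so the client never needs the session directory under /tmp/ray.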
Same issue here: I run the Ray cluster on vm1 and vm2, and connecting to it from vm3 fails:
2024-05-22 16:19:21,977 INFO worker.py:1432 -- Using address 172.16.3.61:6379 set in the environment variable RAY_ADDRESS
2024-05-22 16:19:21,977 INFO worker.py:1567 -- Connecting to existing Ray cluster at address: 172.16.3.61:6379...
2024-05-22 16:19:22,982 INFO node.py:1010 -- Can't find a `node_ip_address.json` file from /tmp/ray/session_2024-05-22_15-18-28_773050_1419. Have you started Ray instsance using `ray start` or `ray.init`?
The same command succeeds when run on vm1 or vm2.
I've created a PR at https://github.com/ray-project/ray/pull/45930 to fix the malformed path problem (/tmp/ray\session_2024-04-25_14-29-35_864002_909423). With this PR, a Windows worker node can successfully join a cluster with a Linux head and run jobs on it.
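To illustrate how a path like this can come about, here is a small sketch using only the Python standard library. It is an assumption about the mechanism, not a copy of the code changed in the PR: joining a Linux directory with Windows path rules (which is what os.path.join does on Windows) inserts a backslash, while POSIX rules keep the path well-formed for the Linux head.

```python
import ntpath      # Windows path rules; os.path resolves to this on Windows
import posixpath   # POSIX path rules; os.path resolves to this on Linux

# Temp directory as reported by the Linux head, plus the session name from the log.
linux_temp_dir = "/tmp/ray"
session_name = "session_2024-04-25_14-29-35_864002_909423"

# Joining with Windows rules adds a backslash separator, giving the mixed path.
windows_style = ntpath.join(linux_temp_dir, session_name)

# Joining with POSIX rules keeps forward slashes throughout.
posix_style = posixpath.join(linux_temp_dir, session_name)

print(windows_style)
print(posix_style)
```

The first printed path matches the malformed one in the log above; the second is the form the Linux side expects.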
As for the Can't find a `node_ip_address.json` error, it occurs because the usage shown in this issue is incorrect. ray.init(address=REMOTE_ADDRESS, ignore_reinit_error=True) only works when the current machine has already joined the cluster. If it has not joined, node_ip_address.json does not exist on that machine, which results in this error.
If you do not want your Windows machine to join the Ray cluster as a node, use the Ray Client form of the address, i.e., ray://172.16.3.61:10001 instead of 172.16.3.61:6379. Note that port 10001 is used here because it belongs to the Ray Client server, instead of 6379.
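A minimal sketch of the Ray Client form, assuming the head node runs the Ray Client server on port 10001 and that the port is reachable from the client machine. The helper names are made up for illustration.

```python
def client_address(head_ip: str, client_port: int = 10001) -> str:
    """Build a Ray Client address (ray://host:port) for the given head node."""
    return f"ray://{head_ip}:{client_port}"


def connect(head_ip: str):
    """Connect to the cluster through Ray Client without joining it as a node."""
    # Imported lazily so client_address() can be used without Ray installed.
    import ray

    # The ray:// scheme tells ray.init to go through the Ray Client server
    # rather than look for a local node's session directory.
    return ray.init(address=client_address(head_ip))
```

For the cluster in the comment above, this would be connect("172.16.3.61"), which connects to ray://172.16.3.61:10001.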