[Core] BUG: Cluster crashes when using temp_dir "could not connect to socket" raylet.x [since 2.7+]
What happened + What you expected to happen
In newer versions (since Ray 2.7, but not in 2.6.3) I get the following error when submitting a script to a local cluster that has worker nodes connected to it AND uses a custom temp dir.
The error appears with this sequence: start head node -> add workers -> submit job / start script on head node.
The error does not appear when:
- starting the head node -> submitting the job / starting the script -> then adding workers (autoscale). However, once I stop a script running on the cluster and submit it again, it crashes in the same way.
- NOT using a temp dir.
The error is of the following type:
raylet_client.cc:60: Could not connect to socket "logdir_path/sockets/raylet.3" .. followed by a bunch of stack traces. It can also be raylet.2 or any other raylet.x ...
Again, it only happens with a temp_dir, and the error is not an issue in ray 2.6.3 or older, only in newer versions.
Versions / Dependencies
ray 2.7 or newer
Reproduction script
See the simple pipeline above, with any basic Ray script, e.g.:
- on the head node: ray start --head --temp-dir=/temp_dir/
- on a worker node: start a worker via "ray start" and connect it to the head node (a combined end-to-end sketch follows after the script below)
- execute a script on the cluster, e.g. the following on the head node:
import ray

ray.init(address='auto', _temp_dir="/temp_dir/")

@ray.remote
def square(x):
    return x * x

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))
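A minimal end-to-end sketch of the same reproduction, assuming 10.0.0.1 is the head node's IP, port 6379 is free, /temp_dir/ is reachable from both machines (e.g. over NFS), and the script above is saved as square.py:

# on the head node
ray start --head --port=6379 --temp-dir=/temp_dir/

# on each worker node
ray start --address=10.0.0.1:6379

# back on the head node, once the workers have joined
python square.py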
Issue Severity
This issue is pretty severe: I either cannot use a custom temp dir anymore, or I have to restart the whole Ray instance every time I want to start a new training job. (Again, the order matters: a temp dir works if I first start the head node, then send the script, and only then scale up the cluster. Not using any temp dir at all also avoids the error, but that is not an option for me.)
I am having this same issue in versions 2.9.2 and 2.9.3. We've been using Ray for years on fairly large HPCs and have not seen this error before. The error raylet_client.cc:60: Could not connect to socket ... only pops up with multi-node clusters; single-node Ray clusters do not show this error.
Some notes on my setup...
Host 1 : ray start --head --port=6379 --redis-password=redacted --num-cpus=96 --temp-dir=/path/on/nfs
Host 2 : ray start --address=host1:6379 --redis-password=redacted --temp-dir=/path/on/nfs --num-cpus=96
Both hosts appear to start ray as expected with no errors.
Simply trying to connect to the cluster on host 1 will generate the error.
python -c "import ray; ray.init(address='auto', _redis_password='redacted') ; print('Ray nodes: ', len(ray.nodes()))"
Edit 1:
One important thing to note here is that the temp dir path is on a shared filesystem. We haven't had issues in the past using an NFS path for the temp dir. Could it be that host1 or host2 are trying to read/write socket files that were created by the other host?
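A hedged diagnostic sketch for that question, assuming /path/on/nfs is the --temp-dir from the commands above and that Ray's usual session_latest/sockets layout is present; run it on both hosts and compare:

# list the socket files each host sees in the shared temp dir
ls -l /path/on/nfs/session_latest/sockets/

# check whether a process on *this* host is actually listening on them;
# a socket file created by the other host will show no local listener
ss -xl | grep raylet || echo "no local raylet listener on this host"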
Edit 2:
Same error with ray 2.10.0 as well
I can reproduce this issue on my cluster. Downgrading to 2.6.3 works as well (thanks for mentioning that!).
I've run into this issue as well with a setup where temp_dir is a directory on a NFS. If I use Singularity and mount a distinct directory for each node, rather than using a single directory shared across all nodes, I won't see the failure to connect to the raylet socket.
If I use Singularity and mount a distinct directory for each node, rather than using a single directory shared across all nodes
This also worked for me with individual ray tasks and was my go-to workaround before finding that downgrading helps, too.
However, for distributed ray tasks, this did not work because, if I understood the failure logs correctly, the individual ray workers all use the head worker's temp_dir.
However, for distributed ray tasks, this did not work because, if I understood the failure logs correctly, the individual ray workers all use the head worker's temp_dir.
With Singularity, I was able to bind different directories to the path pointed to by temp_dir. So, each node still uses the path specified by the head's temp_dir, but that path points to a different directory on each node.
But this is merely a workaround, right? Maybe one of the developers (@rkooo567) can chime in?
Yes, it feels like it's just a workaround to me, unless there was an expectation that temp_dir isn't a directory shared across nodes. Either there's missing documentation about that expectation or there's a bug with how temp_dir is leveraged by the nodes.
@thoglu are your head node and worker nodes running on the same machine, so that the temp_dir of the head node and the worker nodes points to the same physical directory?
@jjyao no, the head node and worker nodes run on different machines with a shared filesystem. The temp_dir is only defined on the head node via ray start --temp-dir=...; doing the same on the worker nodes prints a warning that temp_dir is ignored on worker nodes and should be defined on the head node, so it is not passed there.
Again, in 2.6.3 it works; starting with 2.7.0 this setup fails with the above error.
Hypothesis: we create Unix sockets in the temp dir, but NFS does not support Unix sockets, so the creation fails and then the read fails.
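A minimal sketch to test that hypothesis outside of Ray, assuming "/path/on/nfs" stands in for the NFS-backed temp dir. Whether bind() fails outright depends on the NFS client/server; even when it succeeds locally, a process on another host can never connect through the shared path, because AF_UNIX endpoints only exist in the kernel of the host that bound them.

import os
import socket

def check_unix_socket(directory):
    # try to bind and connect a Unix domain socket inside the given directory
    path = os.path.join(directory, "uds_test.sock")
    if os.path.exists(path):
        os.unlink(path)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        server.bind(path)      # may fail outright on some NFS mounts
        server.listen(1)
        client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        client.connect(path)   # only ever works on the host that bound it
        client.close()
        print(f"{directory}: local bind/connect OK")
    except OSError as exc:
        print(f"{directory}: Unix socket failed: {exc}")
    finally:
        server.close()
        if os.path.exists(path):
            os.unlink(path)

check_unix_socket("/tmp")          # local filesystem: expected to work
check_unix_socket("/path/on/nfs")  # shared temp dir: may fail here, and can
                                   # never be connected to from another host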
@rynewang and that happens since 2.7.0? Remember, the issue is not present in 2.6.3 and before. If I switch off the custom temp dir and Ray uses the default (in /tmp/..., I believe), it works. On top of the temp dir issue, in the new version I was not able to save logging artifacts (plots etc.) within each job subdirectory (using log_to_file=True); instead they were saved in the logging dir under "artifacts". From the documentation here: https://docs.ray.io/en/latest/train/user-guides/persistent-storage.html , it seems I need to specify "storage_path" and put the shared filesystem path there?
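Regarding the storage_path question, a hedged sketch of what the linked persistent-storage docs describe, assuming /path/on/nfs is the shared filesystem and "my_experiment" is a placeholder name; the resulting RunConfig is passed to your Trainer/Tuner as usual:

from ray.train import RunConfig

# checkpoints and result artifacts land under storage_path on the shared
# filesystem instead of the node-local temp dir
run_config = RunConfig(
    storage_path="/path/on/nfs/ray_results",
    name="my_experiment",  # placeholder experiment name
)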
Also having this issue: a collaborator uses Ray on an HPC via Slurm and wants to preserve the logs for later inspection. We are attempting to do so by passing the --temp-dir option when starting the head node, where that path points to a shared remote filesystem (on which sockets do not work). When the job gets scheduled on the same node as the head, it looks like Ray detects this and tries to use the socket. If started from a separate node, it seems to work fine (I can confirm this behavior on a Kubernetes-based cluster environment).
The issue for us could be solved either by being able to specify a separate socket dir (on a filesystem local to the node), by disabling Unix sockets, or by being able to specify a non-temp log dir (this may in fact be most appropriate, since "temp-dir" sounds temporary, and logs don't really fit that designation).
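One possible shape of the "separate socket dir" idea, sketched under the assumption that your Ray version exposes the --raylet-socket-name and --plasma-store-socket-name options of ray start (check ray start --help); logs stay on the shared filesystem while the sockets go to node-local disk:

# node-local directory for the sockets
mkdir -p /tmp/ray_sockets

ray start --head \
  --temp-dir=/path/on/nfs/ray_logs \
  --raylet-socket-name=/tmp/ray_sockets/raylet.sock \
  --plasma-store-socket-name=/tmp/ray_sockets/plasma_store.sock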
I also encountered this issue with ray 2.38.0
I've run into this issue as well with a setup where temp_dir is a directory on a NFS. If I use Singularity and mount a distinct directory for each node, rather than using a single directory shared across all nodes, I won't see the failure to connect to the raylet socket.
@keatincf Do you mind sharing the .bs script which achieves this?
I also encountered this problem in 2.40. We requested shared storage to keep all the logs, and /tmp/ray on the head and worker nodes points to the same disk. It only works the first time I submit; "could not connect to xx socket" appears on subsequent submissions. I either have to bring the cluster up again or add entrypoint_num_cpus and other parameters to ray job submit, but I don't know whether that has other side effects. There is no good solution right now.
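For reference, a hedged sketch of the resubmission workaround mentioned above, assuming the job goes through the dashboard at 127.0.0.1:8265 and that my_script.py is a placeholder for the actual entrypoint:

ray job submit --address=http://127.0.0.1:8265 \
  --entrypoint-num-cpus=1 \
  -- python my_script.py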
However, for distributed ray tasks, this did not work because, if I understood the failure logs correctly, the individual ray workers all use the head worker's temp_dir. With Singularity, I was able to bind different directories to the path pointed to by temp_dir. So, each node still uses the path specified by the head's temp_dir, but that path points to a different directory.
May I ask how you solved this problem? We have a similar problem now: we use the default /tmp/ray, which points to shared storage, and the head and worker nodes point to a shared-storage PVC.
I've run into this issue as well with a setup where temp_dir is a directory on a NFS. If I use Singularity and mount a distinct directory for each node, rather than using a single directory shared across all nodes, I won't see the failure to connect to the raylet socket.
@keatincf Do you mind sharing the .bs script which achieves this?
I can't share my script, but I pretty much just pointed to a /tmp directory that was available on each node, but wasn't a shared directory between the nodes.
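For what it's worth, a hedged sketch of what such a launch could look like with Singularity, assuming /path/on/nfs/ray_tmp is the path passed as --temp-dir, /scratch/local/ray_tmp is a node-local directory that exists on every node, and ray_image.sif is your container image; the bind makes the shared-looking path resolve to local disk on each node:

# on every node (head and workers), bind a node-local dir over the temp-dir path
mkdir -p /scratch/local/ray_tmp

# head node
singularity exec --bind /scratch/local/ray_tmp:/path/on/nfs/ray_tmp \
  ray_image.sif ray start --head --port=6379 --temp-dir=/path/on/nfs/ray_tmp

# worker nodes
singularity exec --bind /scratch/local/ray_tmp:/path/on/nfs/ray_tmp \
  ray_image.sif ray start --address=<head-ip>:6379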