[Core] BUG: Cluster crashes when using temp_dir "could not connect to socket" raylet.x [since 2.7+]

Open thoglu opened this issue 3 months ago • 9 comments

What happened + What you expected to happen

In newer versions (since Ray 2.7, but not in 2.6.3) I get the following error when submitting a script to a local cluster that has worker nodes connected to it AND uses a temp dir.

The error appears with this order: start head node -> add workers -> submit job / start script on head node.

The error does not appear in these cases:

  1. start head node -> submit job / start script -> then add workers (autoscale); no error anymore. However, once I stop a script running on the cluster and then submit it again, it will crash in the same way.
  2. NOT using a temp dir -> no error anymore

The error is of the following type:

raylet_client.cc:60: Could not connect to socket "logdir_path/sockets/raylet.3", followed by a number of stack traces. It can also be raylet.2 or any other raylet.x.

Again, it only happens with a temp_dir, and the error does not occur in Ray 2.6.3 or older, only in newer versions.

Versions / Dependencies

ray 2.7 or newer

Reproduction script

See the simple pipeline above, with any basic Ray script, e.g.:

  1. on head node: ray start --head --temp-dir=/temp_dir/
  2. on worker node: connect a worker via "ray start" and point it at the head node (see the command sketch after the script below)
  3. execute a script on the cluster, e.g. the following on the head node:
import ray

ray.init(address='auto', _temp_dir="/temp_dir/")

@ray.remote
def square(x):
    return x * x

futures = [square.remote(i) for i in range(4)]

print(ray.get(futures))
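
For step 2, a minimal sketch of the worker-node command (the head node address and port here are placeholders, not values from this report):

ray start --address=<head_node_ip>:6379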

Issue Severity

This issue is pretty severe, since I cannot use the temp dir anymore, or I have to restart the whole Ray instance any time I want to start a new training job. (Again, the order matters: it is possible to use a temp dir by first starting the head node, then sending the script, and only then scaling the cluster, or to avoid using any temp dir at all, but the latter is not possible for me.)

thoglu avatar Apr 02 '24 21:04 thoglu

I am having this same issue in versions 2.9.2 and 2.9.3. We've been using Ray for years on fairly large HPCs and have not seen this error before. The error raylet_client.cc:60: Could not connect to socket ... only pops up with multi-node clusters. Single-node Ray clusters do not show this error.

Some notes on my setup...

Host 1: ray start --head --port=6379 --redis-password=redacted --num-cpus=96 --temp-dir=/path/on/nfs
Host 2: ray start --address=host1:6379 --redis-password=redacted --temp-dir=/path/on/nfs --num-cpus=96

Both hosts appear to start ray as expected with no errors.

Simply trying to connect to the cluster on host 1 will generate the error.

python -c "import ray; ray.init(address='auto', _redis_password='redacted') ; print('Ray nodes: ', len(ray.nodes()))"

Edit 1:

One important thing to note here is that the temp dir path is on a shared filesystem. We haven't had issues in the past using an NFS path for a temp dir. Could it be that host1 or host2 are trying to read/write to socket files that were created by the other host?
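
One way to check that (the session_latest layout below reflects Ray's usual temp-dir structure and is an assumption, not taken from the logs above) is to list the socket files each host sees under the shared temp dir:

ls -la /path/on/nfs/session_latest/sockets/

If both hosts list the same raylet.* files, they are indeed sharing socket files created by different machines.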

Edit 2:

Same error with ray 2.10.0 as well

sylv01 avatar Apr 10 '24 14:04 sylv01

I can reproduce this issue on my cluster. Downgrading to 2.6.3 works as well (thanks for mentioning that!).

LennartPurucker avatar Apr 24 '24 22:04 LennartPurucker

I've run into this issue as well, with a setup where temp_dir is a directory on an NFS. If I use Singularity and mount a distinct directory for each node, rather than using a single directory shared across all nodes, I don't see the failure to connect to the raylet socket.
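
A minimal sketch of that workaround (the container image name, local scratch path, and /temp_dir target are placeholders, not the exact values from my setup):

# on each node, bind a node-local directory over the path Ray uses as its temp dir
mkdir -p /local/scratch/ray_tmp
singularity exec --bind /local/scratch/ray_tmp:/temp_dir ray_image.sif \
    ray start --head --temp-dir=/temp_dir
# worker nodes use the same --bind but start with: ray start --address=<head_node_ip>:6379

Each node then resolves /temp_dir to its own local directory, even though the path itself is identical everywhere.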

keatincf avatar May 03 '24 16:05 keatincf

If I use Singularity and mount a distinct directory for each node, rather than using a single directory shared across all nodes

This also worked for me with individual ray tasks and was my go-to workaround before finding that downgrading helps, too.

However, for distributed ray tasks, this did not work because, if I understood the failure logs correctly, the individual ray workers all use the head worker's temp_dir.

LennartPurucker avatar May 03 '24 18:05 LennartPurucker

However, for distributed ray tasks, this did not work because, if I understood the failure logs correctly, the individual ray workers all use the head worker's temp_dir.

With Singularity, I was able to bind different directories to the path pointed to by temp_dir. So, each node still uses the path specified by the head's temp_dir, but that path points to a different directory on each node.
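
A plain bind mount should give the same effect without Singularity; this is only a sketch with placeholder paths, not something from this thread:

# on each node, overlay a node-local scratch directory onto the shared temp_dir path
mkdir -p /local/scratch/ray_tmp
sudo mount --bind /local/scratch/ray_tmp /temp_dir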

keatincf avatar May 03 '24 19:05 keatincf

But this is merely a workaround, right? Maybe some of the developers (@rkooo567) can chime in?

thoglu avatar May 08 '24 13:05 thoglu

Yes, it feels like it's just a workaround to me, unless there was an expectation that temp_dir isn't a directory shared across nodes. Either there's missing documentation about that expectation or there's a bug with how temp_dir is leveraged by the nodes.

keatincf avatar May 08 '24 13:05 keatincf

@thoglu are your head node and worker nodes running on the same machine, so that the temp_dir of the head node and the worker nodes points to the same physical directory?

jjyao avatar May 13 '24 21:05 jjyao

@jjyao no, the head node and worker nodes run on different machines with a shared filesystem. The temp_dir is only defined on the head node via ray start --temp-dir=...; passing the same flag on the worker nodes reports that temp_dir is ignored on worker nodes and should be defined on the head node, so it is not set there.

Again, in 2.6.3 it works; starting with 2.7.0 this setup fails with the above error.

thoglu avatar May 15 '24 13:05 thoglu

Hypothesis: we create Unix domain sockets in the temp dir, but NFS does not support Unix domain sockets, so the creation fails and then the subsequent read fails.
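
A quick way to probe that hypothesis (paths are placeholders; note that even where NFS allows the socket file to be created, a Unix domain socket is only reachable from the kernel that created it, so a second node sharing the path could never connect):

import os
import socket

def try_bind(path):
    # Attempt to create a Unix domain socket at the given path and report the result.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        if os.path.exists(path):
            os.remove(path)
        sock.bind(path)
        print(f"bind() succeeded on {path}")
    except OSError as e:
        print(f"bind() failed on {path}: {e}")
    finally:
        sock.close()

try_bind("/path/on/nfs/test.sock")  # shared temp dir (assumed NFS mount)
try_bind("/tmp/test.sock")          # node-local path for comparison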

rynewang avatar May 20 '24 21:05 rynewang