
Error when starting too many workers.

Open robertnishihara opened this issue 8 years ago • 9 comments

If I start 150 workers on my laptop and run the following, Ray crashes.

import ray

ray.init(start_ray_local=True, num_workers=150)

@ray.remote
def f():
  pass

It fails with

E0915 18:03:15.745690000 123145305522176 wakeup_fd_pipe.c:53] pipe creation failed (24): Too many open files

robertnishihara avatar Sep 15 '16 18:09 robertnishihara

Options:

  1. setrlimit(RLIMIT_NOFILE)
  2. Use e.g. sockets + ports instead of pipes and connect them dynamically.

Though I don't think you should have that many workers on a single machine...
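For reference, here's a minimal sketch of option 1 using Python's standard resource module (POSIX only; the hard limit still caps how far the soft limit can be raised, and this is illustrative rather than something Ray does):

import resource

# Query the current per-process file descriptor limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft=%d hard=%d" % (soft, hard))

# Raise the soft limit up to the hard limit before starting workers.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))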

mehrdadn avatar Sep 15 '16 18:09 mehrdadn

Thanks! There's no need to have that many workers on my laptop, but on a machine with 100+ cores it makes sense. I suppose we either want to raise the limit or at least figure out the limit and then throw a Python exception if the user tries to start too many workers.
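A rough sketch of the "check the limit and fail early" idea (the per-worker descriptor count below is a made-up placeholder, not a number taken from Ray):

import resource

FDS_PER_WORKER = 5  # hypothetical estimate of descriptors each worker needs

def check_worker_limit(num_workers):
  soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
  needed = num_workers * FDS_PER_WORKER
  if needed > soft:
    raise Exception("Starting %d workers needs roughly %d file descriptors, "
                    "but the current limit is %d. Raise the limit (e.g. "
                    "ulimit -n) or start fewer workers."
                    % (num_workers, needed, soft))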

robertnishihara avatar Sep 15 '16 18:09 robertnishihara

Related note: you need 1 worker process per dependency depth in the computation graph, right?

mehrdadn avatar Sep 15 '16 18:09 mehrdadn

Not quite. The problem arises when a remote function calls ray.get, because it then blocks while waiting for another remote function to execute. It's therefore possible for every worker to be executing a remote function that is blocked in a get, in which case the program will hang. As an upper bound, we could need a number of workers equal to one plus the number of tasks that call get and could be executing at the same time.
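To make the blocking scenario concrete, here's a sketch of the nested-get pattern (written in the f.remote()/ray.get style for readability; the legacy API at the time differed, and whether this actually deadlocks depends on the scheduler and the size of the worker pool):

import ray

ray.init()

@ray.remote
def leaf():
  return 1

@ray.remote
def parent():
  # The worker running parent() blocks here until leaf() finishes. With a
  # pool of N workers, N tasks all blocked in a get like this leave no
  # worker free to run leaf(), and the program hangs.
  return ray.get(leaf.remote())

print(ray.get(parent.remote()))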

robertnishihara avatar Sep 15 '16 18:09 robertnishihara

Yeah, ok. IMO this should not be a constraint at all -- you should be able to run any program to completion with just 1 worker, albeit more slowly. There's at least one way (maybe more) of doing this, but it can require a redesign of the programming model, so you may want to look into that sooner rather than later.

mehrdadn avatar Sep 15 '16 18:09 mehrdadn

How would you do it?

robertnishihara avatar Sep 15 '16 18:09 robertnishihara

Well personally I would do it using the completion-based notification model I'd proposed earlier, which prevents blocking control flow altogether. Another possibility is fibers (cooperative multithreading), so that wait() actually suspends the current fiber and switches to a new one to service new requests without creating a new OS process/thread, but I'm a bit hesitant about this one.
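For a rough sense of the fiber idea, here's a sketch using Python coroutines as a stand-in for fibers (purely illustrative, not how Ray is or was implemented): waiting suspends only the current task, so one OS thread can keep servicing new work.

import asyncio

async def leaf():
  await asyncio.sleep(0.1)
  return 1

async def parent():
  # "Blocking" on the result suspends only this coroutine; the single
  # event loop (one OS thread playing the role of one worker) is free to
  # run other tasks while it waits.
  value = await leaf()
  return value + 1

async def main():
  results = await asyncio.gather(*[parent() for _ in range(10)])
  print(results)

asyncio.run(main())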

mehrdadn avatar Sep 15 '16 19:09 mehrdadn

@mehrdadn any updates on fibers? Is there any option currently to prevent blocking control flow during get?

ptyshevs avatar Sep 23 '20 17:09 ptyshevs

Uh, I'm not sure, unfortunately; it's been 4 years. :\

mehrdadn avatar Sep 23 '20 18:09 mehrdadn