ray-legacy
Error when starting too many workers.
If I start 150 workers on my laptop and run the following, Ray crashes.
import ray
ray.init(start_ray_local=True, num_workers=150)
@ray.remote
def f():
    pass
It fails with
E0915 18:03:15.745690000 123145305522176 wakeup_fd_pipe.c:53] pipe creation failed (24): Too many open files
Options:
- setrlimit(RLIMIT_NOFILE)
- Use e.g. sockets + ports instead of pipes and connect them dynamically.
Though I don't think you should have that many workers on a single machine...
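For the first option, a minimal sketch of raising the soft open-file limit from Python with the standard `resource` module (Unix only). The helper name `raise_nofile_limit` and the target of 4096 are illustrative assumptions, not part of Ray, and the call is wrapped defensively because some platforms reject values the hard limit would seem to allow.

```python
import resource


def raise_nofile_limit(target):
    """Try to raise the soft RLIMIT_NOFILE to `target`; return the resulting soft limit.

    Hypothetical helper for illustration; not part of Ray.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # an unprivileged process cannot exceed the hard limit
    if target > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            pass  # the platform refused; keep the existing limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


new_soft = raise_nofile_limit(4096)
print("soft open-file limit is now", new_soft)
```

This would have to run in the process that spawns the workers, before they are started, since child processes inherit the limit.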
Thanks! There's no need to have that many workers on my laptop, but on a machine with 100+ cores it makes sense. I suppose we either want to raise the limit or at least figure out the limit and then throw a Python exception if the user tries to start too many workers.
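The "figure out the limit and throw a Python exception" idea might look roughly like the sketch below. The function `check_worker_count` and the `fds_per_worker` estimate are assumptions made up for illustration; the real number of descriptors each worker consumes would have to be measured.

```python
import resource


def check_worker_count(num_workers, fds_per_worker=4, soft_limit=None):
    """Raise ValueError if the requested workers would likely exhaust file descriptors.

    `fds_per_worker` is a guessed estimate, not a measured Ray value.
    """
    if soft_limit is None:
        soft_limit = resource.getrlimit(resource.RLIMIT_NOFILE)[0]
    if soft_limit == resource.RLIM_INFINITY:
        return  # no limit to check against
    needed = num_workers * fds_per_worker
    if needed > soft_limit:
        raise ValueError(
            f"{num_workers} workers need about {needed} file descriptors, "
            f"but the open-file limit is {soft_limit}"
        )


# Example with an explicit limit of 256, so the result does not depend on the machine.
try:
    check_worker_count(150, fds_per_worker=4, soft_limit=256)
except ValueError as err:
    print(err)
```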
Related note: you need 1 worker process per dependency depth in the computation graph, right?
Not quite. The problem arises when a remote function calls ray.get, because it then blocks while waiting for another remote function to execute. So it's possible for every worker to be executing a remote function that is blocked in a get, in which case the program will hang. As an upper bound, we could need a number of workers equal to one plus the number of tasks that all call get and could be executing at the same time.
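The starvation described here can be reproduced without Ray. The sketch below uses a standard-library thread pool as a stand-in for the worker pool, with a timeout so it reports the problem instead of hanging; `run_nested` is a hypothetical name, and the blocking `result()` call plays the role of a get inside a remote function.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def run_nested(num_workers, timeout=0.2):
    """Return True if a task that blocks on a nested task finishes within `timeout`."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:

        def inner():
            return 42

        def outer():
            # Blocks this worker while waiting for another task,
            # analogous to a get call inside a remote function.
            try:
                return pool.submit(inner).result(timeout=timeout)
            except TimeoutError:
                return None  # no free worker was available to run inner()

        return pool.submit(outer).result() == 42


print(run_nested(1))  # one blocking task, one worker: the nested task is starved
print(run_nested(2))  # one blocking task plus one spare worker: completes
```

With one blocking task, two workers suffice, which matches the "one plus the number of blocked tasks" bound.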
Yeah, ok. IMO this should not be a constraint at all -- you should be able to run any program to completion with just 1 worker, albeit more slowly. There's at least one way (maybe more) of doing this, but it can require a redesign of the programming model, so you may want to look into that sooner rather than later.
How would you do it?
Well personally I would do it using the completion-based notification model I'd proposed earlier, which prevents blocking control flow altogether. Another possibility is fibers (cooperative multithreading), so that wait() actually suspends the current fiber and switches to a new one to service new requests without creating a new OS process/thread, but I'm a bit hesitant about this one.
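To make the cooperative idea concrete, here is a minimal sketch using Python's `asyncio` as a stand-in for fibers. This is not Ray's API and not necessarily the design proposed above; it only illustrates that when waiting suspends the current task instead of blocking the thread, nested waits run to completion on a single thread.

```python
import asyncio


async def inner(x):
    await asyncio.sleep(0)  # yield control, as a task waiting on a result would
    return x * 2


async def outer(x):
    # `await` suspends this coroutine rather than blocking the thread,
    # so the single event loop is free to run inner() in the meantime.
    return await inner(x) + 1


async def main():
    # Five tasks that each wait on a nested task, all on one thread.
    return await asyncio.gather(*(outer(i) for i in range(5)))


results = asyncio.run(main())
print(results)
```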
@mehrdadn any updates on fibers? Is there any option currently to prevent blocking control flow during get?
Uh I'm not sure unfortunately, it's been 4 years. :\