Eric Liang
Should we raise an error if the number of ports is too small for the CPU count? That way we can avoid inadvertent bricking of Ray via config, etc.
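A minimal sketch of the kind of check suggested above, assuming the port range is given as an inclusive min/max pair and that one port per CPU is the requirement. The function name `check_worker_port_range` and its parameters are hypothetical, not part of Ray's API:

```python
def check_worker_port_range(min_port: int, max_port: int, num_cpus: int) -> int:
    """Return the number of ports in [min_port, max_port].

    Raises ValueError if the range is empty or has fewer ports than CPUs.
    Hypothetical helper: the one-port-per-CPU rule is an assumption.
    """
    if min_port > max_port:
        raise ValueError(
            f"Invalid port range: min_port={min_port} > max_port={max_port}"
        )
    num_ports = max_port - min_port + 1  # the range is inclusive on both ends
    if num_ports < num_cpus:
        raise ValueError(
            f"Port range {min_port}-{max_port} provides {num_ports} ports, "
            f"but {num_cpus} CPUs are configured. Widen the range or "
            f"reduce the CPU count."
        )
    return num_ports
```

Failing at startup with an explicit message makes a misconfiguration visible immediately, rather than as workers that cannot start later on.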
What if we just have a FIFO queue of stats? Like the most recent 10000 Dataset stats, which should suffice for almost everyone, but safeguard against any worst case OOMs...
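The FIFO idea could be sketched roughly as follows. `BoundedStatsStore` and its methods are hypothetical names for illustration, not Ray's actual stats implementation; the default cap of 10000 comes from the comment above:

```python
from collections import OrderedDict
from typing import Any, Optional


class BoundedStatsStore:
    """Keeps stats for at most `max_entries` datasets, evicting the oldest first."""

    def __init__(self, max_entries: int = 10000):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max_entries = max_entries
        self._stats: "OrderedDict[str, Any]" = OrderedDict()

    def record(self, dataset_id: str, stats: Any) -> None:
        # Updating an existing key keeps its original insertion position,
        # so eviction order stays strictly first-in, first-out.
        self._stats[dataset_id] = stats
        while len(self._stats) > self._max_entries:
            self._stats.popitem(last=False)  # drop the oldest entry

    def get(self, dataset_id: str) -> Optional[Any]:
        return self._stats.get(dataset_id)

    def __len__(self) -> int:
        return len(self._stats)
```

Memory is bounded by `max_entries` regardless of how many datasets are created, at the cost of losing stats for the oldest ones.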
> Add an opt-in flag for enabling multi-node clusters for OSX and Windows

Is there a good reason to document this flag? It seems preferable to raise an exception and...
Seems like there's a test_cli failure.
Agree with @raulchen. This is a good fix, but we should make it apply only to map_groups() and not to all map operations.
Hey @erezinman, I think this behavior is fixed in Ray nightly, could you verify? This was the tracking issue: https://github.com/ray-project/ray/issues/29624 After the patch, Ray will destroy the task worker...
Yup, that's the same issue then. For actors, we always destroy their worker processes after they are killed, so GPU memory leaks aren't an issue. Previously, for GPU tasks we...
Hmm, that shouldn't happen across runs. Do you have a reproduction script? When I try to run / cancel / resume a job locally, I see trials using new PIDs...
I see. It's possible the GPU processes are leaked. Do you see those processes lingering in the cluster after the ctrl+x (c) but before the second run? Do they...
So it looks like Ray is releasing the resources / processes on its side. The processes might still be stuck on shutdown for some other reason, though. Could you...