[core][experimental] Calling ray.get() on CompiledDAGRef after dag.teardown() or actor failure hangs
What happened + What you expected to happen
We should throw an error if ray.get() is called on a CompiledDAGRef after the DAG has already been torn down. Instead, it seems that ray.get() returns an infinite string of 0s and hangs.
Versions / Dependencies
3.0dev
Reproduction script
import ray
from ray.dag import InputNode

@ray.remote
class Actor:
    def foo(self, arg):
        return arg

a = Actor.remote()
with InputNode() as inp:
    dag = a.foo.bind(inp)
dag = dag.experimental_compile()
x = dag.execute(1)
ray.kill(a)
# This hangs and returns infinite string of 0s.
ray.get(x)
Issue Severity
None
possible root cause for #46253 ... jack and kai-hsun to look into it.
I reproduced this with the following script:
import ray
from ray.dag import InputNode

@ray.remote
class Actor:
    def foo(self, arg):
        return arg

a = Actor.remote()
with InputNode() as inp:
    dag = a.foo.bind(inp)
dag = dag.experimental_compile()
x = dag.execute(1)
dag.teardown()
# This hangs and returns infinite string of 0s.
print(ray.get(x))
Yes, we want it to work in both cases, whether the DAG has been explicitly torn down or if an actor in the DAG failed. The only difference is that the latter case should also print out an ActorDiedError.
The issue seems to be that an IOError (channel closed) is being properly returned by CoreWorker::Get(), but the Python code is ignoring this error status and doing an invalid memory access to the buffer.
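The failure mode described above can be illustrated with a minimal, Ray-free sketch. All names here (FakeStatus, fake_get, read_unchecked, read_checked, ChannelClosedError) are hypothetical stand-ins and not Ray's actual internals; the only point is that the status returned by a get call has to be checked before the result buffer is touched.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


class ChannelClosedError(Exception):
    """Hypothetical stand-in for the error a closed channel should raise."""


@dataclass
class FakeStatus:
    """Hypothetical stand-in for a C++-style status object."""
    ok: bool
    message: str = ""


def fake_get(channel_open: bool) -> Tuple[FakeStatus, Optional[bytes]]:
    """Simulate a get that returns (status, buffer).

    The buffer is None when the get failed, standing in for memory
    that must not be read.
    """
    if not channel_open:
        return FakeStatus(ok=False, message="Channel closed."), None
    return FakeStatus(ok=True), b"\x01"


def read_unchecked(channel_open: bool) -> Optional[bytes]:
    """Buggy pattern: the status is ignored and the buffer is used anyway."""
    _status, buf = fake_get(channel_open)
    return buf  # invalid when the get failed


def read_checked(channel_open: bool) -> bytes:
    """Safer pattern: surface the error before touching the buffer."""
    status, buf = fake_get(channel_open)
    if not status.ok:
        raise ChannelClosedError(status.message)
    return buf
```

In this sketch, read_unchecked silently hands back an unusable buffer once the channel is closed, while read_checked turns the same condition into an exception the caller can see.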
I used the same example as https://github.com/ray-project/ray/issues/46284#issuecomment-2195714993.
I added a print function, print("check_status: ", message, status.ok(), status.IsChannelError()), to check_status.
https://github.com/ray-project/ray/blob/755a49bbc2345650ff90d4706d9c69f87b8a77bd/python/ray/_raylet.pyx#L560-L561
I have two questions:
- I can see "(Actor pid=2054169) check_status: Channel closed. False True" on my console. I expected that the driver process would also print similar logs for the RayChannelError, but it didn't.
- I think check_status should raise a RayChannelError because status.IsChannelError() is True in the above log, and get_objects (code) doesn't have any try ... except logic, so I expected to see some logs about RayChannelError, but I didn't. https://github.com/ray-project/ray/blob/755a49bbc2345650ff90d4706d9c69f87b8a77bd/python/ray/_raylet.pyx#L593
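To make the second question concrete, here is a small sketch of the expectation it describes: if a status check raises and the caller has no try ... except, the exception should reach whoever called get. The names below (Status, ChannelError, check_status, get_objects) echo the discussion but are simplified, hypothetical stand-ins and not the code in _raylet.pyx.

```python
class ChannelError(Exception):
    """Hypothetical stand-in for a channel error exception."""


class Status:
    """Hypothetical, simplified status object with two flags and a message."""

    def __init__(self, ok: bool, channel_error: bool = False, message: str = ""):
        self._ok = ok
        self._channel_error = channel_error
        self._message = message

    def ok(self) -> bool:
        return self._ok

    def is_channel_error(self) -> bool:
        return self._channel_error

    def message(self) -> str:
        return self._message


def check_status(status: Status) -> None:
    """Raise an exception that matches the failed status; do nothing if ok."""
    if status.ok():
        return
    if status.is_channel_error():
        raise ChannelError(status.message())
    raise RuntimeError(status.message())


def get_objects(status: Status, values: list) -> list:
    """Caller without try/except: any error from check_status propagates."""
    check_status(status)
    return values
```

Under these assumptions, a status with ok() False and is_channel_error() True cannot pass through get_objects silently, which is the behavior the question expects to observe on the driver.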
I wrote a response for this in the PR: https://github.com/ray-project/ray/pull/46320#issuecomment-2197438895