[core][experimental] Calling ray.get() on CompiledDAGRef after dag.teardown() or actor failure hangs

Open stephanie-wang opened this issue 1 year ago • 6 comments

What happened + What you expected to happen

ray.get() should raise an error when it is called on a CompiledDAGRef after the DAG has already been torn down. Instead, ray.get() appears to produce an endless string of 0s and then hang.

Versions / Dependencies

3.0dev

Reproduction script

import ray
from ray.dag import InputNode

@ray.remote
class Actor:
    def foo(self, arg):
        return arg
        
a = Actor.remote()
with InputNode() as inp:
    dag = a.foo.bind(inp)
    
dag = dag.experimental_compile()
x = dag.execute(1)
# Kill the only actor in the DAG before reading the result.
ray.kill(a)
# Expected: raises an error. Actual: hangs and produces an endless string of 0s.
ray.get(x)

Issue Severity

None

stephanie-wang avatar Jun 26 '24 23:06 stephanie-wang

Possible root cause for #46253 ... Jack and Kai-Hsun to look into it.

anyscalesam avatar Jun 27 '24 00:06 anyscalesam

I reproduced this with the following script:

import ray
from ray.dag import InputNode

@ray.remote
class Actor:
    def foo(self, arg):
        return arg
        
a = Actor.remote()
with InputNode() as inp:
    dag = a.foo.bind(inp)
    
dag = dag.experimental_compile()
x = dag.execute(1)
# Tear the DAG down explicitly before reading the result.
dag.teardown()
# Expected: raises an error. Actual: hangs and produces an endless string of 0s.
print(ray.get(x))

jackhumphries avatar Jun 27 '24 21:06 jackhumphries

Yes, we want ray.get() to raise an error in both cases, whether the DAG has been explicitly torn down or an actor in the DAG has failed. The only difference is that the latter case should also print out an ActorDiedError.

stephanie-wang avatar Jun 27 '24 22:06 stephanie-wang

The issue seems to be that CoreWorker::Get() correctly returns an IOError (channel closed), but the Python code ignores this error status and then performs an invalid memory access on the buffer.
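To illustrate the failure mode described above, here is a minimal toy sketch in plain Python. It is not Ray's code: ToyStatus, ToyChannelClosed, toy_core_get, get_unchecked and get_checked are all hypothetical names, and the zero-filled bytearray only stands in for a buffer that was never written. It shows how a caller that ignores the returned status ends up reading meaningless data, while a caller that checks the status first raises instead.

```python
from dataclasses import dataclass
from typing import Tuple


class ToyChannelClosed(Exception):
    """Hypothetical error type; a stand-in for whatever the caller should see."""


@dataclass
class ToyStatus:
    # Toy status object, not Ray's real Status class.
    ok: bool
    message: str = ""


def toy_core_get(channel_open: bool) -> Tuple[ToyStatus, bytearray]:
    """Toy stand-in for a lower-level get that reports failure via a status."""
    buf = bytearray(8)  # zero-initialised; left untouched on failure
    if not channel_open:
        return ToyStatus(ok=False, message="Channel closed."), buf
    buf[:] = b"payload!"
    return ToyStatus(ok=True), buf


def get_unchecked(channel_open: bool) -> bytes:
    # Buggy pattern: the status is dropped and the buffer is read regardless.
    _status, buf = toy_core_get(channel_open)
    return bytes(buf)


def get_checked(channel_open: bool) -> bytes:
    # Safer pattern: inspect the status before touching the buffer.
    status, buf = toy_core_get(channel_open)
    if not status.ok:
        raise ToyChannelClosed(status.message)
    return bytes(buf)
```

In this toy model, get_unchecked(False) silently returns eight zero bytes, whereas get_checked(False) raises ToyChannelClosed with the message from the status.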

jackhumphries avatar Jun 27 '24 22:06 jackhumphries

I used the same example as https://github.com/ray-project/ray/issues/46284#issuecomment-2195714993.

I added a print statement, print("check_status: ", message, status.ok(), status.IsChannelError()), to check_status:

https://github.com/ray-project/ray/blob/755a49bbc2345650ff90d4706d9c69f87b8a77bd/python/ray/_raylet.pyx#L560-L561

I have two questions:

  • I can see "(Actor pid=2054169) check_status: Channel closed. False True" on my console. I expected the driver process to print a similar log line for the RayChannelError, but it didn't.
  • I think check_status should raise a RayChannelError, because status.IsChannelError() is True in the log above and get_objects has no try ... except logic. I therefore expected to see the RayChannelError reported somewhere, but I didn't. https://github.com/ray-project/ray/blob/755a49bbc2345650ff90d4706d9c69f87b8a77bd/python/ray/_raylet.pyx#L593
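The expectation in the second question can be sketched as a toy model. This is an assumption-laden illustration, not the code in _raylet.pyx: ToyStatus and ToyChannelError are hypothetical stand-ins (ToyChannelError plays the role of RayChannelError), and only the predicate names ok() and IsChannelError() and the debug print are taken from the comment above. The point is that if check_status raises and get_objects has no try ... except, the exception should propagate to whoever called get_objects.

```python
class ToyChannelError(Exception):
    """Hypothetical stand-in for RayChannelError."""


class ToyStatus:
    """Toy status exposing the two predicates mentioned in the comment."""

    def __init__(self, ok: bool, channel_error: bool = False, message: str = ""):
        self._ok = ok
        self._channel_error = channel_error
        self.message = message

    def ok(self) -> bool:
        return self._ok

    def IsChannelError(self) -> bool:
        return self._channel_error


def check_status(status: ToyStatus) -> None:
    # Debug print mirroring the one described above, to show which
    # process actually reaches this function.
    print("check_status: ", status.message, status.ok(), status.IsChannelError())
    if status.ok():
        return
    if status.IsChannelError():
        raise ToyChannelError(status.message)
    # Any other failure kind is reported generically in this toy model.
    raise RuntimeError(status.message)


def get_objects(status: ToyStatus) -> str:
    # No try/except here, so an exception from check_status reaches the caller.
    check_status(status)
    return "value"
```

Under these assumptions, get_objects(ToyStatus(ok=False, channel_error=True, message="Channel closed.")) raises ToyChannelError. If no such exception shows up in the driver, one possibility worth checking is that the failing status never passes through check_status in that process.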

kevin85421 avatar Jun 28 '24 07:06 kevin85421

I wrote a response for this in the PR: https://github.com/ray-project/ray/pull/46320#issuecomment-2197438895

jackhumphries avatar Jun 28 '24 18:06 jackhumphries