
[Core] `ray.wait` does not actually wait until ready when the task runs longer than 12 days

Open Michaelvll opened this issue 1 year ago • 2 comments

What happened + What you expected to happen

For a task that runs longer than 12 days, ray.wait called without a timeout returns an empty list of ready object refs after 10**6 seconds (about 11.5 days), even though the task has not finished.

This is inconsistent with ray.get, which keeps blocking until the object is available when no timeout is specified.
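As a quick sanity check on the figure above, the sketch below converts the reported 10**6-second default into days. The constant and the helper `effective_timeout_s` are hypothetical names used only for illustration; they mirror the behavior described in this issue and are not copied from Ray's source.

```python
from datetime import timedelta

# Default substituted when no timeout is passed, as reported in this issue.
# The name is hypothetical and not taken from Ray's code.
DEFAULT_WAIT_TIMEOUT_S = 10**6


def effective_timeout_s(timeout=None):
    """Return the timeout that would apply, per the behavior described above."""
    return DEFAULT_WAIT_TIMEOUT_S if timeout is None else timeout


days = effective_timeout_s() / 86400  # 86400 seconds per day
print(round(days, 2))                             # 11.57
print(timedelta(seconds=effective_timeout_s()))   # 11 days, 13:46:40
```

So any task that runs past roughly 11 days and 14 hours would outlast the implicit timeout.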

Versions / Dependencies

ray==2.9.3 (but I suppose it happens for all Ray versions), Python 3.10, OS: Ubuntu 20.04

Reproduction script

https://github.com/ray-project/ray/blob/1ccf9254c16d2cb0237fba5aa0a511c1177181c9/python/ray/_private/worker.py#L2852-L2853

Issue Severity

None

Michaelvll avatar Apr 22 '24 20:04 Michaelvll

Hi @Michaelvll, what's the cluster setup? Does the task run on the same node where ray.wait is called?

jjyao avatar Apr 29 '24 21:04 jjyao

> Hi @Michaelvll, what's the cluster setup? Does the task run on the same node where ray.wait is called?

Yes, the task runs on the same node as the driver, but I believe this happens in multi-node setups as well, because of the code linked above. ray.get does not have the issue.

Michaelvll avatar Apr 29 '24 22:04 Michaelvll

If it has always been set to 10**6 seconds, we should probably keep it as is so that we do not break compatibility.

It makes sense to have a default timeout like that so that the API call does not hang forever. Nonetheless, we should change the docs to mention this.

hongchaodeng avatar May 01 '24 23:05 hongchaodeng
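For callers who need to block until completion regardless of the default discussed in this thread, one possible workaround is to call the wait function in a loop with an explicit, shorter timeout. The helper below is a minimal sketch, not part of Ray: `wait_until_all_ready` and `poll_timeout_s` are hypothetical names, and the wait function is passed in as a parameter so the helper assumes only a `ray.wait`-style call that accepts `num_returns` and `timeout` keywords and returns a `(ready, not_ready)` pair.

```python
from typing import Any, Callable, List, Sequence, Tuple

# Any callable shaped like ray.wait: (refs, num_returns=..., timeout=...)
# returning a (ready, not_ready) pair of lists.
WaitFn = Callable[..., Tuple[List[Any], List[Any]]]


def wait_until_all_ready(
    wait_fn: WaitFn,
    refs: Sequence[Any],
    poll_timeout_s: float = 3600.0,
) -> List[Any]:
    """Call wait_fn repeatedly until every ref has been reported ready.

    Hypothetical helper: each round uses an explicit timeout, so the
    overall wait is not bounded by any implicit default.
    """
    pending = list(refs)
    ready: List[Any] = []
    while pending:
        done, pending = wait_fn(
            pending, num_returns=len(pending), timeout=poll_timeout_s
        )
        ready.extend(done)  # refs are collected in the order they complete
    return ready


# Intended usage with Ray (assumes ray is installed and initialized):
#   import ray
#   finished = wait_until_all_ready(ray.wait, [long_task.remote()])
```

Because each round passes its own timeout, a round that expires simply triggers another call instead of returning an empty ready list to the caller.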