
[core] ray.get() hangs ("likely hung" warnings) until the actor is force-killed

yang20150702 opened this issue 2 weeks ago · 0 comments

Question: ray cluster start error

python-core-woroke-001xxxx_1569.log:

```
[2025-12-10 04:47:37,669 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 224s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:38,679 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 225s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:39,689 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 226s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:40,700 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 227s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:41,710 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 228s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:42,720 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 229s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:43,742 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 230s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:44,752 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 231s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:45,793 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 232s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:46,803 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 233s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:47,814 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 234s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:48,825 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 235s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:49,837 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 236s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:50,848 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 237s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:51,860 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 238s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:52,870 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 239s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:53,881 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 240s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:47:54,891 W 1569 1569] plasma_store_provider.cc:452: Objects 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505 are still not local after 241s. If this message continues to print, ray.get() is likely hung. Please file an issue at https://github.com/ray-project/ray/issues/.
[2025-12-10 04:56:33,089 I 1569 1801] core_worker.cc:4311: Force kill actor request has received. exiting immediately... The actor is dead because all references to the actor were removed.
[2025-12-10 04:56:33,089 W 1569 1801] core_worker.cc:1082: Force exit the process. Details: Worker exits because the actor is killed. The actor is dead because all references to the actor were removed.
[2025-12-10 04:56:33,093 I 1569 1801] core_worker.cc:979: Try killing all child processes of this worker as it exits. Child process pids: 3730
[2025-12-10 04:56:33,093 I 1569 1801] core_worker.cc:988: Kill result for child pid 3730: Success, bool 0
[2025-12-10 04:56:33,094 I 1569 1801] core_worker.cc:938: Disconnecting to the raylet.
[2025-12-10 04:56:33,094 I 1569 1801] raylet_client.cc:162: RayletClient::Disconnect, exit_type=INTENDED_SYSTEM_EXIT, exit_detail=Worker exits because the actor is killed. The actor is dead because all references to the actor were removed., has creation_task_exception_pb_bytes=0
```

actor.log:

```
  File "/app/utils/logging_utils.py", line 131, in wrapped_func
    return func(*args, **kwargs)
  File "/app/distributed/coordinators.py", line 567, in sample_thread
    info = ray.get(future)
  File "/mnt/ephemeral/session_2025-12-10_04-37-10_341481_1/runtime_resources/pip/b55d771da1b2ae82fac5b04cb359e3f11d0d8e6b/virtualenv/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/mnt/ephemeral/session_2025-12-10_04-37-10_341481_1/runtime_resources/pip/b55d771da1b2ae82fac5b04cb359e3f11d0d8e6b/virtualenv/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/mnt/ephemeral/session_2025-12-10_04-37-10_341481_1/runtime_resources/pip/b55d771da1b2ae82fac5b04cb359e3f11d0d8e6b/virtualenv/lib/python3.12/site-packages/ray/_private/worker.py", line 2755, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/mnt/ephemeral/session_2025-12-10_04-37-10_341481_1/runtime_resources/pip/b55d771da1b2ae82fac5b04cb359e3f11d0d8e6b/virtualenv/lib/python3.12/site-packages/ray/_private/worker.py", line 906, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OwnerDiedError): ray::JoshuaActor.act() (pid=1233, ip=10.9.132.199, actor_id=acb5eda9cc7f3c0933f1555b01000000, repr=<specific.joshua.actor.JoshuaActor object at 0x7f74351d1c40>)
  File "/app/distributed/actors.py", line 38, in act
    moms, sample_info_dict = self.sample(_model_id_weights)
  File "/app/actors.py", line 80, in sample
    self.agent4rl.set_weights(ray.get(weights), model_index)
ray.exceptions.OwnerDiedError: Failed to retrieve object 00618f05a32305e8a4f6975825ec1bac3a4279070100000001e1f505. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during ray start and ray.init().
```

The object's owner has exited. This is the Python worker that first created the ObjectRef via .remote() or ray.put(). Check cluster logs (/tmp/ray/session_latest/logs/*fa4b65cc068f36afa45fcf9116f0261533f73f25526240122cda43c8* at IP address) for more information about the Python worker failure.
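The error message above suggests a first debugging step: enable `RAY_record_ref_creation_sites=1` so Ray records the Python call site that created each ObjectRef, which helps identify which worker owned the lost object and why it exited. A minimal sketch of enabling that flag (the driver script name is a placeholder; the variable must be set before `ray start` on each node and before `ray.init()` in the driver):

```shell
# Record the creation site of every ObjectRef (adds some overhead;
# intended for debugging, not steady-state production use).
export RAY_record_ref_creation_sites=1

# The flag must be in the environment of the head/worker raylets...
ray start --head

# ...and of the driver process before ray.init() runs.
# (my_driver.py is a placeholder for your own entry point.)
python my_driver.py
```

With the flag set, a subsequent OwnerDiedError report should include where the dead ObjectRef was created, which narrows down whether the owner was the driver, a task worker, or an actor whose handle went out of scope.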

How can I fix this problem?

yang20150702 · Dec 10 '25 06:12