rayfed
Add user-friendly debug log
The current debug output looks like this:
2023-02-10 07:06:00,812 WARNING worker.py:1851 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff90dafcf9eb358393d790d3d503000000 Worker ID: a3ecdc4fd2ea61036a0ff1d373a1d904fef544dec5e55a7b1b0ba0f3 Node ID: ae60bbcf8816aaa352738f655f6066e8dc9c1cecf951a0f62cef41b9 Worker IP address: 172.16.201.146 Worker port: 10014 Worker PID: 1377 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-02-10 07:06:00 WARNING fed.cleanup [alice] -- Failed to send ObjectRef(4984f7d6f9a761ad90dafcf9eb358393d790d3d50300000001000000) with error: The actor died unexpectedly before finishing this task.
class_name: SendProxyActor
actor_id: 90dafcf9eb358393d790d3d503000000
pid: 1377
name: SendProxyActor
namespace: c2ec2b3b-3cf8-46b1-8c9f-3e9985157c6d
ip: 172.16.201.146
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2023-02-10 07:06:00 WARNING fed.cleanup [alice] -- Signal self to exit.
*** SIGTERM received at time=1676012760 on cpu 10 ***
2023-02-10 07:06:00 INFO fed.cleanup [alice] -- Check sending thread was exited.
PC: @ 0x7f9f2f4f174a (unknown) pthread_cond_timedwait@@GLIBC_2.3.2
@ 0x7f9f2f4f5c20 (unknown) (unknown)
[2023-02-10 07:06:00,816 E 1214 1214] logging.cc:361: *** SIGTERM received at time=1676012760 on cpu 10 ***
[2023-02-10 07:06:00,817 E 1214 1214] logging.cc:361: PC: @ 0x7f9f2f4f174a (unknown) pthread_cond_timedwait@@GLIBC_2.3.2
[2023-02-10 07:06:00,817 E 1214 1214] logging.cc:361: @ 0x7f9f2f4f5c20 (unknown) (unknown)
(secretflow) bash-4.4# python sl_test_alice.py
We should provide a more user-friendly log that clearly shows the connection error.
Good issue. Do you have any thoughts on how we could improve that?
I ran into the same issue. I am trying to debug it and am wondering what kind of connection error causes this.
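One possible direction, as a rough sketch only: condense the long actor-died message into a single line that names the failed actor, its process, and the worker exit detail. The helper `summarize_send_failure` below is hypothetical and not part of RayFed or Ray. It assumes the error text has the same field layout as the log above (`class_name:`, `pid:`, `ip:`, `Worker exit type:`, `Worker exit detail:`) and does plain text parsing, so it would need adjusting if Ray changes that format.

```python
import re

# Hypothetical helper, not an existing RayFed or Ray API. It only parses the
# text of the error message, assuming the field layout seen in the log above.
_FIELDS = {
    "actor": re.compile(r"class_name:\s*(\S+)"),
    "ip": re.compile(r"\bip:\s*(\S+)"),
    "pid": re.compile(r"\bpid:\s*(\d+)"),
    "exit_type": re.compile(r"Worker exit type:\s*(\w+)"),
    # Stop before the generic "potential root causes" boilerplate, if present.
    "exit_detail": re.compile(
        r"Worker exit detail:\s*(.+?)(?=\s+There are some potential root causes|$)",
        re.DOTALL,
    ),
}


def summarize_send_failure(party: str, error_text: str) -> str:
    """Return a one-line summary of a failed send, given the raw error text."""
    found = {}
    for name, pattern in _FIELDS.items():
        match = pattern.search(error_text)
        # Fall back to "unknown" so a changed message format never raises here.
        found[name] = match.group(1).strip() if match else "unknown"
    return (
        f"[{party}] Sending data failed: {found['actor']} "
        f"(pid {found['pid']}, ip {found['ip']}) exited with "
        f"{found['exit_type']}: {found['exit_detail']}"
    )
```

The summary could be logged in place of, or just before, the full error text, so the raw detail stays available for deeper debugging.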