[core] core worker check-fails when user code raised BaseException
What happened + What you expected to happen
Ray handles Exceptions raised by user code and forwards them to callers. However, when user code raises a BaseException, the core worker check-fails and the worker process dies.
The same thing happens with async code, actor code, and streaming generator code.
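For context, here is a minimal pure-Python sketch of why a bare BaseException can slip past a handler written for Exception. It does not use Ray, and run_like_a_task is a hypothetical stand-in for a task runner, not Ray's actual code. It only illustrates the language rule that BaseException is not a subclass of Exception.

```python
def run_like_a_task(fn):
    """Hypothetical task runner: forwards ordinary errors, nothing else."""
    try:
        return ("ok", fn())
    except Exception as e:  # does NOT match a bare BaseException
        return ("forwarded", repr(e))


def raises_value_error():
    raise ValueError("ordinary error")


def raises_base_exception():
    raise BaseException("hehe")


# An ordinary Exception subclass is caught and turned into a result.
print(run_like_a_task(raises_value_error))

# A bare BaseException propagates out of the runner instead.
escaped = None
try:
    run_like_a_task(raises_base_exception)
except BaseException as e:
    escaped = e
    print("escaped the handler:", repr(e))
```

KeyboardInterrupt, SystemExit and GeneratorExit also derive directly from BaseException, so they bypass an `except Exception` handler in the same way.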
% RAY_TASK_MAX_RETRIES=0 python 2.py
2024-02-23 21:10:06,750 INFO worker.py:1744 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
[2024-02-23 21:10:06,752 I 80217 33206410] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: e4241f447476edb41a83dd9af5081ce299916b4701000000 Worker ID: 13ba31a37c66713e859b654fd67e42907914c13d1e431228dad9148b Node ID: 9df4d6e42a232cf7f9cf0a22ca3c70c693c8b5a1eb4c0e67a71be092 Worker IP address: 127.0.0.1 Worker port: 60729 Worker PID: 80274 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(raylet) [2024-02-23 21:10:07,239 I 80266 33206719] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1
(f pid=80274) Traceback (most recent call last):
(f pid=80274) File "python/ray/_raylet.pyx", line 2209, in ray._raylet.task_execution_handler
(f pid=80274) File "python/ray/_raylet.pyx", line 2105, in ray._raylet.execute_task_with_cancellation_handler
(f pid=80274) File "python/ray/_raylet.pyx", line 1759, in ray._raylet.execute_task
(f pid=80274) File "python/ray/_raylet.pyx", line 1760, in ray._raylet.execute_task
(f pid=80274) File "python/ray/_raylet.pyx", line 1810, in ray._raylet.execute_task
(f pid=80274) File "python/ray/_raylet.pyx", line 1816, in ray._raylet.execute_task
(f pid=80274) File "/Users/ruiyangwang/tmp/gensegv/2.py", line 5, in f
(f pid=80274) raise BaseException("hehe")
(f pid=80274) BaseException: hehe
(f pid=80274) Exception ignored in: 'ray._raylet.task_execution_handler'
(f pid=80274) Traceback (most recent call last):
(f pid=80274) File "python/ray/_raylet.pyx", line 2209, in ray._raylet.task_execution_handler
(f pid=80274) File "python/ray/_raylet.pyx", line 2105, in ray._raylet.execute_task_with_cancellation_handler
(f pid=80274) File "python/ray/_raylet.pyx", line 1759, in ray._raylet.execute_task
(f pid=80274) File "python/ray/_raylet.pyx", line 1760, in ray._raylet.execute_task
(f pid=80274) File "python/ray/_raylet.pyx", line 1810, in ray._raylet.execute_task
(f pid=80274) File "python/ray/_raylet.pyx", line 1816, in ray._raylet.execute_task
(f pid=80274) File "/Users/ruiyangwang/tmp/gensegv/2.py", line 5, in f
(f pid=80274) raise BaseException("hehe")
(f pid=80274) BaseException: hehe
(f pid=80274) [2024-02-23 21:10:07,314 C 80274 33206736] direct_actor_transport.cc:205: Check failed: objects_valid
(f pid=80274) *** StackTrace Information ***
(f pid=80274) 0 _raylet.so 0x0000000103b526a4 _ZN3raylsERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEERKNS_10StackTraceE + 84 ray::operator<<()
(f pid=80274) 1 _raylet.so 0x0000000103b78768 _ZN3ray13SpdLogMessage5FlushEv + 220 ray::SpdLogMessage::Flush()
(f pid=80274) 2 _raylet.so 0x0000000103b785e8 _ZN3ray13SpdLogMessageD2Ev + 24 ray::SpdLogMessage::~SpdLogMessage()
(f pid=80274) 3 _raylet.so 0x0000000103b55748 _ZN3ray6RayLogD2Ev + 52 ray::RayLog::~RayLog()
(f pid=80274) 4 _raylet.so 0x0000000103350750 _ZNSt3__110__function6__funcIZN3ray4core28CoreWorkerDirectTaskReceiver10HandleTaskERKNS2_3rpc15PushTaskRequestEPNS5_13PushTaskReplyENS_8functionIFvNS2_6StatusENSB_IFvvEEESE_EEEE3$_0NS_9allocatorISH_EEFvSG_EEclEOSG_ + 5464 std::__1::__function::__func<>::operator()()
(f pid=80274) 5 _raylet.so 0x000000010332da18 _ZN3ray4core14InboundRequest6AcceptEv + 136 ray::core::InboundRequest::Accept()
(f pid=80274) 6 _raylet.so 0x000000010336b268 _ZN3ray4core21NormalSchedulingQueue16ScheduleRequestsEv + 320 ray::core::NormalSchedulingQueue::ScheduleRequests()
(f pid=80274) 7 _raylet.so 0x00000001035a10e4 _ZN12EventTracker15RecordExecutionERKNSt3__18functionIFvvEEENS0_10shared_ptrI11StatsHandleEE + 260 EventTracker::RecordExecution()
(f pid=80274) 8 _raylet.so 0x000000010359b3b4 _ZNSt3__110__function6__funcIZN23instrumented_io_context4postENS_8functionIFvvEEENS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEExE3$_0NS9_ISC_EES4_EclEv + 56 std::__1::__function::__func<>::operator()()
(f pid=80274) 9 _raylet.so 0x000000010359ac70 _ZN5boost4asio6detail18completion_handlerINSt3__18functionIFvvEEENS0_10io_context19basic_executor_typeINS3_9allocatorIvEELm0EEEE11do_completeEPvPNS1_19scheduler_operationERKNS_6system10error_codeEm + 220 boost::asio::detail::completion_handler<>::do_complete()
(f pid=80274) 10 _raylet.so 0x0000000103c9097c _ZN5boost4asio6detail9scheduler10do_run_oneERNS1_27conditionally_enabled_mutex11scoped_lockERNS1_21scheduler_thread_infoERKNS_6system10error_codeE + 656 boost::asio::detail::scheduler::do_run_one()
(f pid=80274) 11 _raylet.so 0x0000000103c85dfc _ZN5boost4asio6detail9scheduler3runERNS_6system10error_codeE + 200 boost::asio::detail::scheduler::run()
(f pid=80274) 12 _raylet.so 0x0000000103c85ce4 _ZN5boost4asio10io_context3runEv + 32 boost::asio::io_context::run()
(f pid=80274) 13 _raylet.so 0x000000010320aa68 _ZN3ray4core10CoreWorker20RunTaskExecutionLoopEv + 200 ray::core::CoreWorker::RunTaskExecutionLoop()
(f pid=80274) 14 _raylet.so 0x00000001032b9f0c _ZN3ray4core21CoreWorkerProcessImpl26RunWorkerTaskExecutionLoopEv + 332 ray::core::CoreWorkerProcessImpl::RunWorkerTaskExecutionLoop()
(f pid=80274) 15 _raylet.so 0x00000001032b9d98 _ZN3ray4core17CoreWorkerProcess20RunTaskExecutionLoopEv + 32 ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
(f pid=80274) 16 _raylet.so 0x00000001030e7650 _ZL50__pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loopP7_objectS0_ + 24 __pyx_pw_3ray_7_raylet_10CoreWorker_7run_task_loop()
(f pid=80274) 17 python3.10 0x00000001008a6b3c method_vectorcall_NOARGS + 116 method_vectorcall_NOARGS
(f pid=80274) 18 python3.10 0x00000001009ad1c4 _PyEval_EvalFrameDefault + 34136 _PyEval_EvalFrameDefault
(f pid=80274) 19 python3.10 0x0000000100897cf4 _PyFunction_Vectorcall + 548 _PyFunction_Vectorcall
(f pid=80274) 20 python3.10 0x00000001009ad1c4 _PyEval_EvalFrameDefault + 34136 _PyEval_EvalFrameDefault
(f pid=80274) 21 python3.10 0x00000001009a2f60 _PyEval_Vector + 532 _PyEval_Vector
(f pid=80274) 22 python3.10 0x0000000100a1c27c run_mod + 220 run_mod
(f pid=80274) 23 python3.10 0x0000000100a1c01c pyrun_file + 156 pyrun_file
(f pid=80274) 24 python3.10 0x0000000100a1ba68 _PyRun_SimpleFileObject + 316 _PyRun_SimpleFileObject
(f pid=80274) 25 python3.10 0x0000000100a1b3d0 _PyRun_AnyFileObject + 216 _PyRun_AnyFileObject
(f pid=80274) 26 python3.10 0x0000000100a3fa8c pymain_run_file_obj + 196 pymain_run_file_obj
(f pid=80274) 27 python3.10 0x0000000100a3f318 pymain_run_file + 72 pymain_run_file
(f pid=80274) 28 python3.10 0x0000000100a3e9b8 pymain_run_python + 340 pymain_run_python
(f pid=80274) 29 python3.10 0x0000000100a3e80c Py_RunMain + 40 Py_RunMain
(f pid=80274) 30 python3.10 0x0000000100837b58 main + 56 main
(f pid=80274) 31 dyld 0x0000000194fcff28 start + 2236 start
(f pid=80274)
Traceback (most recent call last):
File "/Users/ruiyangwang/tmp/gensegv/2.py", line 7, in <module>
ray.get(f.remote())
File "/Users/ruiyangwang/gits/ray/python/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/Users/ruiyangwang/gits/ray/python/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/Users/ruiyangwang/gits/ray/python/ray/_private/worker.py", line 2647, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/Users/ruiyangwang/gits/ray/python/ray/_private/worker.py", line 866, in get_objects
raise value
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
(raylet) [2024-02-23 21:10:07,293 I 80274 33206736] logging.cc:230: Set ray log level from environment variable RAY_BACKEND_LOG_LEVEL to -1 [repeated 8x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
Versions / Dependencies
master
Reproduction script
import ray


@ray.remote
def f():
    raise BaseException("hehe")


ray.get(f.remote())
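Until this is fixed, one possible user-side workaround is to convert a non-Exception BaseException into an ordinary Exception before it leaves the task body. The sketch below is plain Python; reraise_as_exception and WrappedBaseException are hypothetical names, not part of Ray, and I have not verified that it avoids the check failure in every code path (async, actor, streaming generator).

```python
import functools


class WrappedBaseException(Exception):
    """Hypothetical Exception subclass carrying a non-Exception BaseException."""


def reraise_as_exception(fn):
    """Hypothetical decorator: re-raise bare BaseExceptions as Exceptions."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Ordinary errors pass through unchanged.
            raise
        except BaseException as e:
            # Anything else is wrapped, keeping the original as __cause__.
            raise WrappedBaseException(repr(e)) from e
    return wrapper


@reraise_as_exception
def f():
    raise BaseException("hehe")
```

In the reproduction script this decorator would sit directly under @ray.remote. Note that it also wraps KeyboardInterrupt and SystemExit, which may not be what you want.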
Issue Severity
Low: It annoys or frustrates me.
I'm seeing the following issue when running ci/lint/format.sh:
$ ci/lint/format.sh
From github.com:ray-project/ray
* branch master -> FETCH_HEAD
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/flake8/checker.py", line 478, in run_ast_checks
ast = self.processor.build_ast()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/flake8/processor.py", line 225, in build_ast
return ast.parse("".join(self.lines))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/ast.py", line 50, in parse
return compile(source, filename, mode, flags,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<unknown>", line 7
from cpython.exc cimport PyErr_CheckSignals
^^^^^^^
SyntaxError: invalid syntax
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/bin/flake8", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/flake8/main/cli.py", line 22, in main
app.run(argv)
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/flake8/main/application.py", line 363, in run
self._run(argv)
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/flake8/main/application.py", line 351, in _run
self.run_checks()
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/flake8/main/application.py", line 264, in run_checks
self.file_checker_manager.run()
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/flake8/checker.py", line 323, in run
self.run_serial()
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/flake8/checker.py", line 307, in run_serial
checker.run_checks()
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/flake8/checker.py", line 589, in run_checks
self.run_ast_checks()
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/flake8/checker.py", line 480, in run_ast_checks
row, column = self._extract_syntax_information(e)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/anaconda3/lib/python3.11/site-packages/flake8/checker.py", line 465, in _extract_syntax_information
lines = physical_line.rstrip("\n").split("\n")
^^^^^^^^^^^^^^^^^^^^
AttributeError: 'int' object has no attribute 'rstrip'
This error seems to appear for any change to _raylet.pyx, including just adding an empty line. I am still trying to figure out the cause of this.
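The first traceback suggests that flake8 is passing the contents of _raylet.pyx to Python's ast.parse, which cannot parse Cython syntax. The snippet below reproduces only that SyntaxError in isolation. It does not explain why the formatter runs flake8 on the .pyx file, nor the follow-up AttributeError inside flake8.

```python
import ast

# The line flagged in the traceback; cimport is Cython syntax, not Python.
cython_line = "from cpython.exc cimport PyErr_CheckSignals\n"

try:
    ast.parse(cython_line)
    parsed = True
except SyntaxError as e:
    parsed = False
    print("SyntaxError:", e.msg)
```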