[Core] log_to_driver=False does not suppress worker errors in ipython
What happened + What you expected to happen
When a ray task raises an error in an ipython session, I get an "Unhandled error" message in my session output after some delay (about 1 second). If I initialize ray with log_to_driver=False I still get the error. Setting environment variable RAY_IGNORE_UNHANDLED_ERRORS=1 does fix the error, but I think that log_to_driver should suppress all ray task errors. I know it suppresses warnings.
Error message:
2022-08-31 14:39:17,467 ERROR worker.py:399 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::raise_error() (pid=55178, ip=127.0.0.1)
File "<ipython-input-1-4cb397045297>", line 8, in raise_error
RuntimeError
This error is relevant to Modin, which saves object ids from remote function calls. Modin users in ipython get spammed with ray task errors in addition to getting the nicer exception in the main thread caused by finally ray.geting the oid from the task that raised the error. It would be excellent for the Modin user experience if there were a way to suppress the errors. RAY_IGNORE_UNHANDLED_ERRORS=1 doesn't work because we shouldn't update the user's environment. Suppressing errors with log_to_driver=False would be great.
Versions / Dependencies
ray 2.0.0 macOS Monterey 12.4 python 3.10.4 ipython 8.4.0
Reproduction script
import ray
import time
ray.init(log_to_driver=False)
@ray.remote
def raise_error():
raise RuntimeError()
oid = raise_error.remote()
time.sleep(5)
Issue Severity
Low: It annoys or frustrates me.
Having the same issue and this is very annoying. log_to_driver=False does nothing and the console is full of ray system logs making it impossible to see the logs emitted by application code. How to disable Ray system logs??
@simon-mo could you please take a look? I marked the priority as "low" because the issue doesn't affect correctness or performance, but the issue does annoy me a lot.
@mvashishtha thanks for the ping. This is partially the expected behavior because log_to_driver is about your (user side) worker's print and logs. The unhandled error is propagated through the Ray system logs. There might be a place to silence it, or apply this flag. @scv119 can you help triage?
@scv119 could you please take a look?
@scv119 could you please take a look?
In my opinion the worker errors are significantly degrading the experience of using Modin on Ray.
did some investigation. as @simon-mo mentioned this seems to be the right behavior. You can suppress the warning by setting os.environ["RAY_IGNORE_UNHANDLED_ERRORS"]="1" before ray.init
@scv119 why do you say this is the right behavior? The documentation of log_to_driver says "If true, the output from all of the worker processes on all nodes will be directed to the driver.". Error output from nodes is output, just as warnings are. I think the right behavior would be to suppress the error output.
@mvashishtha in Ray, unhandled errors are treated different from output.
Specifically, if a task throws an unhandled error, the error message will show up in the output of the caller. In your case, the caller happened to be in the driver, so log_to_driver=False doesn't suppress the message. Instead, RAY_IGNORE_UNHANDLED_ERRORS env is intended to suppress unhandled errors show up in the output of the caller.
I think there is an option to create a different flag to only control this behavior (i.e. don't print unhandled error into output) only for driver... but that's a public API change. @rkooo567 @jjyao thought?
Some context: https://github.com/ray-project/ray/pull/15432.
@mvashishtha want to make sure I understand it correctly: are you saying those errors are eventually handled by calling ray.get(), it just happens a while (5s) after the error is generated? Is my understanding correct.
@jjyao this error happens when object is being GCed, even ray.get() is not called.
IIUC, currently the error will be printed out even if we handle the error (e.g. ray.get()) but just not fast enough. I want to confirm if this is the case for them.
@jjyao in my original script, if I add ray.get(oid) to the end of the script, I get a nice stack trace from the handled error:
Stack trace
---------------------------------------------------------------------------
RayTaskError(RuntimeError) Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 ray.get(oid)
File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
103 if func.__name__ != "init" or is_client_mode_enabled_by_default:
104 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)
File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/ray/_private/worker.py:2275, in get(object_refs, timeout)
2273 worker.core_worker.dump_object_store_memory_usage()
2274 if isinstance(value, RayTaskError):
-> 2275 raise value.as_instanceof_cause()
2276 else:
2277 raise value
RayTaskError(RuntimeError): ray::raise_error() (pid=50904, ip=127.0.0.1)
File "<ipython-input-1-4cb397045297>", line 8, in raise_error
RuntimeError
but I also get a message from the worker about the error, and I don't want to see that.
If I get the result immediately, i.e. replace oid = raise_error.remote() with print(ray.get(raise_error.remote())) then I just get the traceback and I don't get the message from the worker task.
Please let me know whether this response answers your question.
@mvashishtha Thanks for the response. That matches my understanding as well.
I think the problem we need to fix is when the error is actually handled (via ray.get()) we shouldn't print a message about unhandled error.
Normally we only print the unhandled message when the error is actually not handled before it's GCed. However we do something special here for iPython due to the reason mentioned in https://github.com/ray-project/ray/pull/15432. For iPython if the error is not handled within 5s (kind of an arbitrary number), we will print the message. I can definitely see why it's not desirable for you since the error is eventually handled so it's not really an unhandled error.
We will explore possible solutions and get back to you.
Btw, if you run your program non-interactively (e.g python program.py), you shouldn't see the message.
to clarify: the error message show up when an object (with exception result) get deallocated, even if it's not in interactive mode. so following code will show the error message for both python/ipython, regardless of interactive or script mode
import ray
import time
ray.init(log_to_driver=False)
@ray.remote
def raise_error():
raise RuntimeError()
# error message show up when the result from raise_error being gced
raise_error.remote()
time.sleep(10)
the reason the original snipped doesn't print error for python but show up for ipython, seems to me to be a race condition between program shutdown and GC :/
Case 1: an exception object is GCed before it's accessed/handled, then it's an unhandled error so we will have the error message regardless of interactive or script mode. (@scv119 this is the case you described)
Case 2: an exception object is accessed/handled within 5s after it's created, then we won't have the error message in either mode.
Case 3: an exception object is accessed/handled 5s after it's created, then interactive mode will show the error message but script mode won't. (this is the case @mvashishtha encountered and we should see how to fix).