ray icon indicating copy to clipboard operation
ray copied to clipboard

[Core] log_to_driver=False does not suppress worker errors in ipython

Open mvashishtha opened this issue 3 years ago • 8 comments

What happened + What you expected to happen

When a ray task raises an error in an ipython session, I get an "Unhandled error" message in my session output after some delay (about 1 second). If I initialize ray with log_to_driver=False I still get the error. Setting environment variable RAY_IGNORE_UNHANDLED_ERRORS=1 does fix the error, but I think that log_to_driver should suppress all ray task errors. I know it suppresses warnings.

Error message:

2022-08-31 14:39:17,467	ERROR worker.py:399 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::raise_error() (pid=55178, ip=127.0.0.1)
  File "<ipython-input-1-4cb397045297>", line 8, in raise_error
RuntimeError

This error is relevant to Modin, which saves object ids from remote function calls. Modin users in ipython get spammed with ray task errors in addition to getting the nicer exception in the main thread caused by finally ray.geting the oid from the task that raised the error. It would be excellent for the Modin user experience if there were a way to suppress the errors. RAY_IGNORE_UNHANDLED_ERRORS=1 doesn't work because we shouldn't update the user's environment. Suppressing errors with log_to_driver=False would be great.

Versions / Dependencies

ray 2.0.0 macOS Monterey 12.4 python 3.10.4 ipython 8.4.0

Reproduction script

import ray
import time

ray.init(log_to_driver=False)

@ray.remote
def raise_error():
  raise RuntimeError()

oid = raise_error.remote()
time.sleep(5)

Issue Severity

Low: It annoys or frustrates me.

mvashishtha avatar Aug 31 '22 19:08 mvashishtha

Having the same issue and this is very annoying. log_to_driver=False does nothing and the console is full of ray system logs making it impossible to see the logs emitted by application code. How to disable Ray system logs??

ankur6ue avatar Sep 05 '22 16:09 ankur6ue

@simon-mo could you please take a look? I marked the priority as "low" because the issue doesn't affect correctness or performance, but the issue does annoy me a lot.

mvashishtha avatar Sep 06 '22 15:09 mvashishtha

@mvashishtha thanks for the ping. This is partially the expected behavior because log_to_driver is about your (user side) worker's print and logs. The unhandled error is propagated through the Ray system logs. There might be a place to silence it, or apply this flag. @scv119 can you help triage?

simon-mo avatar Sep 06 '22 18:09 simon-mo

@scv119 could you please take a look?

mvashishtha avatar Sep 08 '22 16:09 mvashishtha

@scv119 could you please take a look?

In my opinion the worker errors are significantly degrading the experience of using Modin on Ray.

mvashishtha avatar Sep 19 '22 17:09 mvashishtha

did some investigation. as @simon-mo mentioned this seems to be the right behavior. You can suppress the warning by setting os.environ["RAY_IGNORE_UNHANDLED_ERRORS"]="1" before ray.init

scv119 avatar Sep 21 '22 16:09 scv119

@scv119 why do you say this is the right behavior? The documentation of log_to_driver says "If true, the output from all of the worker processes on all nodes will be directed to the driver.". Error output from nodes is output, just as warnings are. I think the right behavior would be to suppress the error output.

mvashishtha avatar Sep 21 '22 17:09 mvashishtha

@mvashishtha in Ray, unhandled errors are treated different from output.

Specifically, if a task throws an unhandled error, the error message will show up in the output of the caller. In your case, the caller happened to be in the driver, so log_to_driver=False doesn't suppress the message. Instead, RAY_IGNORE_UNHANDLED_ERRORS env is intended to suppress unhandled errors show up in the output of the caller.

I think there is an option to create a different flag to only control this behavior (i.e. don't print unhandled error into output) only for driver... but that's a public API change. @rkooo567 @jjyao thought?

scv119 avatar Sep 21 '22 23:09 scv119

Some context: https://github.com/ray-project/ray/pull/15432.

@mvashishtha want to make sure I understand it correctly: are you saying those errors are eventually handled by calling ray.get(), it just happens a while (5s) after the error is generated? Is my understanding correct.

jjyao avatar Sep 22 '22 18:09 jjyao

@jjyao this error happens when object is being GCed, even ray.get() is not called.

scv119 avatar Sep 22 '22 18:09 scv119

IIUC, currently the error will be printed out even if we handle the error (e.g. ray.get()) but just not fast enough. I want to confirm if this is the case for them.

jjyao avatar Sep 22 '22 19:09 jjyao

@jjyao in my original script, if I add ray.get(oid) to the end of the script, I get a nice stack trace from the handled error:

Stack trace
---------------------------------------------------------------------------
RayTaskError(RuntimeError)                Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 ray.get(oid)

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
    103     if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104         return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File ~/opt/anaconda3/envs/modin-dev/lib/python3.10/site-packages/ray/_private/worker.py:2275, in get(object_refs, timeout)
   2273     worker.core_worker.dump_object_store_memory_usage()
   2274 if isinstance(value, RayTaskError):
-> 2275     raise value.as_instanceof_cause()
   2276 else:
   2277     raise value

RayTaskError(RuntimeError): ray::raise_error() (pid=50904, ip=127.0.0.1)
  File "<ipython-input-1-4cb397045297>", line 8, in raise_error
RuntimeError

but I also get a message from the worker about the error, and I don't want to see that.

If I get the result immediately, i.e. replace oid = raise_error.remote() with print(ray.get(raise_error.remote())) then I just get the traceback and I don't get the message from the worker task.

Please let me know whether this response answers your question.

mvashishtha avatar Sep 22 '22 19:09 mvashishtha

@mvashishtha Thanks for the response. That matches my understanding as well.

I think the problem we need to fix is when the error is actually handled (via ray.get()) we shouldn't print a message about unhandled error.

Normally we only print the unhandled message when the error is actually not handled before it's GCed. However we do something special here for iPython due to the reason mentioned in https://github.com/ray-project/ray/pull/15432. For iPython if the error is not handled within 5s (kind of an arbitrary number), we will print the message. I can definitely see why it's not desirable for you since the error is eventually handled so it's not really an unhandled error.

We will explore possible solutions and get back to you.

jjyao avatar Sep 22 '22 19:09 jjyao

Btw, if you run your program non-interactively (e.g python program.py), you shouldn't see the message.

jjyao avatar Sep 22 '22 19:09 jjyao

to clarify: the error message show up when an object (with exception result) get deallocated, even if it's not in interactive mode. so following code will show the error message for both python/ipython, regardless of interactive or script mode

import ray
import time

ray.init(log_to_driver=False)

@ray.remote
def raise_error():
  raise RuntimeError()

# error message show up when the result from raise_error being gced
raise_error.remote()
time.sleep(10)

the reason the original snipped doesn't print error for python but show up for ipython, seems to me to be a race condition between program shutdown and GC :/

scv119 avatar Sep 23 '22 00:09 scv119

Case 1: an exception object is GCed before it's accessed/handled, then it's an unhandled error so we will have the error message regardless of interactive or script mode. (@scv119 this is the case you described)

Case 2: an exception object is accessed/handled within 5s after it's created, then we won't have the error message in either mode.

Case 3: an exception object is accessed/handled 5s after it's created, then interactive mode will show the error message but script mode won't. (this is the case @mvashishtha encountered and we should see how to fix).

jjyao avatar Sep 23 '22 03:09 jjyao