Gymnasium icon indicating copy to clipboard operation
Gymnasium copied to clipboard

[Proposal] Full Stack Trace when exceptions thrown from vectorized environments

Open aaronwalsman opened this issue 2 years ago • 3 comments

Proposal

This is slightly related to issue #177 which I submitted recently. It would be nice to at least optionally have the vector envs print the full stack trace of exceptions that are raised in worker threads rather than just the one line exception.

See the "Pitch" below for what this looks like now, and what (I think) would be better.

I would happily submit a pull request for this if I can get the following guidance:

  1. Should this always happen? (Every time an exception is raised from an environment and not caught, the full stack trace of the exception would be printed by _raise_if_errors). I personally would vote yes this should always happen, but of course defer to the core developers. It would make errors inside the vectorized envs generate a bit more output, but I think this is preferable to cluttering up the vector env constructor with an option.
  2. If it should not always happen, what should the option be named? verbose? verbose_errors?
  3. Any other guidance?

Again, I know the internals of the VectorEnvs already changing substantially, so if this is no longer relevant, feel free to ignore this.

Motivation

This would be very useful for developers who are working on custom environments and need to debug complicated failures. In my own research I have been working around this issue by wrapping all the methods of my environments in something that intercepts the exceptions and then prints the stack trace, then raises again, but it would be nice to have that happen automatically.

Pitch

This would make it nicer to develop and debug custom environments. This would especially help with obscure errors that occur infrequently. As it is now, you may run some experimental env for hours, and then it may die halfway through with an error that doesn't help you track down the source. It can be incredibly frustrating to have to add a bunch of additional error checking and rerun the job for a long time just to get more information on why it's breaking. This would help substantially by producing a full stack-trace of every error that makes its way back to the main worker thread.

For reference, the output now looks like:

ERROR: Received the following error from Worker-6: NameError: name 'blerg' is not defined
ERROR: Shutting down Worker-6.

Preferably, it would instead output something like:

ERROR: Received the following error from Worker-6:
ERROR: Traceback (most recent call last):
  File "/media/awalsman/LEGODrive/Gymnasium/gymnasium/vector/async_vector_env.py", line 647, in _worker_shared_memory
    ) = env.step(data)
  File "/media/awalsman/LEGODrive/steadfast/steadfast/env_test.py", line 25, in step
    blerg
NameError: name 'blerg' is not defined
ERROR: Shutting down Worker-6.

Alternatives

No response

Additional context

No response

Checklist

  • [X] I have checked that there is no similar issue in the repo

aaronwalsman avatar Dec 05 '22 00:12 aaronwalsman

I think this is a good idea and am not sure why this wasn't already part of the implementation. @aaronwalsman We are in the process to rewriting the whole VectorEnv for the next release and has already started behind the scenes. Would you be able to join the discord server (https://discord.gg/bnJ6kubTg6) and message me (PseudoRnd), we can discuss it more

pseudo-rnd-thoughts avatar Dec 05 '22 11:12 pseudo-rnd-thoughts

@aaronwalsman I'm looking through old issues and can't remember if we finished this. I think it should be possible to pass the exception object not just the exception message to the base async class, from there we can raise exception rather than raises Exception(message)

pseudo-rnd-thoughts avatar Aug 25 '23 09:08 pseudo-rnd-thoughts

Hi, I don't think we finished this. Sorry I'm finishing my PhD soon and things have gotten hectic. I seem to recall in some testing that passing the actual exception back through the pipe didn't work, or caused some other issues. I think the problem is that the exception keeps around a bunch of thread-specific memory or something... also, I don't think you can raise my_exception if my_exception comes from a different thread. Also also, I think it makes sense to print them individually because you may have more than one thread fail at a time, so I like the thing where the message is printed and bookended by something that identifies which thread died. That way if three things die simultaneously, you see all of them, and then the main error that's raised by the VectorEnv just explains that one or more threads died, and you should check the errors above.

aaronwalsman avatar Aug 28 '23 17:08 aaronwalsman