AtomVM icon indicating copy to clipboard operation
AtomVM copied to clipboard

Generate proper stacktrace when calling reraise in Elixir

Open mat-hek opened this issue 4 months ago • 5 comments

👋 @bettio @pguyot I need your help on this one, as I'm not sure if it doesn't break anything, and it relies on at least one hack, described in the comments. Also, I'm not sure how to include the reproduction (snippet below) in the tests - I can translate it to Erlang, but it requires defining several modules and each test is currently a single module AFAIK.

The problem occurs when running the Elixir compiler and can be reproduced in the following way:

defmodule Raiser do
  def raise_error() do
    raise "foo"
  end
end

defmodule IntermediateCaller do
  def call_raiser() do
    Raiser.raise_error()
    :ok
  end
end

defmodule Reraiser do
  def reraise_error() do
    try do
      IntermediateCaller.call_raiser()
    rescue
      e -> reraise e, __STACKTRACE__
    end

    :ok
  end
end

try do
  Reraiser.reraise_error()
rescue
  e -> {e, __STACKTRACE__}
end

Currently, the stacktrace returned only includes Reraiser.reraise_error/0. After the fixes in this PR, it also includes IntermediateCaller.call_raiser/0 and Raiser.raise_error/0, as it should.

I found out that the reason is that nif_erlang_raise ignores its third argument. The problem only occurs when the functions are in separate modules - when they're in one module, everything works as expected, so I suppose the compiler generates op_raise instead of nif_erlang_raise in that case (?).

These changes are made under both the "Apache 2.0" and the "GNU Lesser General Public License 2.1 or later" license terms (dual license).

SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later

mat-hek avatar Aug 20 '25 08:08 mat-hek

You definitely can have a test with several modules. We already do. See catch_from_other_module.erl and raise_badmatch.erl.

pguyot avatar Aug 20 '25 20:08 pguyot

@pguyot Ok, I added the test. As it turns out, for reproduction it's necessary to wrap erlang:raise with erlang:error, as Elixir's reraise does, or basically do some calls before erlang:raise

mat-hek avatar Aug 25 '25 12:08 mat-hek

Let's try to summarize everything a little bit:

  1. I think this PR is a workaround for a simple reason: as far as I understand, registers are not meant to hold error information while returning from the function that raised the error. So it is likely building on something quite wrong.

This is likely a wrong pattern:

  70 #define RAISE(a, b)  \
  71     ctx->x[0] = (a); \
  72     ctx->x[1] = (b); \
  73     return term_invalid_term();

We should store relevant information as part of Context, so they don't get clobbered by mistake.

I think this kind of workaround, that @mat-hek did, explains pretty well the issue (I think) we have:

static bool is_exception_class(term t)
{
    return t == ERROR_ATOM || t == LOWERCASE_EXIT_ATOM || t == THROW_ATOM;
}
[...]
if (stacktrace_is_built(stacktrace)) {
    //  FIXME: This is a temporary solution as in some niche cases of reraise the x_regs[0] is
    //  overwritten and it does not represent exception class
    if (!is_exception_class(x_regs[0])) {
        x_regs[0] = ERROR_ATOM;
    }

@mat-hek by the way, do you have any code sample where this corner case is visible and reproducible?

  1. Any drawback of adding additional fields to Context?

I think we should be able to manage this adding to Context something like:

context_flags_t flags; // error, exit, throw, ... (space for more flags)
term reason;
term stacktrace;

In the future we might even add:

term value;

This can be set from VALIDATE_VALUE macro for improved debugging. Also avoiding overwriting x[0]...x[2] registers, allows us to have improved stacktraces, than can contain function arguments and so on.

  1. What we should do now?

I suggest finding if merging this PR (as a temporary workaround) is doable. Can we update this PR to JIT changes with a reasonable effort? If so, Popcorn can benefit this, since they can use an updated AtomVM. Otherwise it would be hard for them to stay on par with upstream (until we find some kind of better error management).

Also, I really think we should instead revamp error handling as soon as we have less moving parts on JIT side.

@pguyot your opinion is important here

Comments?

bettio avatar Oct 02 '25 17:10 bettio

@mat-hek by the way, do you have any code sample where this corner case is visible and reproducible?

@bettio To be clear, this problem only occurs if you bring the changes from this PR without the fix. The problem is reproducible with popcorn, see this PR. It would require running popcorn master with fissionvm without the fix. I can set it up for you if needed.

Also, this PR doesn't fix any critical bugs - it just improves generated stacktraces, thus we can run popcorn without it. It'll be harder to debug though, so it may delay abandoning fissionvm indeed.

mat-hek avatar Oct 03 '25 10:10 mat-hek

@bettio BTW, if you decide on how error handling should be eventually implemented and provide some guidance, we can try to contribute it ;)

mat-hek avatar Oct 06 '25 14:10 mat-hek