pybind11 icon indicating copy to clipboard operation
pybind11 copied to clipboard

[BUG]: pybind11 2.10.0 - error_fetch_and_normalize changes conflicts with PyPy behavior for PyErr_NormalizeException

Open jbarlow83 opened this issue 1 year ago • 15 comments

Required prerequisites

  • [X] Make sure you've read the documentation. Your issue may be addressed there.
  • [X] Search the issue tracker and Discussions to verify that this hasn't already been reported. +1 or comment there if it has.
  • [X] Consider asking first in the Gitter chat room or in a Discussion.

Problem description

PyPy 3.8 triggers the newly introduced error message

https://github.com/pybind/pybind11/blob/59f03ee389c283cde65bd800c8f32ea690daf3fd/include/pybind11/pytypes.h#L459-L469

MISMATCH of original and normalized active exception types: ORIGINAL OSError REPLACED BY FileNotFoundError: [Errno 2] No such file or directory: 'this_file_does_not_exist'

In PyPy, PyErr_NormalizeException causes the "raw" OSError to be normalized into FileNotFoundError. Which seems fair enough.

It's unclear to me whether PyPy is doing something "naughty" here that violates the expectations set by CPython, or if this new exception handling code is just imposing an expectation that doesn't hold for PyPy. As such I don't know if simply disabling that test is appropriate for PyPy.

CPython does not make the same substitution. There is no error (and the code behaves correctly) with pybind11 2.9.

Reproducible example code

    m.def("test_pypy_oserror_normalization", []() {
        // We want to do "with suppress(FileNotFoundError): open(f)" in C++
        try {
            py::module_::import("io").attr("open")("this_file_does_not_exist", "r");
        } catch (const py::error_already_set &e) {
            // FileNotFoundError
        }
    });

jbarlow83 avatar Jul 17 '22 03:07 jbarlow83

Ping @rwgk

Skylion007 avatar Jul 17 '22 04:07 Skylion007

As such I don't know if simply disabling that test is appropriate for PyPy.

Do you mean #ifdefing out the quoted code? That would break a unit test or two, but the pybind11 test code is already badly cluttered with xfail for PyPy, a couple more is fine with me.

The main reason for the code quoted in the OP is to catch "errors in the error handling" as close to the root cause as possible. See the corresponding unit test:

https://github.com/pybind/pybind11/blob/59f03ee389c283cde65bd800c8f32ea690daf3fd/tests/test_exceptions.py#L311

Unfortunately there is no good way to know if the change of exception type is the result of a bug, or working as intended.

My opinion: changing the exception type during normalization is evil. If something is counting on the exception to get normalized, it can do so straightaway. Otherwise, client code will have to accommodate two exception types for the same exception, depending on the control flow. I feel that's like pulling out the rug from underneath people's feet. — Another line of thought: deferred normalization was meant to be an optimization. An optimization is an optimization only if the rest of the logic/control flow works identically with and without.

From the perspective of someone wanting to keep a large-scale environment healthy, I'm very reluctant to remove the quoted code. It is far more likely to catch bugs than to block something good and valid that could not easily be achieved differently. I believe, in this particular case, what's in the interest of keeping a large-scale environment healthy, is also in the interest of the world at large. The quoted code helps catching bugs closer to the root cause, saving a lot of time debugging.

rwgk avatar Jul 17 '22 19:07 rwgk

I read your comment and examined the code for PyErr_NormalizeException in both CPython and PyPy.

In CPython, PyErr_NormalizeException permits an exception to change its class when normalizing. Nothing about it is treated as an error or warning; it's permitted. After all, any class that implements __new__ might return a different object than its expected instance.

https://github.com/python/cpython/blob/07aeb7405ea42729b95ecae225f1d96a4aea5121/Python/errors.c#L333-369

I agree it's anti-pattern to change the exception type, but I don't think pybind11 is the right tool to be enforcing this rule, especially by treating as a fatal error. It happens to be, as far as I can tell, an implementation detail that PyPy uses NormalizeException to transform OSError into its specific subclasses - CPython doesn't for that code path, but there may be other places where the standard library does similar things. It happens to be the first case that cropped up, but there's a lot of third party Python and C extension libraries that may depend on NormalizeException working the way it normally does in CPython, and will be tripped up if it happens to pass through the pybind11 exception handler.

I think we can also agree that it's better to avoid #ifdef PYPY in pybind11 when it's avoidable, and I think there may be a way:

  • Permit exceptions to change to a subclass of the original type after normalization (which allows pybind11 to do its thing)
  • Fail if the exception type changes to an unrelated class, since this case is more likely to be a bug

jbarlow83 avatar Jul 17 '22 22:07 jbarlow83

pybind11 is the right tool to be enforcing this rule, especially by treating as a fatal error

Maybe there is a misunderstanding, or another bug in your environment (PyPy)? — It's not fatal, but a std::runtime_error that you can catch. See unit test, which does exactly that:

https://github.com/pybind/pybind11/blob/59f03ee389c283cde65bd800c8f32ea690daf3fd/tests/test_exceptions.py#L308

Oh!

https://github.com/pybind/pybind11/blob/59f03ee389c283cde65bd800c8f32ea690daf3fd/tests/test_exceptions.py#L306

Of course, it's right there!

Note that I ran Google-global testing, exercising 800+ pybind11 extensions many (maybe most, idk) 3rd party, usually that is very conclusive. We've been using PR #1985 Google-internally since June 2 and there we no complaints at all. Let's wait a couple more weeks to see how many more complaints we get. If there is too much pushback, we could make this opt-in vs always enabled. At the moment this bug is the only complaint I know about.

  • Permit exceptions to change to a subclass of the original type after normalization (which allows pybind11 to do its thing)
  • Fail if the exception type changes to an unrelated class, since this case is more likely to be a bug

Sounds good. Do you want to try?

Nothing about it is treated as an error or warning; it's permitted.

I believe they didn't think it through. They wanted to optimize, but forgot to handle the distinction between "accident/bug" and "intent". While working on #1895 I searched the documentation for PyErr_Occurred() conditions, then the source code, and was very surprised to see that that is not handled at all. "... optimization is the root of all evil." In this case not so much "premature" (Knuth), but "rushed", because they opened the door wide to masking errors in the error handling, which I can tell you from first-hand experience, can be a terrible time sink; weeks worth lost of my time, and that's just me.

rwgk avatar Jul 18 '22 00:07 rwgk

FYI: I'm testing #4079 , but it's late here. I'll try to finish the PR asap and will explain then.

rwgk avatar Jul 18 '22 08:07 rwgk

pybind11 is the right tool to be enforcing this rule, especially by treating as a fatal error

Maybe there is a misunderstanding, or another bug in your environment (PyPy)? — It's not fatal, but a std::runtime_error that you can catch. See unit test, which does exactly that:

https://github.com/pybind/pybind11/blob/59f03ee389c283cde65bd800c8f32ea690daf3fd/tests/test_exceptions.py#L308

Oh!

https://github.com/pybind/pybind11/blob/59f03ee389c283cde65bd800c8f32ea690daf3fd/tests/test_exceptions.py#L306

Of course, it's right there!

No, false!

The claim that we are treating a change in exception type "as a fatal error" completely got me on the wrong track. It turns out to be a clean RuntimeError in all PyPy CI jobs, that preserves all the existing information, but explains that the exception type has changed:

https://github.com/pybind/pybind11/runs/7385124173

You could adjust to that by catching RuntimeError in addition to FileNotFoundError. That's a very small inconvenience compared to the huge gain in safety everywhere else (original errors are no longer masked by errors in the error handling).

If PyPy is then changed to normalize the error immediately, instead of relying on pybind11 to do that for PyPy, the world at large will be in a better state than it was before.

Working on this bug reminded me that this

https://github.com/pybind/pybind11/blob/59f03ee389c283cde65bd800c8f32ea690daf3fd/tests/test_exceptions.py#L306

was needed even before #1895 was merged. Fixing the root cause for that segfault would definitely be a good thing. As-is, PyPy appears to mask errors in the error handling with a segfault.

rwgk avatar Jul 18 '22 08:07 rwgk

Apologies about the "fatal error" comment - I mistook pybind_fail for meaning that the interpreter is terminated.

For pybind11 2.10 replaces an OSError from PyPy with an RuntimeError - I can't agree this is a clean solution. It replaces with the expected exception, an OSError, with a RuntimeError. That is an API breaking change and a regression from 2.9. I don't think it's reasonable to ask client code to introduce special, hacky, PyPy-specifc exception handling everywhere until PyPy addresses hte underlying issue.

I agree with it does make sense to break client code when it's clearly doing something wrong, dangerous and most importantly fixable by client code, but that's not the case here. The code that's presumably broken is in PyPy (or perhaps cpyext). That segfault is also worrying. But client code can't fix the fundamental issues since those issues aren't in client code. I don't think it makes sense to punish client code over the issue.

Everyone who supports PyPy also supports CPython, so any code is raising denormalized polymorphic exceptions it will get flagged by the MISMATCH test on the CPython side. That will prove safety and allow PyPy to work.

I think that leaves two options:

  • report the issue to PyPy/cpyext and for now, use an #ifdef to drop the MISMATCH test from PyPy compiles
  • disable PyPy support from pybind11 if we're that uncomfortable with its software engineering decisions around exceptions

jbarlow83 avatar Jul 19 '22 07:07 jbarlow83

@rwgk Could this segfault be related to the fact PyErr_NormalizeException decrefs the last two arguments, and we don't inrec them? See this comment: https://github.com/pybind/pybind11/pull/1895#issuecomment-526273100 . It also demonstrates the very small perf penalty of normalizing the eagerly if that would solve the issue.

Skylion007 avatar Jul 19 '22 15:07 Skylion007

@rwgk Could this segfault be related to the fact PyErr_NormalizeException decrefs the last two arguments, and we don't inrec them?

Hm... I think we have been calling PyErr_NormalizeException correctly, before and after PR #1895, that aspect didn't change. Any manipulations of the refcount would be PyPy workarounds. I'm not sure it's worth the time experimenting with "something that avoids the PyPy segfault but does not introduce refcount leaks". Someone with a vested interested in PyPy (not me) debugging PyPy would seem like a better use of collective time.

See this comment: #1895 (comment) . It also demonstrates the very small perf penalty of normalizing the eagerly if that would solve the issue.

There was a lot of back and forth on this (even just in my own thinking, but also between reviewers), but in the end we settled on immediate normalization:

https://github.com/pybind/pybind11/blob/ef7d971e036e4675ca266b2c990f0fd65c99e360/include/pybind11/pytypes.h#L425-L430

rwgk avatar Jul 19 '22 19:07 rwgk

I think that leaves two options:

  • report the issue to PyPy/cpyext and for now, use an #ifdef to drop the MISMATCH test from PyPy compiles

Yes. I'll do the #ifdef part, but hope someone else will do the reporting to PyPy.

  • disable PyPy support from pybind11 if we're that uncomfortable with its software engineering decisions around exceptions

No! :-)

My interest: guard against masking errors in the error handling, because I know from direct experience how bad that can be ("extremely expensive to debug").

But I don't want to break anything that doesn't share that interest, if I can help it, which I believe the #ifdef will do.

For pybind11 2.10 replaces an OSError from PyPy with an RuntimeError - I can't agree this is a clean solution.

I never meant to suggest it is a clean solution, I know it is not. The root of the problem is the missing intent vs bug distinction in PyErr_NormalizeException, leaving us stuck between a rock and a hard place:

  • Original errors getting masked by errors in the error handling (see unit test) == extremely expensive debugging that, in a large scale environment, or the world as a whole, will happen at a certain rate.
  • Turn a change of exception type into a noisy failure.
  • Or something smarter as you suggested before (inheritance), but that's a lot of added complexity, compared to pushing back with the noisy failure, which is basically a way of saying: please normalize such cases immediately yourself (rather than relying on pybind11 to do it for you).

Of course that's no good for someone merely wanting to use pybind11 & PyPy in combination, but not wanting to get into PyPy to add the immediate normalization there. I think the #ifdef is a good compromise.

rwgk avatar Jul 19 '22 20:07 rwgk

Sorry for delay in respond. LGTM.

I'll submit a quick PR to clarify the comments many types of OSError may do this, not just FileNotFoundError.

jbarlow83 avatar Jul 22 '22 03:07 jbarlow83

Reported to PyPy over here. https://foss.heptapod.net/pypy/pypy/-/issues/3786

jbarlow83 avatar Jul 22 '22 03:07 jbarlow83

Hi. I am really confused. It seems there are at least two issues here:

  • Some C-API call to open a file is setting an OSError on PyPY and a FileNotFoundError on CPython
  • pybind11 calls PyErr_NormalizeException and there is some inconsistency in what PyPy does compared to what CPython does.

Do I understand correctly up to here? If so read on. If not please correct me.

The first one is a PyPy bug, and PyPy (i.e. me) should fix it. Do you know what C-API call is being used?

The second one seems to be that PyPy's implementation of PyErr_NormalizeException will unconditionally create a new error object from type and value, and does not follow the "If the values are already normalized, nothing happens" part of the spec. Is that right?

mattip avatar Jul 22 '22 06:07 mattip

Hi. I am really confused. It seems there are at least two issues here:

  • Some C-API call to open a file is setting an OSError on PyPY and a FileNotFoundError on CPython

I'm not too sure about the exact internal steps, too, but yes, at the moment when py::error_already_set is constructed:

  • C-Python exc_type is FileNotFoundError
  • PyPy exc_type is OSError

The py::error_already_set constructor calls PyErr_Fetch(), followed immediately by PyErr_NormalizeException(). After the latter:

  • C-Python exc_type is still FileNotFoundError (unchanged)
  • PyPy exc_type is still FileNotFoundError (changed)

It was "always" that way, but PR #1895 changed the behavior of pybind11, for safety. It now does this:

https://github.com/pybind/pybind11/blob/9a2963734de309a6d3f4ffb3e43bb56713a7ccae/include/pybind11/pytypes.h#L483-L485

Instead of the FileNotFoundError, a RuntimeError with that message is raised, explaining what happened. This is meant to catch bugs. See prior comments here for more background.

The first one is a PyPy bug, and PyPy (i.e. me) should fix it. Do you know what C-API call is being used?

I never looked deeper than what I described above.

The second one seems to be that PyPy's implementation of PyErr_NormalizeException will unconditionally create a new error object from type and value, and does not follow the "If the values are already normalized, nothing happens" part of the spec. Is that right?

I don't think that ever was a problem for pybind11, before and after PR #1895.

From my limited understanding / biased viewpoint: PyPy seems to "exploit" PyErr_NormalizeException() to arrive at FileNotFoundError. If PyPy was changed to call PyErr_NormalizeException() "as soon as possible", and not rely on pybind11 to do that as a side-effect (of using py::error_already_set in the processing of the error), everything would be fine.


A related but separate PyPy problem that existed before and still exists after PR #1895:

https://github.com/pybind/pybind11/blob/master/tests/test_exceptions.py#L306

It would be great for PyPy users if that got fixed. As-is, errors in the error handling (the unit test) lead to a PyPy segfault. I'm guessing that could be very time consuming to debug, especially if the error condition is difficult to reproduce, which I've seen quite often for production jobs (as a general observation, I don't think Google is using PyPy in production).

rwgk avatar Jul 22 '22 07:07 rwgk

PyPy now emits the proper normalized error in nightlies, and this will be part of the 7.3.10 release. Thanks for pointing it out.

mattip avatar Jul 26 '22 16:07 mattip

I added a PyPy-nightly run to the binary-run repo where I test c-extension packages. ubuntu and windows are passing, macos is failing this test. I wonder what is different about macOS. Using clang instead of gcc/MSVC maybe?

mattip avatar Oct 04 '22 12:10 mattip

I added a PyPy-nightly run to the binary-run repo where I test c-extension packages. ubuntu and windows are passing, macos is failing this test. I wonder what is different about macOS. Using clang instead of gcc/MSVC maybe?

Yes, most likely: I've only ever seen clang being used when testing pybind11 on macOS. Maybe related: Apple's versioning doesn't seem to match "normal" clang versioning. (It's one of those things I've been confused about but never got a chance to drill down.)

rwgk avatar Oct 04 '22 14:10 rwgk

@rwgk Apple's version = Clang version - 2, usually.

Skylion007 avatar Oct 04 '22 17:10 Skylion007

@rwgk Apple's version = Clang version - 2, usually.

Interesting ... thanks!

rwgk avatar Oct 04 '22 17:10 rwgk