pyo3 icon indicating copy to clipboard operation
pyo3 copied to clipboard

PyO3 Sub Interpreter Broken since 0.22.0

Open Xuanwo opened this issue 1 year ago • 12 comments

Bug Description

Arrow UDF is a User-Defined Functions Framework that allows users to easily create and run user-defined functions (UDF) on Apache Arrow.

We use pyo3 to create a sub-interpreter and run users' UDFs within it to avoid the global GIL. It used to work fine, but we discovered it broke after the 0.22 release, specifically after this commit in https://github.com/PyO3/pyo3/pull/4188.

Steps to Reproduce

I've prepared a repository: https://github.com/Xuanwo/pyo3-sub-interpreter-broken

To reproduce:

git clone https://github.com/Xuanwo/pyo3-sub-interpreter-broken
cd pyo3-sub-interpreter-broken
cargo test

Backtrace

#0  0x00007d3dd69bf880 in _PyGCHead_SET_NEXT (gc=0x7d3dd5e430c8, next=0x7d3dd014dc60) at ./Include/internal/pycore_gc.h:63
#1  _PyObject_GC_UNTRACK (op=0x7d3dd5c2e0a0) at ./Include/internal/pycore_object.h:246
#2  AttributeError_dealloc (self=0x7d3dd5c2e0a0) at Objects/exceptions.c:2315
#3  0x00005618f292c140 in pyo3_ffi::object::Py_DECREF (op=0x7d3dd5c2e0a0)
    at /home/xuanwo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-ffi-0.22.3/src/object.rs:611
#4  pyo3::gil::ReferencePool::update_counts (self=0x5618f299aed8 <pyo3::gil::POOL+16>) at src/gil.rs:311
#5  0x00005618f292c24d in pyo3::gil::GILPool::new () at src/gil.rs:424
#6  0x00005618f292b94a in pyo3::gil::GILGuard::acquire_unchecked () at src/gil.rs:229
#7  0x00005618f292b8e2 in pyo3::gil::GILGuard::acquire () at src/gil.rs:211
#8  0x00005618f28d2661 in pyo3::marker::Python::with_gil<pyo3_sub_interpreter_broken::{impl#2}::new::{closure_env#0}, ()> (f=...)
    at /home/xuanwo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.22.3/src/marker.rs:410
#9  0x00005618f28d3402 in pyo3_sub_interpreter_broken::SubInterpreter::new () at src/lib.rs:42
#10 0x00005618f28d2e45 in pyo3_sub_interpreter_broken::tests::assert_err (code=...) at src/lib.rs:161
#11 0x00005618f28d2f9e in pyo3_sub_interpreter_broken::tests::test_forbid () at src/lib.rs:156
#12 0x00005618f28d2d27 in pyo3_sub_interpreter_broken::tests::test_forbid::{closure#0} () at src/lib.rs:154
#13 0x00005618f28d30d6 in core::ops::function::FnOnce::call_once<pyo3_sub_interpreter_broken::tests::test_forbid::{closure_env#0}, ()> ()
    at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/ops/function.rs:250
#14 0x00005618f290e7ab in core::ops::function::FnOnce::call_once<fn() -> core::result::Result<(), alloc::string::String>, ()> () at library/core/src/ops/function.rs:250
#15 test::__rust_begin_short_backtrace<core::result::Result<(), alloc::string::String>, fn() -> core::result::Result<(), alloc::string::String>> ()
    at library/test/src/lib.rs:624
#16 0x00005618f290e055 in test::run_test_in_process::{closure#0} () at library/test/src/lib.rs:647
#17 core::panic::unwind_safe::{impl#23}::call_once<core::result::Result<(), alloc::string::String>, test::run_test_in_process::{closure_env#0}> ()
    at library/core/src/panic/unwind_safe.rs:272
#18 std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<test::run_test_in_process::{closure_env#0}>, core::result::Result<(), alloc::string::String>> ()
    at library/std/src/panicking.rs:557
#19 std::panicking::try<core::result::Result<(), alloc::string::String>, core::panic::unwind_safe::AssertUnwindSafe<test::run_test_in_process::{closure_env#0}>> ()
    at library/std/src/panicking.rs:521
#20 std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<test::run_test_in_process::{closure_env#0}>, core::result::Result<(), alloc::string::String>> ()
    at library/std/src/panic.rs:350
#21 test::run_test_in_process () at library/test/src/lib.rs:647
#22 test::run_test::{closure#0} () at library/test/src/lib.rs:568
#23 0x00005618f28d4dd4 in test::run_test::{closure#1} () at library/test/src/lib.rs:598
#24 std::sys::backtrace::__rust_begin_short_backtrace<test::run_test::{closure_env#1}, ()> () at library/std/src/sys/backtrace.rs:152
#25 0x00005618f28d8502 in std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure#0}<test::run_test::{closure_env#1}, ()> () at library/std/src/thread/mod.rs:538
#26 core::panic::unwind_safe::{impl#23}::call_once<(), std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::{closure_env#1}, ()>> ()
    at library/core/src/panic/unwind_safe.rs:272
#27 std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::{closure_env#1}, ()>>, ()> () at library/std/src/panicking.rs:557
#28 std::panicking::try<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::{closure_env#1}, ()>>> () at library/std/src/panicking.rs:521


### Your operating system and version

Archlinux

### Your Python version (`python --version`)

Python 3.12.6

### Your Rust version (`rustc --version`)

rustc 1.81.0 (eeb90cda1 2024-09-04)

### Your PyO3 version

> 0.22.0

### How did you install python? Did you use a virtualenv?

`pacman`, no virtualenv.

Not related.

### Additional Info

Also see: https://github.com/arrow-udf/arrow-udf/pull/70

Xuanwo avatar Sep 20 '24 12:09 Xuanwo

Furthermore, in the upcoming pyo3 release, the GILPool will be removed entirely. What alternative can we use?

Xuanwo avatar Sep 20 '24 12:09 Xuanwo

It used to work fine, but we discovered it broke after the 0.22 release, specifically after this commit in https://github.com/PyO3/pyo3/pull/4188.

Your code was always broken, this is just when you noticed it. Your code has various examples of UB, like releasing the gil when you aren't holding it, and calling python api without the gil held. Your are also not handling existing threadstate correctly, which is probably why the linked PR stopped your code from "working".

Mixing pyo3 api and subinterpreters is also not supported. My suggestion is to exclusively use pyo3-ffi. The documentation for that is at https://docs.python.org/3/c-api/init.html

mejrs avatar Sep 20 '24 13:09 mejrs

cc @wangrunji0408 do you have any ideas?

Xuanwo avatar Sep 20 '24 13:09 Xuanwo

Hi, @mejrs, thank you for your prompt response. Our use case is quite unique, as we will run users' UDFs exclusively in this sub-interpreter.

~~After some research, I found that we can use Python::with_gil_unchecked to correctly hold the GIL.~~

It seems this approach is incorrect; I didn't run that code in the sub-interpreter.

Xuanwo avatar Sep 20 '24 16:09 Xuanwo

What you are doing is inherently unsafe given PyO3's lack of support for subinterpreters, but it's possible I can help work out a subset of operations which you can do without issue. This would be entirely at your own risk, of course.

davidhewitt avatar Sep 26 '24 15:09 davidhewitt

What you are doing is inherently unsafe given PyO3's lack of support for subinterpreters, but it's possible I can help work out a subset of operations which you can do without issue. This would be entirely at your own risk, of course.

Hi, @davidhewitt, would you like to lead me a direction in finding a solution? I'm willing to help implement it in PyO3 or fix our usage.

Xuanwo avatar Oct 08 '24 09:10 Xuanwo

I would suggest looking at the problem like this:

  • sharing almost all objects between subinterpreters is unsound; PyO3's Py<T> does nothing to prevent this so if you have these objects crossing between interpreters it's probably broken. Bound<'py, T> and Borrowed<'a, 'py, T> probably allow sharing too, although the 'py lifetime will make it a bit more awkward.
  • consequently, almost all global state is probably not sound when mixed across subinterpreters. #[pyclass] currently stores the type object in global state so #[pyclass] definitely cannot be used with subinterpreters at present.
  • PyO3 also has some internal state, which we're generally working to remove, for many reasons. The GIL Refs were one example of (thread-local) global state, we also have an internal counter to detect if we hold the GIL which I suspect is also a problematic optimization.

A safe solution in PyO3 is extremely nontrivial, see #576

What we should start with, is to consider: what state are you passing into the subinterpreter? If we can isolate that to a very well defined data payload, we can probably make a very limited subinterpreter call sound.

davidhewitt avatar Oct 10 '24 06:10 davidhewitt

@davidhewitt Thank you for your suggestion. After our investigation, we found that the root cause of the issue is in the third point you mentioned—the internal GIL counter. In our code, after switching to a subinterpreter, a GILPool is created, which increments the internal counter by 1. However, after this commit, the counter remains unchanged at 0. This results in Python objects being registered in the internal queue for deferred drop instead of being dropped immediately when exiting the Python scope. When re-entering, a different subinterpreter might be used. This leads to exceptions. Unfortunately, we haven’t found any public interfaces in the latest version of PyO3 that allow us to manipulate the internal counter directly. As a result, this issue cannot be resolved for now. We are looking forward to native support for subinterpreters in PyO3 soon.

wangrunji0408 avatar Oct 10 '24 10:10 wangrunji0408

As a result, this issue cannot be resolved for now. We are looking forward to native support for subinterpreters in PyO3 soon.

From what I understand, it's a long way for PyO3 to officially support subinterpreters. I hope we can find a way to use subinterpreters unsafely for now and work together to build the final solution.

Is it possible for us to modify GILGuard or a similar API so we can manage them ourselves while switching subinterpreters?

Xuanwo avatar Oct 10 '24 12:10 Xuanwo

Is it possible for us to modify GILGuard or a similar API so we can manage them ourselves while switching subinterpreters?

What makes you think GILGuard is the problem? You should fix the bugs in your code first.

mejrs avatar Oct 10 '24 12:10 mejrs

Unfortunately, we haven’t found any public interfaces in the latest version of PyO3 that allow us to manipulate the internal counter directly. As a result, this issue cannot be resolved for now. We are looking forward to native support for subinterpreters in PyO3 soon.

You should use Python::with_gil to ensure a valid Python thread state within the subinterpreter. PyO3 will bump the internal counter as part of that call.

davidhewitt avatar Oct 11 '24 08:10 davidhewitt

I would hasten to note that the internal counter is really an implementation detail / optimization, and you should expect that what you're doing might break again with future PyO3 upgrades.

davidhewitt avatar Oct 11 '24 14:10 davidhewitt

I'm going to close this as really this is part of #576, and the exact (unsafe) stuff going on here can be discussed without needing to track an issue.

davidhewitt avatar Nov 19 '24 22:11 davidhewitt