pyo3 `Python::allow_threads` is unsound in the presence of `scoped-tls`.

trafficstars

In analogy to

#2141

we can smuggle Ungil data across Python::allow_threads using scoped-tls, too.

use pyo3::prelude::*;
use pyo3::types::PyString;
use scoped_tls::scoped_thread_local;

fn main() {
    Python::with_gil(|py| {
        let string = PyString::new(py, "foo");

        scoped_thread_local!(static WRAPPED: PyString);

        WRAPPED.set(string, || {
            py.allow_threads(|| {
                WRAPPED.with(|smuggled: &PyString| {
                    println!("{:?}", smuggled);
                });
            });
        });
    });
}

(results in segfault in my test)

Unlike #2141, this issue is virtually unsolvable, i.e. even the auto trait approach with the feature="nightly" enabled cannot catch this at all. There’s no property in the callback to allow_threads that can catch this. Really, the callback doesn’t even capture anything at all:

use pyo3::prelude::*;
use pyo3::types::PyString;
use scoped_tls::scoped_thread_local;

scoped_thread_local!(static WRAPPED: PyString);

fn callback() {
    WRAPPED.with(|smuggled: &PyString| {
        println!("{:?}", smuggled);
    });
}

fn main() {
    Python::with_gil(|py| {
        let string = PyString::new(py, "foo");

        WRAPPED.set(string, || {
            py.allow_threads(callback); // callback is an ordinary `fn() -> ()` item.
        });
    });
}

Again, there’s a “whose fault” question to be asked, whether the fact that the standard library’s thread-locals require T: 'static means there’s any guarantees that non-'static data isn’t allowed to be “smuggled” through thread-local storage. In my view, if you look at what kind of things it offers, scoped-tls seems even less unreasonable to be called “sound”, compared to sync_wrapper. Yet the consequences for Ungil are more detrimental.

Dec 11 '23 13:12 steffahn

From a first glance, I fear that saving the current approach might mean to never ever expose any GIL-bound references no matter how short-lived and have all operations explicitly require a GIL token. This would be highly unergonomic and imply that our upcoming GIL-bound smart pointers Py2<'py, T> would provide no benefit over their existing unbound counterpart Py<T> at all.

If we somehow could avoid at least storing our GIL-bound smart pointers within scoped-tls, but that library has no traits bounds at all. So I suspect this is coming down to either abandoning a safe allow_threads or socially establish that

whether the fact that the standard library’s thread-locals require T: 'static means there’s any guarantees that non-'static data isn’t allowed to be “smuggled” through thread-local storage.

should be upheld. Or at least require an extra unsafe trait bound for usage of types with scoped-tls.

In my view, if you look at what kind of things it offers, scoped-tls seems even less unreasonable to be called “sound”, compared to sync_wrapper. Yet the consequences for Ungil are more detrimental.

Better yet, there would be a case of unsoundness implied by usage of scoped-tls that does not involve the admittedly somewhat baroque architecture we built to accommodate Python in its GIL.

@steffahn Since you are obviously experienced in the finding these holes, did you ever look into scoped-tls from a blank slate perspective?

Dec 11 '23 14:12 adamreichold

From a first glance, I fear that saving the current approach might mean to never ever expose any GIL-bound references no matter how short-lived and have all operations explicitly require a GIL token. This would be highly unergonomic and imply that our upcoming GIL-bound smart pointers Py2<'py, T> would provide no benefit over their existing unbound counterpart Py<T> at all.

But this is not true, right? Because I could also use scoped-tls to smuggle the GIL token itself even if it has a generative lifetime attached?

Dec 11 '23 14:12 adamreichold

But this is not true, right? Because I could also use scoped-tls to smuggle the GIL token itself even if it has a generative lifetime attached?

Yes. At least in the extended version provided in the crate scoped-tls-hkt.

Dec 11 '23 14:12 steffahn

use pyo3::prelude::*;
use scoped_tls_hkt::scoped_thread_local;

fn main() {
    Python::with_gil(|py| {
        scoped_thread_local!(static GIL: for<'a> Python<'a>);
        GIL.set(py, || {
            py.allow_threads(|| {
                GIL.with(|smuggled_py: Python<'_>| {
                    // profit
                });
            });
        });
    });
}

Dec 11 '23 14:12 steffahn

Ouch, this is painful. Thanks @steffahn for finding such an interesting case!

My feeling here is that this is a much more severe mechanism to break allow_threads than the interaction with send_wrapper. At least with send_wrapper any smuggled data is part of the closure capture, so careful inspection around the crash could in theory identify the issue. This scoped-tls approach can enable you to have the smuggling be located in a completely separate part of the codebase to the allow_threads call, I think it would be much harder to resolve this.

It seems like the natural reaction is to declare allow_threads an unsafe function, but I think this invariant "you must be sure that scoped-tls and send-wrapper are not used to smuggle a Python marker into code invoked inside the closure" becomes extremely hard to uphold. Still, at this point it's likely a necessary change unless we can find a way to resolve the API.

What would a "safe" solution look like? Can we somehow express Python as a borrow and py.allow_threads takes an exclusive borrow, thus preventing all gil-attached data and smuggling? Maybe, but I suspect that at least some of PyO3's existing APIs need changing to achieve this.

Alternatively, a future approach might involve contexts, where allow_threads removes the context, but this is obviously not available to us at present.

Dec 12 '23 10:12 davidhewitt

What would a "safe" solution look like? Can we somehow express Python as a borrow and py.allow_threads takes an exclusive borrow, thus preventing all gil-attached data and smuggling? Maybe, but I suspect that at least some of PyO3's existing APIs need changing to achieve this.

I think this would basically be equivalent to removing the Clone and Copy impls for Python and making it invariant over the lifetime? But I suspect especially the invariance part (requiring a "reborrowing" operation) would be quite painful from an ergonomics point of view?

Dec 12 '23 10:12 adamreichold

At least that I think, yes. I think it might also have implications for .py() methods to get a copy of Python, unless they are also expressed as borrows on &Python (or similar), but I haven't thought this through.

My feeling is that coming up with a workable proposal for that is going to need someone to dive deep into this for at least a few days.

How do you feel about accepting Python::allow_threads as unsafe in the short-mid term? Hopefully we can come up with a long-term solution to make it safe again.

Do we think this has implications for the proposed Py2<'_, PyAny> smart pointer? I think if we make allow_threads unsafe then there's no safety concerns about that API, and we are already aware that being able to assume the GIL is held is a significant ergonomics win, especially in trait implementations, so I believe that API still makes sense.

Dec 12 '23 10:12 davidhewitt

(Given the Py2<'_, PyAny> change is mostly motivated by getting rid of the pool, and thus leading users to have proper treatment of ownership in their code, I think that it's a correct evolution of the API even if this issue might imply that it will eventually need to change again.)

Dec 12 '23 10:12 davidhewitt

How do you feel about accepting Python::allow_threads as unsafe in the short-mid term? Hopefully we can come up with a long-term solution to make it safe again.

My gut feeling is to not rush this (after all, we having been living with the send_wrapper soundness hole for quite some time). The reasons I see for keeping the status quo for now are:

I am not sure whether scoped-tls poses a problem in itself, not just for PyO3. But I have not yet had time to look into constructing an issue without involving PyO3.
As you wrote, it is basically impossible to check the required invariants if scoped-tls is involved due to the non-local reasoning required. (This non-locality is basically where my gut feeling that scoped-tls is generally a bad idea comes from.)
I fear that marking allow_threads unsafe will "infect" a lot of code due to the closure-based API by effectively making the closure body an unsafe block as well, c.f. this playground. Hence, I think the resulting API will lead to more actual errors than the existing soundness issues.

Dec 12 '23 11:12 adamreichold

Sure thing. The last point in particular that unsafe { } will infect the closure is a very good point which I had not considered, agreed that can also create problems.

Dec 12 '23 11:12 davidhewitt

Just for the sake of throwing out another middle-of-the-road solution, what if the method was renamed to unsafe_allow_threads for now? That would avoid the infective problem while also establishing some sort of social contract via the use of the term "unsafe".

I can see a very mixed reaction from the community to doing that (e.g. we might be accused of shipping code which doesn't trip #![forbid(unsafe_code)]), but it might be a practical choice.

Dec 12 '23 11:12 davidhewitt

FYI, I came across this issue from a context of discussing the role of Send/Sync on IRLO. Asynchronous tasks have a similar problem as Ungil, as they would love to behave more like threads in putting up Send/Sync-like barriers, in order to allow non-thread-safe API (like Cell or Rc) to be used task-internally. But thread-locals (in that case thread-locals in general, not just scoped ones) are a major problem for that.

In my reply in that thread I argued that a Send-like trait for non-physical-thread-based memory boundaries, i.e. one that cannot be circumvented thread-internally with any form of thread locals, would be sufficient to offer a sound allow_threads, too. Arguably, one that has the same issue of overly cautiously requiring non-thread-safe data not to pass into the closure at all, not just non-python-memory-involving data, but at least it could be sound.

Of course that’s not any form of short-term solution; landing such a language feature would be a huge undertaking, unless we find a simpler way of keeping it backwards compatible.

Regarding invariance as a solution, look at this comment and my answer, which is an essentially-invariant situation. Invariant lifetimes can always be hidden behind a shorter covariant one using trait objects.

Regarding the unsafe scoping, I think that’s an unfortunate effect of current unsafe block syntax. One can employ workarounds such as requiring an unsafe-to-construct token to be passed to allow_threads like

py.allow_threads(unsafe { marker_name_tbd() }, || { … })

Dec 12 '23 11:12 steffahn

Thank you, I'll try to digest that thread in the near future!

The unsafe-to-construct token is probably the better workaround than my unsafe_allow_threads suggestion 😂

Dec 12 '23 11:12 davidhewitt

Ok, one most likely idiotic idea. Thread-locals in whatever form currently break the isolation we require by allow_threads. Could we try to side step this by always executing the closure passed to allow_threads on a separate runtime thread (sufficiently long-lived to amortize the startup cost)?

Since we (at least on stable) already require the closure to be Send, shouldn't this plug both soundness holes? Of course, most likely at prohibitive performance cost...

EDIT: Maybe we would need a completely need thread for soundness? Or we would need to track whether the GIL was ever acquired on these runtime threads?

Dec 12 '23 12:12 adamreichold

EDIT: Maybe we would need a completely need thread for soundness? Or we would need to track whether the GIL was ever acquired on these runtime threads?

Probably enought to track whether GIL is currently acquired on such a runtime thread, not ever acquired. However, for this notion of “currently required” we would need to avoid unlocking via allow_threads. One could keep the required thread-pool for this fairly small, assuming a limited number of nestings of with_gil and allow_threads calls. Hopefully with a small constant, in most practical code? – well, or is this an entirely unreasonable assumptions and there are good use-cases that make use of allow_threads and with_gil nested in each other in deep recursions?

If with_gil required Send, too, one could just make two threads play ping-pong with callbacks between with_gil and allow_threads, so that we’d just need at most a single helper thread per user-created thread. Of course, making with_gil more expensive might be way worse than making allow_threads more expensive, as for the latter, at least, one could argue that it only ever is called when there’s a benefit to be expected from context switching, i.e. it shouldn’t happen for trivially short interactions, anyways, limiting the impact of further overhead.

Dec 12 '23 14:12 steffahn

I think it's actually quite a nice idea as a way to add a safe allow_threads mechanism which really does rely on Send. The performance cost I think we could measure and decide how much it is.

As moving work to a different thread seems like a semantic change, one option is to introduce this as a new safe API e.g. py.without_gil (better name welcome) and keep the old allow_threads around with some sort of unsafe hint to let users decide what's appropriate for their own code.

Dec 12 '23 14:12 davidhewitt

As moving work to a different thread seems like a semantic change, one option is to introduce this as a new safe API e.g. py.without_gil (better name welcome) and keep the old allow_threads around with some sort of unsafe hint to let users decide what's appropriate for their own code.

I think we should turn this around, make Python::allow_threads unconditionally safe by using another thread (as we already require Send, this must be sound even it is a behavioural change) and provide a new unsafe API as an escape hatch where the overhead of the thread change is too high.

Dec 12 '23 14:12 adamreichold

well, or is this an entirely unreasonable assumptions and there are good use-cases that make use of allow_threads and with_gil nested in each other in deep recursions?

I think deep nesting is pathological but a certain degree of nesting can be expected due to callback-style API and things like that.

If with_gil required Send, too, one could just make two threads play ping-pong with callbacks between with_gil and allow_threads, so that we’d just need at most a single helper thread per user-created thread. Of course, making with_gil more expensive might be way worse than making allow_threads more expensive, as for the latter, at least, one could argue that it only ever is called when there’s a benefit to be expected from context switching, i.e. it shouldn’t happen for trivially short interactions, anyways, limiting the impact of further overhead.

I think I would want to avoid changing with_gil at all especially due to the last point, i.e. when allow_threads is called, some context switching appears to be expected in any case and making this slightly slower seems reasonable.

Dec 12 '23 15:12 adamreichold

One thing I just noticed is that Python::with_pool is not amenable to this solution because there we actually do want to stay on the thread. But I guess the fix is much more palatable to us: Get rid of the GIL pool and hence with_pool altogether.

Dec 12 '23 15:12 adamreichold

I'm the author of scoped-tls-hkt (but not scoped-tls or the other mentioned crates).

I'm fairly familiar with the Python GIL, but not that familiar with how PyO3 currently solves this problem.

If I were implementing from scratch, I would model it something like this:

When you lock the GIL you get a mutable reference whose lifetime only extends as long as the GIL is locked.
When you temporarily release the GIL you have to return that mutable reference for the time it is released via reborrowing.
Types which require the GIL to be held logically contain an immutable borrow of the GIL.
Types can be "downgraded" to non-GIL versions, and then later upgraded with a new GIL reference in order to retain references across GIL release points.

Comparing this to what I see PyO3 currently implements, it looks like the main difference is that the lifetime on Python<'_> is covariant, making the type itself Copy, and this is what leads to the unsoundness. Presumably this is to make it easier to use, since an invariant lifetime would mean that you could only use the Python<'_> object once without explicit reborrow methods.

However, if you change Python<'_> to be &mut Python then you get to take advantage of Rust's implicit reborrowing, and the object can be used multiple times.

Dec 14 '23 13:12 Diggsey

Comparing this to what I see PyO3 currently implements, it looks like the main difference is that the lifetime on Python<'_> is covariant, making the type itself Copy, and this is what leads to the unsoundness.

If I understood things correctly, making the lifetime invariant was already ruled out as a solution in the middle section of https://github.com/PyO3/pyo3/issues/3640#issuecomment-1851845423, i.e. via the linked technique used to create an exploit for objc2 via scoped-tls-hkt.

Dec 14 '23 13:12 adamreichold

When you temporarily release the GIL you have to return that mutable reference for the time it is released via reborrowing.

That sounds, I think, like it would force users to get rid of all GIL-dependent python object references created with that GIL reference before the temporary release.

I’m not actually a user of pyo3, so I can’t determine how much that would or wouldn’t go against common use cases.

Dec 14 '23 13:12 steffahn

Comparing this to what I see PyO3 currently implements, it looks like the main difference is that the lifetime on Python<'_> is covariant, making the type itself Copy, and this is what leads to the unsoundness.

This should be independent of variance concerns, and all about Copy / the ability to duplicate the Python<'_> token. After all, &'a mut T is covariant in the lifetime 'a as well.

Dec 14 '23 14:12 steffahn

A pretty typical use case for releasing the GIL is to process a &[u8] that was extracted from a &PyBytes. &PyBytes are immutable so it's sound, as long as someone has a reference. Not being able to operate on these types of objects without the GIL would be disruptive I think.

That said, you could work around this via https://docs.rs/pyo3/latest/pyo3/prelude/struct.Py.html#method.as_bytes for this specific case.

Dec 14 '23 14:12 alex

If I understood things correctly, making the lifetime invariant was already ruled out as a solution in the middle section of https://github.com/PyO3/pyo3/issues/3640#issuecomment-1851845423, i.e. via the linked technique used to create an exploit for objc2 via scoped-tls-hkt.

I don't agree - releasing the GIL would require a mutable borrow, so it's impossible to "store" the GIL somewhere for later and also call the release function.

That sounds, I think, like it would force users to get rid of all GIL-dependent python object references created with that GIL reference before the temporary release.

It would force them to be downgraded (to non-GIL objects) or no longer used, yes.

This should be independent of variance concerns, and all about Copy / the ability to duplicate the Python<'_> token. After all, &'a mut T is covariant in the lifetime 'a as well.

Yeah sorry you're correct - it's the T that's invariant in &'a mut T, not the 'a. The important thing is that &mut guarantees uniqueness regardless of anything else, which is what is wanted with the GIL.

A pretty typical use case for releasing the GIL is to process a &[u8] that was extracted from a &PyBytes. &PyBytes are immutable so it's sound, as long as someone has a reference. Not being able to operate on these types of objects without the GIL would be disruptive I think.

I mentioned "downgraded" types - I would expect that safe functionality (such as accessing immutable objects) would be available directly on the downgraded versions of types, so you wouldn't need the GIL in order to access them.

Dec 14 '23 14:12 Diggsey

While I agree that letting references live throughout the scope of the closure passed to allow_threads during which the GIL is actually not held by the current thread is likely close to the root of this, I also think that forcing borrows into GIL-bound data to be released for these sections is too big an ergonomics hit as one usually releases the GIL to continue working with plain Rust data based on those GIL-bound references, e.g. processing a stream of bytes as mentioned by @alex or working on a NumPy array via an ndarray::ArrayView using rust-numpy which is what I am most familiar with.

In my opinion, forcing the whole ecosystem to implement downgraded wrapper types which continue to provide access while the GIL is temporarily released seems too big an ergonomics hit compared to having the type system work this out as it is now.

EDIT: Also c.f. https://github.com/PyO3/pyo3/pull/3610#issuecomment-1855823246 for another situation where the existing design seems to work out well automatically. Of course, the solution by starting threads might still be prohibitively expensive over there...

Dec 14 '23 14:12 adamreichold

In my opinion, forcing the whole ecosystem to implement downgraded wrapper types which continue to provide access while the GIL is temporarily released seems too big an ergonomics hit compared to having the type system work this out as it is now.

Fair enough. Another option would be to always use the "downgraded" versions of the types everywhere, and require a GIL reference to be passed in to all methods that require it (including getters that access mutable data).

Dec 14 '23 14:12 Diggsey

the fact that the standard library’s thread-locals require T: 'static means there’s any guarantees that non-'static data isn’t allowed to be “smuggled” through thread-local storage.

Just chiming in here (without having read the entire thread, sorry) to note that as far as I am concerned, the fact that static are restricted to Sync + 'static does not mean that only Sync + 'static data can be smuggled from a caller to its callee via external means. Any way of dynamically checking that the data is only used in the same thread and in a suitable scope is equally appropriate. There's no fundamental relationship between "contains a non-'static lifetime" and "cannot be smuggled around"; that relationship is entirely coincidental. Non-'static types are harder to smuggle around, but (unless and until the lang team decides otherwise) you can't assume that they are impossible to smuggle around.

Looking at some of the prior discussion here around an Ungil auto trait, I was wondering what the soundness contract of Ungil is that would make this work. There's no Rust type that represents "absence of smuggling". The fact that the type name is negated already is a big red flag indicating it probably can't work; the counterexample in this issue is the nail in the coffin IMO. In general it's hard, if not impossible, to express negative information in type systems. If you want the user to prove something, you need to ask them to produce a value of a certain type that serves as a witness to show that this is indeed proven. "Data you can't smuggle" is a concept that I have no idea how to possibly formalize. It might have to be some notion of "this is a pure function"? Neither 'static nor Send achieve that, that is not their intention.

Dec 14 '23 16:12 RalfJung

Fair enough. Another option would be to always use the "downgraded" versions of the types everywhere, and require a GIL reference to be passed in to all methods that require it (including getters that access mutable data).

My feeling is that this might be a viable long-term solution, but there are very compelling performance and ergonomics reasons to not require the "GIL reference to passed in to all methods that require it". The main reason is for interactions with traits: for example, what about Clone, Drop, Iterator, or even Display, for the downgraded references? At the moment they would incur at minimum a TLS lookup per trait method call to acquire the GIL reference. By having the "attached" types we avoid these overheads. Contexts are a long way off and the design for them is not yet clear, but if there's a possibility to conditionally implement traits based on contexts or even specialize based on context, that might be the way to remove attached types but keep the same ergonomics and performance.

[@Diggsey] If I were implementing from scratch, I would model it something like this:

This is a much better expressed form of what I originally wrote at https://github.com/PyO3/pyo3/issues/3640#issuecomment-1851757613

In my opinion, forcing the whole ecosystem to implement downgraded wrapper types which continue to provide access while the GIL is temporarily released seems too big an ergonomics hit compared to having the type system work this out as it is now.

I definitely agree that it's an ergonomics hit, but maybe it's possible to get there with a macro. Maybe for example we can one day have a macro that takes a list of idents to temporarily downgrade, and reupgrade once the allow_threads call is over.

    let list = PyList::new(...);
    let data = "1, 2, 3";
    allow_threads!(py, [list], { /* do something with `data`, `list` is only accessible as a downgraded reference */ });

I think this line of exploration of is likely impossible until we've moved &'py PyAny to Py2<'py, PyAny> because we first desperately need to decouple reference lifetimes from the GIL lifetime.

Dec 14 '23 16:12 davidhewitt

I definitely agree that it's an ergonomics hit, but maybe it's possible to get there with a macro. Maybe for example we can one day have a macro that takes a list of idents to temporarily downgrade, and reupgrade once the allow_threads call is over.

I am less concerned with the syntactical side of things, but rather that every creator of such a downgraded type would have to worry what is still allowed and what is not, i.e. figuring out a sound API for these types all over the ecosystem.

I think being able to rely on the existing Send trait is a bit of a strength in disguise here, because it allows us to re-use the isolation properties (or lack of) that existing types already have without doing anything PyO3-specific (ndarray::ArrayView working out of the box being an example of that from rust-numpy). The "only problem" we have is that while we try to establish additional boundaries, the lack of actual thread boundaries bites us via these soundness loopholes.

Dec 14 '23 16:12 adamreichold

pyo3 pyo3 copied to clipboard

`Python::allow_threads` is unsound in the presence of `scoped-tls`.

pyo3
pyo3 copied to clipboard