tch-rs
Possible UB in Tensor
Hi,
I'm not super familiar with the design of the library, but I've been trying to use it in a project of mine. In doing so I found that Tensor::f_copy mutates a Tensor that's passed as an immutable reference, which is UB in Rust. I think one of the arguments should be changed to &mut to fix this. Sorry if I've missed some context.
Thanks!
Thanks for reporting this issue. Indeed, mut is not handled properly. This is mostly because a lot of the wrapping code is auto-generated from a description file that doesn't specify whether tensors are mutated or not.
That said, f_copy_ is not automatically generated, so I just changed it as per your suggestion to take a proper &mut self argument.
There is also a convention in PyTorch that methods ending with an underscore are mutating; I'll make further changes so that the automatically generated code handles this properly.
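To make the convention concrete, here is a minimal, self-contained sketch of the two signature shapes; MyTensor is a stand-in for illustration, not the real tch::Tensor:
// Stand-in type illustrating the underscore convention, not the real tch::Tensor.
struct MyTensor(Vec<f64>);

impl MyTensor {
    // No trailing underscore: leaves self untouched and returns a new tensor.
    fn relu(&self) -> MyTensor {
        MyTensor(self.0.iter().map(|&x| f64::max(x, 0.0)).collect())
    }
    // Trailing underscore: mutates in place, hence &mut self.
    fn relu_(&mut self) {
        for x in self.0.iter_mut() {
            *x = f64::max(*x, 0.0);
        }
    }
}

fn main() {
    let mut t = MyTensor(vec![-1.0, 2.0]);
    let _u = t.relu(); // t is unchanged
    t.relu_();         // t is mutated in place
}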
Great, thanks for the prompt response!
The code generation has been tweaked to reflect that functions ending with underscores can mutate self. I'm not sure whether this covers all possible scenarios but this should at least improve the current situation. Feel free to re-open if you see further inconsistencies.
This is slightly unrelated to this issue, but I see that a number of functions return a Tensor out that has a different relationship to the input Tensor in depending on the operation. For example, out might be a view into in, or it might be an entirely new Tensor.
In general, there are lifetime relations between out and in that the current code doesn't quite capture, which could be UB. For example, if out is a view into in, at the moment I can drop in and still keep out around. I'm not sure whether this is UB, since in is backed by storage allocated on the C side of things, but it is something that should be kept in mind.
ndarray gets around this by having distinct owned (Array) and view (ArrayView) types that are instantiations of a core ArrayBase struct. Maybe something similar could be useful here?
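For what it's worth, plain Rust slices already encode this owner/view relation through lifetimes; a minimal std-only sketch of the behavior such a design would give:
fn main() {
    let owner = vec![1, 2, 3];
    let view: &[i32] = &owner[1..]; // `view` borrows from `owner`
    // drop(owner); // compile error: cannot move `owner` while it is borrowed
    println!("{:?}", view); // the borrow keeps `owner` alive until here
}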
You're perfectly right. Again, the issue comes from the code being mostly generated from an ops definition file, and this file doesn't include much information about what is shared between the different inputs and outputs.
I don't think it's very straightforward to handle this properly. Maybe it would be worth writing a more type-safe API on top of these wrappers, but this is likely to be a bit involved if one wants to handle all the operations provided by torch. One pretty tough case is the backprop operation: when called on an op o, it can modify all the gradients in the computation graph leading to o.
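A rough sketch of that backprop case against the tch API discussed in this thread (hedged: exact generated signatures vary across tch versions):
use tch::{Device, Kind, Tensor};

fn main() {
    let x = Tensor::ones(&[2], (Kind::Float, Device::Cpu)).set_requires_grad(true);
    let y = x.dot(&x); // a scalar: the sum of squares of x
    // backward takes &self, yet it writes gradients into every tensor
    // upstream of y in the computation graph, x included.
    y.backward();
    x.grad().print();
}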
Meanwhile, note that we still have some form of memory safety: a tensor's memory shouldn't be released too early (this is ensured by the C++ API using reference counting internally).
In that case, could having Tensor wrap its contents in an Rc/Rc<RefCell<...>> help? Otherwise it might be prudent to mark these possibly unsafe operations as unsafe, to alert users to the issue? Either way, this should be a different issue, I guess.
Unsafe seems a bit too much to me in this case: it cannot cause a segfault, but there may be some data sharing that is not represented in the type system. I would have thought that the only issues this creates are potential race conditions when using multiple threads; maybe not implementing the Send trait for tensors, so that they cannot be shared between threads, would be enough.
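As a sketch of that last point: a struct holding a raw pointer is !Send by default, so simply omitting the manual unsafe impl Send would already keep such tensors from crossing threads (RawTensor is illustrative, not the real type):
struct RawTensor {
    ptr: *mut std::ffi::c_void,
}

fn main() {
    let t = RawTensor { ptr: std::ptr::null_mut() };
    // std::thread::spawn(move || { let _ = t; }); // compile error: `*mut c_void` is not `Send`
    let _ = t.ptr;
}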
Hmm, it can still cause problems. The following example (adapted from the swap example in this blog post) demonstrates some unexpected issues:
// Say we have Tensor t1
let t2 = t1.reshape(t1.shape()); // t2 is now a view of t1
let swap = |x, y| {
    *x = *x ^ *y;
    *y = *x ^ *y;
    *x = *x ^ *y;
};
swap(&mut t1, &mut t2); // now t1 and t2 are both 0
Unless reshape() outputs newly allocated storage for t2, and unless PyTorch enforces Rust-like borrowing in its ref-counting, you'll get the above phenomenon, which should be impossible in safe Rust. (Note that if one tried to do the above via Rc<RefCell<...>> in Rust, one would get a runtime panic because x and y would be simultaneously mutably and immutably borrowing the underlying storage.)
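A minimal, runnable illustration of that runtime panic, using only std types:
use std::cell::RefCell;
use std::rc::Rc;

fn main() {
    let storage = Rc::new(RefCell::new(vec![1u8, 2, 3]));
    let alias = Rc::clone(&storage); // shares the same storage, like a tensor view
    let _a = storage.borrow_mut();
    let _b = alias.borrow_mut(); // panics at runtime: already mutably borrowed
}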
Very interesting example, thanks; indeed there is more to it than race conditions. I'm still not sure whether fixing this is in the scope of this crate or of another layer on top of it. I have to think more about it, but meanwhile I've reopened this issue so that it's clearer what the current behavior is.
Also see https://github.com/PyO3/pyo3/issues/342, this chapter in wasm-bindgen, and finally this Stack Overflow question.
I'll bump this just to point out the comment I recently added to the PyO3 issue, which is basically that rust-cpython uses &[Cell<T>] and &[ReadOnlyCell<T>] to fix this sort of UB:
https://github.com/PyO3/pyo3/issues/342#issuecomment-506410116
The title of this issue may be unnecessarily alarmist; I was certainly alarmed. It’s UB for a Rust reference to alias mutable data, and that would be very bad. But this library doesn’t appear to actually take Rust references to the underlying data inside a Tensor; there’s no operation like fn data(&Tensor) -> &[u8]. It may be confusing for the data accessible through a Tensor to change due to aliasing, and perhaps there are ways to reduce that confusion with better types, but it’s not UB if it’s not observed through Rust references.
Linking burn's ticket #235; burn uses tch-rs and may have the same issue, so if we come up with a solution over there you can look it up too.
@antimora that's very interesting, do you have a small tch example that triggers the UB? I don't think we ever came up with one and it's unclear that it could actually happen but if you have an example that would be very helpful in order to propose a fix.
The problem arises when using tch::Tensor wrapped inside an Arc for a pattern similar to "copy-on-write". However, not all operations return tensors with newly allocated memory.
For instance, reshaping a tensor followed by an in-place operation may or may not change the original tensor depending on the shape. So it seems like you have to treat tensors as unsafe since they are not really owners of their data.
let tensor = ...; // some pre-existing tensor
let mut tensor_cloned = tensor.shallow_clone(); // shares the same underlying storage
tensor_cloned.relu_(); // also changes the data seen through `tensor`
It's not a big problem when you explicitly call shallow_clone followed by an in-place operation, but some non-mutating operations (like narrow or reshape) sometimes return a tensor that uses the same memory location as its parent.
I fixed the problem by carefully tracking references to data_ptr, which is definitely an unsafe pattern. Since tch aims to provide bindings to libtorch in the most direct way possible, I'm not sure whether this is something that can be fixed, but it could at least be documented somewhere.
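A rough sketch of that data_ptr tracking idea (the helper name is mine; note the caveat that equal pointers imply sharing, but an offset view such as one produced by narrow can alias the parent while starting at a different address):
use tch::{Device, Kind, Tensor};

fn shares_storage_start(a: &Tensor, b: &Tensor) -> bool {
    // Only detects views that start at the same address; an offset view
    // can still alias the parent with a different data_ptr.
    a.data_ptr() == b.data_ptr()
}

fn main() {
    let t = Tensor::zeros(&[4], (Kind::Float, Device::Cpu));
    let v = t.reshape(&[2, 2]); // may share t's storage depending on layout
    println!("aliases: {}", shares_storage_start(&t, &v));
}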
Do you have evidence that this is actually undefined behavior, or is it merely a confusing use of interior mutability? You can do confusing things with Arc<RefCell<T>> in completely safe Rust, too.
@LaurentMazare
@antimora that's very interesting, do you have a small tch example that triggers the UB? I don't think we ever came up with one and it's unclear that it could actually happen but if you have an example that would be very helpful in order to propose a fix.
I asked @nathanielsimard to respond to your request since he is the one who fixed a bug on our end.
@andersk I would say this:
Mutating immutable data. All data inside a const item is immutable. Moreover, all data reached through a shared reference or data owned by an immutable binding is immutable, unless that data is contained within an UnsafeCell<U>.
The difference with RefCell is that it panics if you try to take multiple mutable references to the same value, or a read-only reference while a mutable reference exists. However, I don't think shallow_clone validates that. Maybe making the function unsafe would better communicate that it's the caller's responsibility to ensure that multiple tensors will not mutate the same data (a sketch follows after the quote below).
From the same link:
unsafe only means that avoiding undefined behavior is on the programmer; it does not change anything about the fact that Rust programs must never cause undefined behavior.
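As a self-contained toy of where unsafe would move the obligation (this is not the actual tch API, just a sketch):
struct Toy {
    ptr: *mut f32,
}

impl Toy {
    // Safety: the caller must ensure the clones are never used to create
    // overlapping mutable access to the shared data.
    unsafe fn shallow_clone(&self) -> Toy {
        Toy { ptr: self.ptr }
    }
}

fn main() {
    let mut data = 1.0f32;
    let t = Toy { ptr: &mut data };
    let alias = unsafe { t.shallow_clone() }; // caller explicitly opts in to aliasing
    let _ = (t.ptr, alias.ptr);
}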
That’s my point though—as far as I’m aware, the data in a Tensor is not reachable through a shared reference or owned by an immutable binding. It’s only reachable through a raw pointer, which is not subject to those rules.
https://github.com/LaurentMazare/tch-rs/blob/123c96c88d6f63511e58df7246834b39d89f1ab5/src/wrappers/tensor.rs#L18-L20
If this library were to expose an operation that dereferences this raw pointer and returns a Rust reference, that could be used by safe code to trigger undefined behavior. But I don’t see an operation like that.
(Rc<Cell<T>> is a closer analogy in safe Rust, since there’s no safe way to get a &T out of that.)
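Concretely, the Rc<Cell<T>> analogy in safe, runnable form:
use std::cell::Cell;
use std::rc::Rc;

fn main() {
    let a = Rc::new(Cell::new(0i32));
    let b = Rc::clone(&a);
    b.set(42); // mutate through one handle...
    println!("{}", a.get()); // ...observe it through the other: prints 42
    // There is no safe way to obtain a `&i32` into the Cell's contents,
    // so no Rust reference ever aliases the mutation.
}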
Hmm, the Send implementation is suspicious though:
https://github.com/LaurentMazare/tch-rs/blob/123c96c88d6f63511e58df7246834b39d89f1ab5/src/wrappers/tensor.rs#L22
based on this note from an upstream contributor that “Tensors are thread safe to read from multiple threads but they are not threadsafe to write”.
I'm not qualified enough to say with certainty what is or is not undefined behavior; I'm just pointing out that in-place operations can have surprising side effects, especially when coupled with operations that may or may not return a newly allocated tensor.
I'm also not sure whether there is anything to do here: LibTorch was not written in Rust and, if a view is modified, it may choose to modify the parent tensor as well! I don't see how this could be fixed while still keeping an API as close as possible to the C++ API.
It aims at staying as close as possible to the original C++ API. More idiomatic Rust bindings could then be developed on top of this.