
Implement Clone for Tensor?

Open edlanglois opened this issue 3 years ago • 2 comments

I am interested in the reasons for / against implementing Clone for Tensor. The only existing discussion I'm aware of is part of #281.

To start, it would be nice to have Clone implemented for Tensor. I'd like to be able to easily clone structures that store tensors.

There is a C++ clone function that could be used. I see that this function is in Declarations-*.yaml but it does not appear in the generated API (at least for tch-rs; it is in the API for ocaml-torch). ~~Even if the decision is not to implement Clone for Tensor it would be nice to expose this function as g_clone or something.~~ [Edit: Oops, I forgot about Tensor::copy. Why is that done in Rust with copy_ instead of with the C++ clone()?]
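
For context, the copy_-based pattern the question refers to can be sketched like this (an illustrative sketch, not the actual tch-rs source; the `deep_copy` name is hypothetical, and it assumes `zeros_like` preserves shape/kind/device as in libtorch):

```rust
use tch::Tensor;

// Hypothetical helper: a deep copy built from a fresh allocation plus copy_.
fn deep_copy(t: &Tensor) -> Tensor {
    // zeros_like gives a tensor with the same shape, kind and device.
    let mut result = t.zeros_like();
    // copy_ then fills it with the source data.
    result.copy_(t);
    result
}
```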

There is a possible issue over whether Tensor::clone should carry a gradient back to the source. From the torch perspective you would expect a clone() method to behave the same as it does in C++ and Python, which is to be differentiable with respect to the source.

From the Rust side that might be unexpected: if I clone a parameter tensor then I don't want the new parameter tensor to have gradients back to the original. I'm not sure that detaching is the solution either. If you have

```rust
let x = Tensor::ones(&[0], ...).requires_grad_(true);
let y = 2 * x;
let z = y.clone();
```

then should z be differentiable with respect to x as though it were y, or simply be detached? From a Rust perspective I'd kind of expect z to be exactly like y, which means being differentiable with respect to x.

The more I think about it, though, the more I think it's fine for clone() to be differentiable. In the above example, dy/dx = 2, dy/dy = 1, and z behaves the same when substituted for y: dz/dx = 2, dz/dy = 1, dz/dz = 1. As for differentiably cloning parameter tensors, I think it's unlikely that you would differentiate a function of the clone with respect to the original, and if you do, the fact that the C++/PyTorch clone is differentiable ought to be enough of a warning to consider that the tch version might be too.
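
To make the arithmetic above concrete, here is a sketch of the gradient check, using shallow_clone() as a stand-in for the proposed differentiable clone() (the assumption being that a shallow clone shares the autograd graph, so z behaves like y):

```rust
use tch::{Device, Kind, Tensor};

fn main() {
    let x = Tensor::ones(&[1], (Kind::Float, Device::Cpu)).set_requires_grad(true);
    let y = &x * 2;
    // Stand-in for the proposed clone(): z refers to the same graph node as y.
    let z = y.shallow_clone();
    z.backward();
    // Expect dz/dx = 2.0, exactly as if y.backward() had been called.
    println!("dz/dx = {}", x.grad().double_value(&[]));
}
```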

Another risk with implementing Clone is that the new tensor isn't tracked by a VarStore, but I don't think that should be a problem either. If you are calling clone() manually then it should be expected that the clone won't be in a VarStore. The remaining risk would be deriving Clone for a struct that stores Tensors and has those Tensors in a VarStore, except you can't do that because VarStore doesn't implement Clone.

Any other reasons for / against? At this point I am in favour of implementing Clone for Tensor using the differentiable C++ clone() function.
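
For clarity, the proposal amounts to something like the following inside the tch crate itself (a hypothetical sketch only; whether the body should call the differentiable C++ clone() or the existing copy() is exactly the open question):

```rust
// Hypothetical sketch. This impl would have to live inside tch itself, since
// both Clone and Tensor are foreign types from a downstream crate's point of
// view (orphan rule).
impl Clone for Tensor {
    fn clone(&self) -> Tensor {
        // Placeholder body: today's deep-copy helper. The differentiable
        // variant would call through to the C++ clone() instead.
        self.copy()
    }
}
```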

edlanglois avatar Jun 01 '21 20:06 edlanglois

Replying to https://github.com/LaurentMazare/tch-rs/issues/281#issuecomment-855287935 from @jerry73204

Another concern is that whether .clone() is shallow or deep affects the semantics of ownership. Currently, self-modifying methods like .absolute_(&mut self) expect unique ownership. A shallow copy shares the underlying data, which makes that ownership claim unsound. If libtorch allows concurrent self-modifying operations, just like many other concurrent data structures such as dashmap, the API should expect shared ownership.

I agree that clone() should be a deep copy. One reason is that the Python and C++ clone() methods perform a deep copy, so any tch-rs clone() should do the same. Another reason is that, as a user, I think of tch::Tensor as owning its data (even though that's not quite true), so I expect clone() to copy the data. The tch-rs API doesn't really advertise the fact that tch::Tensor is essentially Rc<RefCell<Data>> (on the C++ side); it's called Tensor, not TensorHandle, after all, so I think it would be surprising to get a shallow copy from clone().

As for shallow_clone(), are you sure it's unsound? I'm not an expert on Rust's soundness rules, but in sound Rust code you can have two non-mut Rc<RefCell<Data>> pointing to the same data, modify the data through one, and then observe the modification when reading through the other. I assume that on the C++ side there is some sort of lock to prevent multiple concurrent modifications. If the C++ side is something like Arc<Mutex<Data>> then as far as I know it ought to be fine. The RefCell check that only one &mut exists at a time wouldn't be there in C++ (just the interior mutability), but since the tch-rs API doesn't let you safely get a pointer to the underlying data, I don't think you can end up with multiple mutable references.
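
To illustrate the shared-data behaviour being discussed, here is a quick sketch against the current API (assuming fill_ as a convenient in-place write and print() for inspection):

```rust
use tch::{Device, Kind, Tensor};

fn main() {
    let mut a = Tensor::zeros(&[3], (Kind::Float, Device::Cpu));
    let b = a.shallow_clone(); // new handle, same underlying data
    let c = a.copy();          // deep copy, independent data
    let _ = a.fill_(1.0);      // in-place write through `a`
    b.print(); // all ones: `b` shares data with `a`
    c.print(); // still zeros: `c` was copied before the write
}
```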

edlanglois avatar Jun 05 '21 21:06 edlanglois

What I mean by unsoundness is that taking &mut self implies unique ownership, which does not reflect the fact that the data is shared after a shallow copy. Self-modifying methods should take &self instead. The implementation itself isn't wrong; it's the API that doesn't respect the ownership rule.

Data sharing is common in libtorch in various ways; for example, Tensor::view() produces two owned instances with shared data. The sharing could be made explicit, as with Arc<RefCell<_>>, or kept transparent and documented in terms of ownership. Either way, the ownership model of Tensor must be clearly defined. Consider the two cases (sketched in code after the list):

  • Tensor has unique ownership of its data: .absolute_(&mut self) takes &mut, .view(self, ...) takes ownership. .shallow_clone() becomes unsound and should be removed.
  • Tensor points to shared data: .absolute_(&self) takes a shared reference, .view(&self, ...) borrows and produces a new owned handle. Both a deep .clone() and .shallow_clone() can be defined.
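
A rough sketch of the two API shapes (hypothetical traits purely for illustration; neither is the actual tch-rs interface):

```rust
// Model 1: Tensor uniquely owns its data.
trait UniquelyOwned: Sized {
    fn absolute_(&mut self);              // in-place op demands exclusive access
    fn view(self, shape: &[i64]) -> Self; // consumes self, so no aliasing arises
    // no shallow_clone(): it would alias uniquely-owned data
}

// Model 2: Tensor is a handle to shared data.
trait SharedHandle: Sized {
    fn absolute_(&self);                   // in-place op through a shared handle
    fn view(&self, shape: &[i64]) -> Self; // borrows and returns a new handle
    fn shallow_clone(&self) -> Self;       // cheap handle copy, data still shared
    fn deep_clone(&self) -> Self;          // deep copy of the data
}
```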

Taking a mutable pointer to the data can be marked unsafe. The practical use case is implementing low-level operations with CUDA. An example can be found in my tch-nms (link).
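
A sketch of the unsafe marking (hypothetical; it assumes tch-rs exposes a raw data_ptr-style accessor, and raw_data is just an illustrative wrapper name):

```rust
use std::ffi::c_void;
use tch::Tensor;

// Marking raw access unsafe makes the aliasing hazard explicit at the call
// site, e.g. when handing the buffer to a custom CUDA kernel.
unsafe fn raw_data(t: &Tensor) -> *mut c_void {
    t.data_ptr()
}
```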

jerry73204 avatar Jun 06 '21 06:06 jerry73204