burn
Device in Tensor Type
Feature Description
There is a call for a feature that bundles the device a tensor is on into its type, so that device mismatches become compile-time errors instead of runtime errors. Catching these critical errors during development, rather than during post-hoc debugging, would save a great deal of time.
Feature motivation
The primary motivation behind this feature is to simplify development and make it less error-prone. By bundling the device a tensor is on with its type, developers learn at compile time whether there are any device compatibility issues. This shifts error detection from runtime to compile time, saving the time and effort that would otherwise be spent debugging after compilation.
Proposed Solution
A potential solution would be to include device information as a parameter of the tensor type. This would require a shift in how tensors are defined in Burn. For instance, instead of the current definition, which may look like this:
Tensor<Backend, Dims, Float>
We could have something like:
Tensor<Backend, Dims, Backend::Device<CPU>, Float>
Here, the tensor's type now includes information regarding the device it's on, which makes tensor operation compatibility a compile-time concern.
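To make the idea concrete, here is a minimal, self-contained sketch of the mechanism being proposed. The `DevTensor`, `Cpu`, and `Cuda0` names are hypothetical stand-ins, not Burn's real types; the point is only that encoding the device as a zero-sized type parameter turns a device mismatch into a compile error:

```rust
use std::marker::PhantomData;
use std::ops::Add;

// Hypothetical zero-sized marker types standing in for devices.
struct Cpu;
struct Cuda0;

// Toy tensor that carries its device in the type. The real burn
// `Tensor` has more parameters; this only sketches the idea.
struct DevTensor<Dev> {
    data: Vec<f32>,
    _device: PhantomData<Dev>,
}

impl<Dev> DevTensor<Dev> {
    fn new(data: Vec<f32>) -> Self {
        DevTensor { data, _device: PhantomData }
    }
}

// `Add` only exists for tensors whose device types match, so adding a
// `DevTensor<Cpu>` to a `DevTensor<Cuda0>` fails to compile instead of
// panicking at runtime.
impl<Dev> Add for DevTensor<Dev> {
    type Output = DevTensor<Dev>;
    fn add(self, rhs: Self) -> Self::Output {
        DevTensor::new(
            self.data.iter().zip(&rhs.data).map(|(a, b)| a + b).collect(),
        )
    }
}
```

Adding two `DevTensor<Cpu>` values works as usual, while `DevTensor<Cpu> + DevTensor<Cuda0>` is rejected by the type checker.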
Trade-offs
One potential trade-off is that this may result in more verbose code with longer tensor definitions. Additionally, this change might break backward compatibility with existing code that assumes the former style of tensor definition.
The backend type already carries the Device type information: burn-tensor/src/tensor/backend/base.rs.
What problem did you come across? Can you give a working example? It'd be easier to understand the problem with an example.
For example running the following code yields an error that the two tensors are on different devices when trying to add them together. It would be great if this was caught at compile time instead of runtime.
type Backend = burn_tch::TchBackend<f32>;
let device = burn_tch::TchDevice::Cuda(0);
let a: Tensor<Backend, 1, Int> = Tensor::arange_device(0..10, &device);
let b: Tensor<Backend, 1, Int> = Tensor::arange(0..10);
let c = a + b;
I see. Yeah, currently the device information is tracked in an enum, so it won't break your compilation, just like having different shapes won't break compilation (but different dimensions and element types will).
This change will be a major refactor even if I would agree with the new design, which I am inclined not to because the benefits are so minor compared to disadvantages. Of course, I could be wrong, so I'll let the original designer and implementor give his perspective. @nathanielsimard , what do you think?
Neither shape checking nor device checking is done at compile time in Burn, though there is an experimental feature for shape checking.
The primary reason to avoid having the device type in the tensor struct is verbosity and multi-GPU setups. I think it's important to support a dynamic number of GPUs with Burn without recompilation. However, I'm willing to improve the design here; there are multiple ways we can reduce the number of runtime errors caused by different devices:
- When choosing the Tch backend, provide the default device as a generic along with the precision.
- Have a new backend generic argument that changes the device type: TchBackend<f32, Cpu>, TchBackend<f32, All>, TchBackend<f32, Gpu>, with two new structs for the device type: Cpu and Cuda { index: usize }.
- Instead of crashing, switch a CPU tensor to the GPU if one of the tensors is on the GPU during an operation.
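Option 2 can be sketched with marker types and a small trait. Everything below is a mock (the `DeviceKind` trait and this `TchBackend` are illustrative names, not Burn's real API); it only shows how a device kind could become part of the backend's type:

```rust
use std::marker::PhantomData;

// Hypothetical marker types for the proposed device generic.
struct Cpu;
struct Gpu;
struct All;

// A mock trait so each marker can report what it stands for.
trait DeviceKind {
    fn describe() -> &'static str;
}
impl DeviceKind for Cpu { fn describe() -> &'static str { "cpu" } }
impl DeviceKind for Gpu { fn describe() -> &'static str { "gpu" } }
impl DeviceKind for All { fn describe() -> &'static str { "any" } }

// Mock backend parameterized by precision and device kind, loosely
// following the proposed `TchBackend<f32, Cpu>` shape.
struct TchBackend<E, D: DeviceKind>(PhantomData<(E, D)>);

impl<E, D: DeviceKind> TchBackend<E, D> {
    // The device kind is now visible at the type level, so two models
    // built on `TchBackend<f32, Cpu>` and `TchBackend<f32, Gpu>`
    // simply have different types.
    fn device_kind() -> &'static str {
        D::describe()
    }
}
```

An `All` backend would keep today's fully dynamic behavior, while `Cpu`/`Gpu` backends could statically rule out cross-device mixes.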
Those are just ideas that won't really affect the verbosity. Ultimately the device type is abstracted by the backend trait, so it's up to the backend to provide a convenient way to choose the device on which the model will run. In the case of a CPU-only or GPU-only backend, the problem of dynamically changing the device isn't an issue. @antimora @Gadersd what are your thoughts?
I like option 2 as it's explicit, perhaps allows the compiler to automatically choose the correct device in some cases, and enforces an extra layer of reliability as long as it won't hinder burn's flexibility. Option 3 would be convenient but I would rather not rely on easily overlooked implicit behavior that could potentially be a performance bottleneck in some cases. Option 1 is also somewhat implicit and can still result in device errors from calls to to_device. I think Rust has a huge amount of potential for reducing the runtime errors that plague frameworks such as PyTorch and I would like to see compile time verification taken as far as it reasonably can so that machine learning can have the reliability it deserves.
I think one of the areas where I'm a bit unsure is how to efficiently convert data from one backend to another. Sometimes you want some computation (like metrics) to be calculated on the CPU and some computation (training) to be computed on the GPU. Relying on into_data and from_data may slow down the conversion, but it will work with all backends. I'm unsure how we can introduce a backend conversion API that is easy to use but optional for backends to implement, more like a specialization API for performance.
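The generic fallback path being described can be sketched as follows. The `BackendLike` trait and both mock backends are hypothetical, chosen only to mirror the `into_data`/`from_data` shape of the discussion, not Burn's actual traits:

```rust
// Hypothetical minimal trait mirroring the `into_data`/`from_data`
// fallback path; names and mock backends are illustrative.
trait BackendLike {
    type Tensor;
    fn from_data(data: Vec<f32>) -> Self::Tensor;
    fn into_data(tensor: Self::Tensor) -> Vec<f32>;
}

struct CpuBackend;
struct GpuBackend;

impl BackendLike for CpuBackend {
    type Tensor = Vec<f32>;
    fn from_data(data: Vec<f32>) -> Vec<f32> { data }
    fn into_data(tensor: Vec<f32>) -> Vec<f32> { tensor }
}

impl BackendLike for GpuBackend {
    type Tensor = Vec<f32>; // pretend this buffer lives on a GPU
    fn from_data(data: Vec<f32>) -> Vec<f32> { data }
    fn into_data(tensor: Vec<f32>) -> Vec<f32> { tensor }
}

// Generic fallback conversion: works for any pair of backends, but
// round-trips through host memory, which is exactly the slow path the
// comment above worries about. A specialization-style API would let
// specific backend pairs override this with a direct transfer.
fn convert<Src: BackendLike, Dst: BackendLike>(t: Src::Tensor) -> Dst::Tensor {
    Dst::from_data(Src::into_data(t))
}
```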
I like option 2 as well. I would be happy to see if we could achieve this without a lot of disruption. If it only affects the backend types, then there aren't that many required changes.
So, how would this work with the add operator in the context of @Gadersd's example? Would TensorPrimitive contain device information? Here is the add signature:
fn add<const D: usize>(
lhs: B::TensorPrimitive<D>,
rhs: B::TensorPrimitive<D>,
) -> B::TensorPrimitive<D>;
Hi all!
I have been falling in love with Burn for a couple of weeks now, but I ran into the same issue with operations involving different devices. The issue seems so obviously critical, yet no one has discussed it here for a long while. I wonder why?
One cannot rely on Backend::Device::default() doing the right thing all the time in places like this one
impl<B, const D: usize, K> Tensor<B, D, K>
...
/// Create a tensor of the given shape where each element is one.
pub fn ones<S: Into<Shape<D>>>(shape: S) -> Self {
Self::ones_device(shape, &B::Device::default())
}
especially while Device::default() is something as primitive as this
impl Default for CandleDevice {
fn default() -> Self {
Self::Cpu
}
}
The issue is not so obvious for WgpuBackend only because it tries to use the best device available by default, which is what most people want anyway.
Burn must have a special type representing a strategy for choosing a device for a tensor if one hasn't been specified. Without it, one cannot control where each Module is going to create its tensors, which makes the whole beauty of the framework kinda useless. Ideally, a developer should be allowed to create any such strategy for a Backend. Roughly, something equivalent to Fn(...) -> B::Device. What do you think?
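One possible reading of that "strategy type" idea is a small trait, sketched below. The `Device` enum, `DeviceStrategy` trait, and `RoundRobin` example are all hypothetical names, not anything Burn defines:

```rust
use std::cell::Cell;

// Mock device value standing in for `B::Device`.
#[derive(Clone, Debug, PartialEq)]
enum Device {
    Cpu,
    Cuda(usize),
}

// Roughly the `Fn(...) -> B::Device` shape: anything that can decide
// which device the next tensor/module should be created on.
trait DeviceStrategy {
    fn next_device(&self) -> Device;
}

// Example strategy: spread allocations across N GPUs round-robin.
struct RoundRobin {
    gpus: usize,
    next: Cell<usize>,
}

impl DeviceStrategy for RoundRobin {
    fn next_device(&self) -> Device {
        let i = self.next.get();
        self.next.set((i + 1) % self.gpus);
        Device::Cuda(i)
    }
}
```

A `Module` builder could then take any `impl DeviceStrategy` and consult it whenever a tensor is created without an explicit device.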
I think your proposal is very flexible and will reduce the number of errors! The default devices can be specified as a generic argument in the backend definition.
The more I think about the issue, the more I realize that a simple trick with a "strategy type providing a default device" is a bad idea. This is why.
Imagine that you need to create 10 tensors on 10 different devices. Naturally, you would want to do this in a loop. But with a strategy type, you would need 10 different types, one for each case. This is not a feasible solution.
We just have to admit that creating a tensor on a device is a "runtime thing", not a "type thing", and it must be solved with instances of a type (Device), not just the type itself. Currently, when we create a new tensor, there's no instance of a device to be found.
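The "device as a value" point can be shown in a few lines. All names below are mocks for illustration; the loop only works because the device is an ordinary value rather than a type parameter:

```rust
// Mock device and tensor; the device is a runtime value.
#[derive(Clone, Debug, PartialEq)]
enum Device {
    Cuda(usize),
}

struct MockTensor {
    device: Device,
    data: Vec<f32>,
}

// Mock constructor taking the device as an argument.
fn zeros(len: usize, device: &Device) -> MockTensor {
    MockTensor { device: device.clone(), data: vec![0.0; len] }
}

// Ten tensors on ten different devices in a plain loop. With the
// device as a type parameter, this would require ten distinct types.
fn tensors_on_ten_gpus() -> Vec<MockTensor> {
    (0..10).map(|i| zeros(4, &Device::Cuda(i))).collect()
}
```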
impl EmbeddingConfig {
/// Initialize a new [embedding](Embedding) module.
pub fn init<B: Backend>(&self) -> Embedding<B> {
let weight = self
.initializer
.init([self.n_embedding, self.d_model])
.require_grad();
Embedding {
weight: Param::from(weight),
}
}
}
Nothing here can possibly tell us about the device we must use for the weights. Backend here is merely a type; we cannot dynamically pass an exact device id through it (without creating a pile of types like Gpu1, Gpu2, Gpu3..., which would kill portability between machines). This means we must introduce a new argument device: B::Device, which leads to a new signature
impl EmbeddingConfig {
/// Initialize a new [embedding](Embedding) module.
pub fn init<B: Backend>(&self, device: B::Device) -> Embedding<B> {
...
}
}
and a massive refactoring as a result, which will change every single module shipped with Burn and get rid of all `Tensor::*_device(...)` methods along the way. I don't see any way around it, if we want to do this properly.
The only way we could keep deviceless implementation is to make sure we don't actually create any real tensors, but instead create a specification for them. Like in a computational graph, which is device-agnostic, and can be executed on any device, given some inputs. But Burn doesn't do that, it executes the code eagerly.
I don't even want to consider the idea of automatically converting tensors between devices (say from CPU to GPU): this would lead to constant and silent troubles with performance and potential bugs with back-propagation.
Hi all! I've completed the work of introducing explicit device specification for modules that are part of the framework. It passes all tests, but I'm not yet sure I've caught 100% of initializations relying on default devices. Please, let me know if you're interested in merging this into the main codebase, so I would know if I should re-check everything again and make a pull request.
@kpot, I think you really understood the problem well. Having a default device is handy for a lot of use cases, but I agree that when building more robust networks, we need to handle which device the execution is running on.
I don't think we should remove all operations finishing with *_device, but I think we should update the init method to receive the device as a parameter. I think it will remove most errors, but keep the simple tensor API in place. We could consider changing the naming though. Like instead of having ones and ones_device, we could have ones and ones_default where ones_default would execute on the default device, therefore encouraging explicitly giving the device. What are your thoughts?
@nathanielsimard Yeah, I was thinking about renaming functions like zeros_device to just zeros (this would help to avoid potential mistakes during coding), with the old ones becoming zeros_default_device. The full list isn't that large and consists of:
- from_bool
- arange
- arange_step
- zeros
- ones
- full
- empty
- from_data
- from_floats
But then I dismissed the idea as breaking too much. Now I think, though, that this should be done as you described. People will anyway have to update not only their init() function's signature but also add the device to all internal calls. And renaming zeros_device to just zeros will actually make this work a bit easier: one will have to update only the arguments, not touching the name of the function.
Ok, I'll keep this in mind.
I think the solution is a breaking change, so let's do it correctly! Let's rename the functions well following the naming format:
zeros(shape) -> zeros_default(shape)
zeros_device(shape, device) -> zeros(shape, device)
I would also clean up the backend traits to only have zeros(shape, device) and delete the other method. It can be implemented in the tensor API folder instead: burn_tensor/tensor/api/.
Like you said, we would need to pass the device to all init functions as well.
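The renaming scheme above can be illustrated with a small mock (the `Device` and `MockTensor` types here are placeholders, not Burn's): `zeros` always takes a device, and `zeros_default` is the explicit opt-in to the default device, implemented once on top of it.

```rust
// Mock stand-in for `B::Device`.
#[derive(Clone, Default, Debug, PartialEq)]
struct Device;

struct MockTensor {
    shape: Vec<usize>,
    device: Device,
}

// Backend-level constructor: the device is mandatory, matching the
// proposed `zeros(shape, device)`.
fn zeros(shape: Vec<usize>, device: &Device) -> MockTensor {
    MockTensor { shape, device: device.clone() }
}

// Convenience wrapper implemented once in the tensor API layer,
// matching the proposed `zeros_default(shape)`.
fn zeros_default(shape: Vec<usize>) -> MockTensor {
    zeros(shape, &Device::default())
}
```

This keeps the backend trait minimal (only the device-taking variant) while the ergonomic default lives in the tensor API, as suggested above.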
Hi @nathanielsimard! I'm sorry for the delay; it took quite a while for me to complete the refactoring, in which, as you described, zeros(shape) becomes zeros_default(shape) and zeros_device(shape, device) becomes zeros(shape, device). I have updated the ONNX importer as well, to support model instantiation on a specific device. The Book was adjusted too.
@kpot Just reviewed the PR, well done 👏
@nathanielsimard @antimora Hi all! Another crazy large PR (hopefully, the last one), this time removing all _devauto functions. Please let me know if you have any thoughts about deserialization, which currently remains the only place that relies on automatic allocation of tensors on a default device.
Just to follow up on this issue, the only remaining part would be to set the device for deserialization.
I think we can safely close this issue since #1081 and @nathanielsimard took care of device specification in the records.