Split Tensor component into several archetypes
- Split out of https://github.com/rerun-io/rerun/issues/6388
Related to:
- https://github.com/rerun-io/rerun/issues/2341
We generate archetypes and components for all tensor variants (TensorF32, TensorU8, etc) and make sure they share the same Visualizer:
archetype TensorU8 {
buffer: BufferU8,
// One of these
shape: TensorShape,
shape: Vec<TensorDimension>,
}
component BufferU8 {
data: [u8],
}
archetype TensorF32 {
buffer: BufferF32,
// One of these
shape: TensorShape,
shape: Vec<TensorDimension>,
}
component BufferF32 {
data: [f32],
}
- The mechanics of sharing one visualizer are a bit unclear. Should the visualizer just listen to several indicators / archetypes? That breaks the 1:1 relationship that we were striving for. Can revisit later?
- This will break some "use this tensor like an image" cases that we allow today. Mitigate only as far as meaningful.
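One way to picture several typed buffers feeding the same visualizer is a small trait that every buffer type implements. This is only a sketch: the `TensorBuffer` trait, its methods, and the `value_range` function are hypothetical names invented for illustration and are not part of the Rerun codebase. It also sidesteps the indicator question above entirely.

```rust
/// Hypothetical trait (not a Rerun API): the minimal surface a shared
/// tensor visualizer would need from any typed buffer.
trait TensorBuffer {
    /// Human-readable element type, e.g. for UI labels.
    fn dtype_name(&self) -> &'static str;
    /// Number of elements in the buffer.
    fn len(&self) -> usize;
    /// Element at `index`, widened to f64, or `None` if out of bounds.
    fn get_f64(&self, index: usize) -> Option<f64>;
}

struct BufferU8 {
    data: Vec<u8>,
}

struct BufferF32 {
    data: Vec<f32>,
}

impl TensorBuffer for BufferU8 {
    fn dtype_name(&self) -> &'static str {
        "u8"
    }
    fn len(&self) -> usize {
        self.data.len()
    }
    fn get_f64(&self, index: usize) -> Option<f64> {
        self.data.get(index).map(|&v| f64::from(v))
    }
}

impl TensorBuffer for BufferF32 {
    fn dtype_name(&self) -> &'static str {
        "f32"
    }
    fn len(&self) -> usize {
        self.data.len()
    }
    fn get_f64(&self, index: usize) -> Option<f64> {
        self.data.get(index).map(|&v| f64::from(v))
    }
}

/// Stand-in for the shared visualizer logic: computes the (min, max) value
/// range of any buffer, regardless of its element type.
/// Returns `None` for an empty buffer.
fn value_range(buffer: &dyn TensorBuffer) -> Option<(f64, f64)> {
    let mut range: Option<(f64, f64)> = None;
    for i in 0..buffer.len() {
        let v = buffer.get_f64(i)?;
        range = Some(match range {
            None => (v, v),
            Some((lo, hi)) => (lo.min(v), hi.max(v)),
        });
    }
    range
}
```

With this shape, `value_range(&BufferU8 { data: vec![3, 200, 17] })` and `value_range(&BufferF32 { data: vec![-1.5, 0.25] })` go through the same code path. Widening every element to `f64` is a simplification chosen for brevity; a real visualizer would likely want to avoid per-element dynamic dispatch.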
Impact on Mesh's texture: Log an Image archetype at the same spot instead.
Detailed rationale (via @jleibs on https://github.com/rerun-io/rerun/issues/6388#issuecomment-2134003885):
Most of the choices for working with tensors fall into one of 4 categories.
Typed buffer, multiple data-types (the proposal)
Pros:
- When processing a chunk the raw arrow data is much easier to work with
- Opportunity to align with the official arrow spec for tensor representation
- Aligns with our long-term direction of wanting to have multiple types and datatype conversions
Cons:
- Multi-datatype representation means we must either proliferate typed components or introduce datatype conversions.
The current hypothesis is that proliferating types is a known challenge and can be mostly automated with a mixture of code-gen and some helper code, whereas datatype conversions are an unknown challenge.
Still, this puts us on a pathway where once we support multi-typed components, we mostly delete a bunch of code and everything gets simpler. Any type conversions move from visualizer-space to data-query-space, but the types and arrow representations we work with don't actually need to change.
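To make the "proliferating types can be mostly automated" hypothesis concrete, here is a minimal sketch that uses a declarative macro as a stand-in for the code generator. The `typed_buffer!` macro, the `ELEMENT_SIZE` constant, and the `num_bytes` helper are assumptions made up for this example; Rerun's actual code-gen works from type definitions and is not shown here.

```rust
/// Hypothetical stand-in for code-gen: emits one buffer struct per element
/// type, each with the same small set of helpers.
macro_rules! typed_buffer {
    ($($name:ident => $elem:ty),+ $(,)?) => {
        $(
            #[derive(Debug, Clone, PartialEq)]
            pub struct $name {
                pub data: Vec<$elem>,
            }

            impl $name {
                /// Size in bytes of a single element of this buffer type.
                pub const ELEMENT_SIZE: usize = std::mem::size_of::<$elem>();

                pub fn new(data: Vec<$elem>) -> Self {
                    Self { data }
                }

                /// Total payload size in bytes.
                pub fn num_bytes(&self) -> usize {
                    self.data.len() * Self::ELEMENT_SIZE
                }
            }
        )+
    };
}

// Adding a new tensor element type is then a one-line change.
typed_buffer! {
    BufferU8 => u8,
    BufferU16 => u16,
    BufferF32 => f32,
}
```

For example, `BufferU16::new(vec![1, 2, 3]).num_bytes()` evaluates to 6. The point of the sketch is only that the per-type boilerplate is mechanical; it says nothing about how datatype conversions would be handled.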
Untyped buffer with type-id
Pros:
- Avoids arrow unions while maintaining a single datatype.
Cons:
- Forces arrow users to do annoying user-space datatype casting.
- Doesn't align with our long-term goals
Typed buffer with union
Pros:
- Status quo. Already works.
Cons:
- Forces arrow users to do annoying, poorly supported union operations when loading or reading tensors.
An alternative is to have many Buffer components (BufferU8, BufferU16, …) but only one Tensor archetype:
archetype Tensor {
shape: TensorShape,
dimension_names: Option<DimensionNames>,
// Set exactly one of these:
buffer_u8: Option<BufferU8>,
buffer_u16: Option<BufferU16>,
buffer_u32: Option<BufferU32>,
…
color_model: Option<ColorModel>, // to interpret this tensor as an image
}
I believe this will lead to a lot less duplicated code.
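The single-archetype alternative relies on the "set exactly one of these" rule, which the type system does not enforce when every buffer is an `Option`. A minimal sketch of a runtime check follows, under stated assumptions: the shape is simplified to a `Vec<u64>`, the buffers are plain `Vec`s rather than components, only three buffer types are listed, and `TensorError` and `validate` are hypothetical names.

```rust
/// Simplified stand-in for the single `Tensor` archetype.
#[derive(Debug, Default)]
struct Tensor {
    shape: Vec<u64>,
    // Exactly one of these is expected to be set:
    buffer_u8: Option<Vec<u8>>,
    buffer_u16: Option<Vec<u16>>,
    buffer_u32: Option<Vec<u32>>,
}

#[derive(Debug, PartialEq)]
enum TensorError {
    /// No buffer field was set.
    NoBuffer,
    /// More than one buffer field was set; carries how many.
    MultipleBuffers(usize),
    /// The buffer length does not match the product of the shape.
    LengthMismatch { expected: u64, actual: u64 },
}

impl Tensor {
    /// Checks that exactly one buffer is set and that its element count
    /// equals the product of the shape dimensions.
    fn validate(&self) -> Result<(), TensorError> {
        // Collect the length of every buffer that is present.
        let lens = [
            self.buffer_u8.as_ref().map(|b| b.len()),
            self.buffer_u16.as_ref().map(|b| b.len()),
            self.buffer_u32.as_ref().map(|b| b.len()),
        ];
        let set: Vec<usize> = lens.iter().flatten().copied().collect();

        match set.as_slice() {
            [] => Err(TensorError::NoBuffer),
            [len] => {
                let expected: u64 = self.shape.iter().product();
                let actual = *len as u64;
                if expected == actual {
                    Ok(())
                } else {
                    Err(TensorError::LengthMismatch { expected, actual })
                }
            }
            many => Err(TensorError::MultipleBuffers(many.len())),
        }
    }
}
```

A tensor with `shape: vec![2, 3]` and a six-element `buffer_u8` passes validation, while setting both `buffer_u8` and `buffer_u16` yields `MultipleBuffers(2)`. This is the trade-off of the design: less duplicated archetype code in exchange for an invariant that has to be checked at runtime.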
Most of this is done; the rest is covered by:
- https://github.com/rerun-io/rerun/issues/9119