Split Tensor component into several archetypes

Open Wumpf opened this issue 1 year ago • 1 comments

  • Split out of https://github.com/rerun-io/rerun/issues/6388

Related to:

  • https://github.com/rerun-io/rerun/issues/2341

We generate archetypes and components for every tensor variant (TensorF32, TensorU8, etc.) and make sure they all share the same Visualizer:

archetype TensorU8 {
    buffer: BufferU8,
    
    // One of these
    shape: TensorShape,
    shape: Vec<TensorDimension>,
}

component BufferU8 {
    data: [u8],
}

archetype TensorF32 {
    buffer: BufferF32,

    // One of these
    shape: TensorShape,
    shape: Vec<TensorDimension>,
}

component BufferF32 {
    data: [f32],
}
  • The mechanics of sharing one visualizer are a bit unclear. Should the visualizer just listen to several indicators / archetypes? That breaks the 1:1 relationship that we were striving for. Can we revisit this later?
  • This will break some of the "use this tensor like an image" cases that we allow today. Mitigate only as far as is meaningful.
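As a rough illustration of the "several archetypes, one visualizer" idea, here is a minimal Rust sketch. Everything in it is an assumption made for illustration: the `TensorArchetype` trait, the `value_range` function, and the simplified `TensorShape` are hypothetical and are not Rerun's actual types or codegen output.

```rust
/// Hypothetical stand-in for the `TensorShape` component from the proposal.
#[derive(Debug, Clone, PartialEq)]
pub struct TensorShape(pub Vec<u64>);

/// Typed buffer components, one per element type.
pub struct BufferU8 {
    pub data: Vec<u8>,
}
pub struct BufferF32 {
    pub data: Vec<f32>,
}

/// One archetype per element type, as in the proposal.
pub struct TensorU8 {
    pub buffer: BufferU8,
    pub shape: TensorShape,
}
pub struct TensorF32 {
    pub buffer: BufferF32,
    pub shape: TensorShape,
}

/// Hypothetical trait that lets a single visualizer consume any tensor archetype.
pub trait TensorArchetype {
    fn shape(&self) -> &TensorShape;
    fn num_values(&self) -> usize;
    /// Returns the value at `index` widened to f64, or `None` if out of bounds.
    fn value_as_f64(&self, index: usize) -> Option<f64>;
}

impl TensorArchetype for TensorU8 {
    fn shape(&self) -> &TensorShape {
        &self.shape
    }
    fn num_values(&self) -> usize {
        self.buffer.data.len()
    }
    fn value_as_f64(&self, index: usize) -> Option<f64> {
        self.buffer.data.get(index).map(|&v| f64::from(v))
    }
}

impl TensorArchetype for TensorF32 {
    fn shape(&self) -> &TensorShape {
        &self.shape
    }
    fn num_values(&self) -> usize {
        self.buffer.data.len()
    }
    fn value_as_f64(&self, index: usize) -> Option<f64> {
        self.buffer.data.get(index).map(|&v| f64::from(v))
    }
}

/// A toy "visualizer" step shared by all archetypes: computes the (min, max)
/// value range, or `None` for an empty tensor.
pub fn value_range(tensor: &dyn TensorArchetype) -> Option<(f64, f64)> {
    let mut range: Option<(f64, f64)> = None;
    for index in 0..tensor.num_values() {
        let v = tensor.value_as_f64(index)?;
        range = Some(match range {
            None => (v, v),
            Some((lo, hi)) => (lo.min(v), hi.max(v)),
        });
    }
    range
}
```

The per-type `impl` blocks are the kind of repetition that code-gen would be expected to produce; the visualizer side stays written once.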

Impact on Mesh's texture: Log an Image archetype at the same spot instead.

Detailed rationale (via @jleibs on https://github.com/rerun-io/rerun/issues/6388#issuecomment-2134003885):

Most of the choices for working with tensors fall into one of 4 categories.

Typed buffer, multiple data-types (the proposal)

Pros:

  • When processing a chunk the raw arrow data is much easier to work with
  • Opportunity to align with the official arrow spec for tensor representation
  • Aligns with our long-term direction of wanting to have multiple types and datatype conversions

Cons:

  • Multi-datatype representation means we must either proliferate typed components or introduce datatype conversions.

The current hypothesis is that proliferating types is a known challenge and can be mostly automated with a mixture of code-gen and some helper code, whereas datatype conversions are an unknown challenge.

Still, this puts us on a path where, once we support multi-typed components, we mostly delete a bunch of code and everything gets simpler. Any type conversions move from visualizer-space to data-query-space, but the types and arrow representations we work with don't actually need to change.

Untyped buffer with type-id

Pros:

  • Avoids arrow unions while maintaining a single datatype.

Cons:

  • Forces arrow users to do annoying user-space datatype casting.
  • Doesn't align with our long-term goals
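To make the "user-space datatype casting" cost concrete, here is a minimal Rust sketch of what reading such a buffer might look like. The `UntypedTensorBuffer`, `TypeId`, and `CastError` names are hypothetical, and the little-endian byte layout is an assumption; none of this is an actual Rerun API.

```rust
/// Hypothetical type tag stored next to the raw bytes.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TypeId {
    U8,
    F32,
}

/// Hypothetical untyped buffer: a single datatype (bytes) plus a type-id.
pub struct UntypedTensorBuffer {
    pub type_id: TypeId,
    pub bytes: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub enum CastError {
    /// The buffer holds a different element type than requested.
    WrongType(TypeId),
    /// The byte length is not a multiple of the element size.
    BadLength(usize),
}

/// Every consumer has to repeat this check-and-reinterpret step itself.
/// Assumes the bytes are little-endian f32 values.
pub fn read_f32(buf: &UntypedTensorBuffer) -> Result<Vec<f32>, CastError> {
    if buf.type_id != TypeId::F32 {
        return Err(CastError::WrongType(buf.type_id));
    }
    if buf.bytes.len() % 4 != 0 {
        return Err(CastError::BadLength(buf.bytes.len()));
    }
    Ok(buf
        .bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}
```

With a typed buffer, the check and the reinterpretation disappear because the column's datatype already carries that information.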

Typed buffer with union

Pros:

  • Status quo. Already works.

Cons:

  • Forces arrow users to do annoying, poorly supported union operations when loading or reading tensors.
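For comparison, the union approach maps naturally onto a Rust enum, which is pleasant on the Rust side even though the arrow union it serializes to is what causes the friction described above. The following is a simplified stand-in with only three variants, written for illustration; the variant list and the helper methods are assumptions, not the actual definition.

```rust
/// Simplified stand-in for a union-typed tensor buffer (three variants only).
#[derive(Debug, Clone, PartialEq)]
pub enum TensorBuffer {
    U8(Vec<u8>),
    U16(Vec<u16>),
    F32(Vec<f32>),
}

impl TensorBuffer {
    /// Every operation has to dispatch over all variants.
    pub fn num_values(&self) -> usize {
        match self {
            TensorBuffer::U8(data) => data.len(),
            TensorBuffer::U16(data) => data.len(),
            TensorBuffer::F32(data) => data.len(),
        }
    }

    /// Readers that want one concrete type must unwrap the union first.
    pub fn as_f32(&self) -> Option<&[f32]> {
        match self {
            TensorBuffer::F32(data) => Some(data.as_slice()),
            _ => None,
        }
    }
}
```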

Wumpf, Jul 09 '24 13:07

An alternative is to keep the many Buffer components (BufferU8, BufferU16, …) but have only one Tensor archetype:

archetype Tensor {
    shape: TensorShape,
    dimension_names: Option<DimensionNames>,
    
    // Set exactly one of these:
    buffer_u8: Option<BufferU8>,
    buffer_u16: Option<BufferU16>,
    buffer_u32: Option<BufferU32>,
    …

    color_model: Option<ColorModel>, // to interpret this tensor as an image
}

I believe this will lead to a lot less duplicated code.
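One consequence of this layout is that "set exactly one of these" becomes a runtime invariant rather than something the type system enforces. A minimal Rust sketch of such a check follows; the plain `Vec` fields, the `TensorError` type, and the `validate` method are hypothetical simplifications, and only three buffer fields are shown.

```rust
/// Simplified stand-in for the single `Tensor` archetype sketched above.
#[derive(Default)]
pub struct Tensor {
    pub shape: Vec<u64>,
    pub buffer_u8: Option<Vec<u8>>,
    pub buffer_u16: Option<Vec<u16>>,
    pub buffer_u32: Option<Vec<u32>>,
}

#[derive(Debug, PartialEq)]
pub enum TensorError {
    NoBuffer,
    /// More than one buffer field was set; carries how many.
    MultipleBuffers(usize),
    /// The buffer length does not match the product of the shape.
    LengthMismatch { expected: u64, actual: u64 },
}

impl Tensor {
    /// Checks that exactly one buffer is set and that it matches the shape.
    pub fn validate(&self) -> Result<(), TensorError> {
        let lens = [
            self.buffer_u8.as_ref().map(|b| b.len()),
            self.buffer_u16.as_ref().map(|b| b.len()),
            self.buffer_u32.as_ref().map(|b| b.len()),
        ];
        // Keep only the lengths of the buffers that are actually set.
        let set: Vec<usize> = lens.iter().flatten().copied().collect();
        match set.as_slice() {
            [] => Err(TensorError::NoBuffer),
            [len] => {
                let expected: u64 = self.shape.iter().product();
                let actual = *len as u64;
                if actual == expected {
                    Ok(())
                } else {
                    Err(TensorError::LengthMismatch { expected, actual })
                }
            }
            many => Err(TensorError::MultipleBuffers(many.len())),
        }
    }
}
```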

emilk, Jul 15 '24 12:07

Most of this is done; the rest is covered by:

  • https://github.com/rerun-io/rerun/issues/9119

emilk, Feb 24 '25 16:02