
Introduce `re_sorbet`: a Chunk/Dataframe conversion crate to handle arrow metadata

jleibs opened this issue 11 months ago

Context

We currently have 2 different arrow-metadata encoding schemas:

  • In query API results we use:
    • sorbet.path, sorbet.semantic_family, sorbet.logical_type, sorbet.semantic_type
  • In chunk transport we use:
    • rerun.entity_path, rerun.archetype_name, rerun.archetype_field_name, with the column name serving as the component name

Keeping track of which encoding is required at each point in the pipeline is hard; the split adds confusion while bringing no meaningful utility.
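
To make the split concrete, here is a minimal sketch (assuming the arrow-rs Field API; the field names, entity path, and values are illustrative) of the same column tagged under each scheme:

use std::collections::HashMap;

use arrow::datatypes::{DataType, Field};

fn main() {
    // Query-API style: generic "sorbet.*" keys.
    let query_field = Field::new("positions", DataType::Float32, true).with_metadata(
        HashMap::from([
            ("sorbet.path".to_owned(), "/points".to_owned()),
            ("sorbet.semantic_family".to_owned(), "data".to_owned()),
        ]),
    );

    // Chunk-transport style: "rerun.*" keys, with the component name as the field name.
    let chunk_field = Field::new("rerun.components.Position3D", DataType::Float32, true)
        .with_metadata(HashMap::from([
            ("rerun.entity_path".to_owned(), "/points".to_owned()),
            ("rerun.archetype_name".to_owned(), "rerun.archetypes.Points3D".to_owned()),
            ("rerun.archetype_field_name".to_owned(), "positions".to_owned()),
        ]));

    println!("{query_field:#?}\n{chunk_field:#?}");
}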

Proposal

We will continue to maintain two separate encodings, but we will work to normalize them and align with the rerun names, as more generic sorbet concepts currently lead to confusion.

We will stop using the sorbet names until we have cycles to make this a more universal spec. We use rerun names even in the dataframe API because these are Rerun-specific APIs that we are currently exposing.

We might as well use this as an opportunity to add versioning. Both variants will include two new schema-metadata-level properties, rerun.schema_version and rerun.batch_variant, so that we can differentiate them.

Proposed v1: RerunChunk encoded data.

Schema-level metadata

  • rerun.schema_version = 1
  • rerun.batch_variant = "chunk"
  • rerun.id = The Chunk Id (required)
  • rerun.entity_path = The entity-path for the whole chunk

Control Column-level metadata

  • rerun.kind = "control"

Index Column-level metadata

  • rerun.kind = "index" | "time" (the latter for backwards compat)
  • rerun.is_sorted
  • rerun.index_name = If unset, we use the Field-name for backwards compatibility
  • rerun.dataframe_column_name = The original column from a converted dataframe (Optional)

Data Column-level metadata

  • rerun.kind = "data"
  • rerun.archetype_name
  • rerun.archetype_field_name
  • rerun.component_name = If unset, we use the Field-name for backwards compatibility
  • rerun.dataframe_column_name = The original column from a converted dataframe (Optional)

All data columns MUST be wrapped as a ListArray type.
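
As a reference point, here is a hedged sketch of what a fully-tagged v1 chunk schema could look like in arrow-rs (the field names, entity path, and id value are illustrative, not prescribed by the proposal):

use std::collections::HashMap;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};

fn chunk_schema_sketch() -> Schema {
    // Schema-level metadata, per the proposal above.
    let schema_metadata = HashMap::from([
        ("rerun.schema_version".to_owned(), "1".to_owned()),
        ("rerun.batch_variant".to_owned(), "chunk".to_owned()),
        ("rerun.id".to_owned(), "<chunk-id>".to_owned()), // the ChunkId (illustrative)
        ("rerun.entity_path".to_owned(), "/points".to_owned()),
    ]);

    // One column of each kind.
    let row_id = Field::new("row_id", DataType::FixedSizeBinary(16), false)
        .with_metadata(HashMap::from([("rerun.kind".to_owned(), "control".to_owned())]));

    let log_time = Field::new("log_time", DataType::Int64, true).with_metadata(HashMap::from([
        ("rerun.kind".to_owned(), "index".to_owned()),
        ("rerun.is_sorted".to_owned(), "true".to_owned()),
        ("rerun.index_name".to_owned(), "log_time".to_owned()),
    ]));

    // Data columns MUST be list-wrapped.
    let positions = Field::new(
        "positions",
        DataType::List(Arc::new(Field::new("item", DataType::Float32, true))),
        true,
    )
    .with_metadata(HashMap::from([
        ("rerun.kind".to_owned(), "data".to_owned()),
        ("rerun.archetype_name".to_owned(), "rerun.archetypes.Points3D".to_owned()),
        ("rerun.archetype_field_name".to_owned(), "positions".to_owned()),
        ("rerun.component_name".to_owned(), "rerun.components.Position3D".to_owned()),
    ]));

    Schema::new(vec![row_id, log_time, positions]).with_metadata(schema_metadata)
}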

Proposed v1: RerunDataframe encoded data.

On INGEST paths (rr.send_dataframe), we want to be generally forgiving and make a best-effort attempt to interpret an arrow payload as a dataframe, even if it's missing top-level metadata. On OUTPUT paths (dataframe query results) we should always include the full metadata; a sketch of the lenient read follows the schema-level list below.

Schema-level metadata

  • rerun.schema_version = 1 (Optional) If missing we assume the latest version
  • rerun.batch_variant = "dataframe" (Optional) If missing we assume a "dataframe"
  • rerun.entity_path = (Optional) Defines the entity_path for any column where that entity_path is not set
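
A sketch of that leniency for the schema-level keys, assuming arrow-rs (the helper is hypothetical, not the actual re_sorbet API):

use arrow::datatypes::Schema;

// Hypothetical helper: read the schema-level keys, falling back to the
// documented defaults when they are missing.
fn read_schema_metadata(schema: &Schema) -> (u64, String) {
    let metadata = schema.metadata();

    let version = metadata
        .get("rerun.schema_version")
        .and_then(|v| v.parse().ok())
        .unwrap_or(1); // missing: assume the latest version

    let variant = metadata
        .get("rerun.batch_variant")
        .cloned()
        .unwrap_or_else(|| "dataframe".to_owned()); // missing: assume a dataframe

    (version, variant)
}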

Index Column-level metadata

  • rerun.kind = "index"
  • rerun.is_sorted
  • rerun.index_name = If unset, we use the Field-name

Data Column-level metadata

  • rerun.kind = "data"
  • rerun.archetype_name
  • rerun.archetype_field_name
  • rerun.component_name
  • rerun.entity_path (optional)

Ideally, data columns of mono-types in the dataframe representation should NOT need to be list-wrapped. Requiring users to structure data in this way is a significantly larger burden than simply adding metadata tags to their columns. Doing the wrapping on-ingest in Rerun is a much better experience.
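
For illustration, here is a minimal sketch of that on-ingest wrapping in arrow-rs (the helper name is hypothetical); it assumes the input column has not been sliced and has no top-level nulls:

use std::sync::Arc;

use arrow::array::{Array, ArrayRef, ListArray};
use arrow::buffer::OffsetBuffer;
use arrow::datatypes::Field;

// Hypothetical helper: wrap a mono (non-list) column so that every row
// becomes a single-element list.
fn list_wrap(values: ArrayRef) -> ListArray {
    let field = Arc::new(Field::new("item", values.data_type().clone(), true));
    // One element per list: offsets [0, 1, 2, …, len].
    let offsets = OffsetBuffer::from_lengths(std::iter::repeat(1).take(values.len()));
    ListArray::new(field, offsets, values, None)
}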

Plan

We will introduce a new standalone crate which includes utilities for identifying, validating, and converting between "Dataframe" and "Chunk" representations.

The primary transformations that need to happen for v1 (the first step is sketched after this list):

  • If more than one entity is present, split into separate chunks.
    • Any index column is duplicated to each chunk.
  • For each chunk, inject a rerun.id for the ChunkId
  • For each chunk, synthesize a control-column of Rerun RowIds
  • For any datatypes which are not ListArrays, introduce a single-element List wrapper.
  • (Maybe?) If no index column was provided, synthesize a monotonic sequence index.
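
The first step could look roughly like the sketch below, assuming arrow-rs and per-column rerun.entity_path / rerun.kind metadata (the helper is hypothetical). Chunk-id injection, row-id synthesis, and list-wrapping would then run per split:

use std::collections::BTreeMap;

use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Hypothetical helper: split a dataframe batch into one batch per entity
// path, duplicating every index column into each split.
fn split_by_entity_path(batch: &RecordBatch) -> Result<Vec<RecordBatch>, ArrowError> {
    let schema = batch.schema();

    // Index columns go into every split.
    let index_columns: Vec<usize> = (0..schema.fields().len())
        .filter(|&i| {
            schema.field(i).metadata().get("rerun.kind").map(String::as_str) == Some("index")
        })
        .collect();

    // Group data columns by their per-column entity path.
    let mut by_entity: BTreeMap<String, Vec<usize>> = BTreeMap::new();
    for (i, field) in schema.fields().iter().enumerate() {
        if let Some(entity_path) = field.metadata().get("rerun.entity_path") {
            by_entity.entry(entity_path.clone()).or_default().push(i);
        }
    }

    by_entity
        .into_values()
        .map(|mut data_columns| {
            let mut column_indices = index_columns.clone();
            column_indices.append(&mut data_columns);
            batch.project(&column_indices)
        })
        .collect()
}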

The following places are good candidates for using this new crate:

  • https://github.com/rerun-io/rerun/blob/main/crates/store/re_chunk_store/src/dataframe.rs
  • https://github.com/rerun-io/rerun/blob/main/rerun_py/rerun_sdk/rerun/dataframe.py
    • Most of this Python code should go away. Rather than using send_columns under the hood, we should pass the arrow-encoded dataframe directly to Rust and handle this in https://github.com/rerun-io/rerun/blob/main/rerun_py/src/dataframe.rs
  • https://github.com/rerun-io/rerun/blob/343da4aee536f75d1c4bfe1439f377bbe3a5c8ae/crates/store/re_grpc_client/src/lib.rs#L407

jleibs commented Jan 20 '25

Some frameworks restrict what counts as a valid column name. For example, Lance doesn't permit top-level field names to contain a ".".

This is a good argument for Rerun being agnostic about the choice of field-name and encoding all metadata as proper metadata fields.

jleibs commented Jan 21 '25

Today we discussed how to handle mono components, i.e. the common case of single instances (e.g. scalars). We want to make this as ergonomic as possible.

  • We will start supporting RecordBatches with mono-types in them, and do the list-array wrapping on the way into the chunkstore so that code in viewer-space never has to worry about mono-types.
  • Future: we can also introduce a sorbet tag, e.g. rerun.mono, and use it to unwrap list-arrays when generating query results so that these datatypes round-trip properly from the user's perspective (see the sketch after this list).
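
A sketch of what that future unwrap could look like in arrow-rs (hypothetical helper; assumes the column has not been sliced, so offsets start at zero):

use arrow::array::{Array, ArrayRef, ListArray};

// Hypothetical helper for the proposed rerun.mono round-trip: if every list
// in the column has exactly one element (and none are null), return the
// unwrapped inner values; otherwise keep the list representation.
fn unwrap_mono(list: &ListArray) -> Option<ArrayRef> {
    let all_unit = (0..list.len()).all(|i| list.value_length(i) == 1);
    (all_unit && list.null_count() == 0).then(|| list.values().clone())
}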

emilk commented Jan 28 '25

This is the direction I'm heading:

pub struct ChunkSchema {
    chunk_id: ChunkId,
    entity_path: EntityPath,
    columns: Vec<ComponentColumnDescriptor>,
    …
}
impl From<ChunkSchema> for ArrowSchema { … }
impl TryFrom<ArrowSchema> for ChunkSchema { … }

// Replaces TransportChunk
struct ChunkBatch {
    schema: ChunkSchema,
    batch: ArrowRecordBatch,
}
impl AsRef<ArrowRecordBatch> for ChunkBatch { … }
impl TryFrom<ArrowRecordBatch> for ChunkBatch { … }

And similar for dataframes:

pub struct DataframeSchema {
    // Each data column is guaranteed to have an entity path
    columns: Vec<ComponentColumnDescriptor>,
    …
}
impl From<DataframeSchema> for ArrowSchema { … }
impl TryFrom<ArrowSchema> for DataframeSchema { … }

struct DataframeBatch {
    schema: DataframeSchema,
    batch: ArrowRecordBatch,
}
impl Deref for DataframeBatch { type Target = ArrowRecordBatch; … }
impl TryFrom<ArrowRecordBatch> for DataframeBatch { … }

emilk commented Feb 05 '25

My comment on #8965 might be relevant here too:

https://github.com/rerun-io/rerun/pull/8965#discussion_r1946130488

It would be really great to have that information as actual types. Conceptually, we would have struct Dataframe<I: Index>, where Index, in this case, would be something like ResourceId. For data frames that don't have an explicit index (the order is the index), we could default to struct Dataframe<I: Index = ()>.

Of course proto does not give us generics, but maybe we could "monomorphize" the few instances where this is the case. We could then have specialized constructors + deserializers that ensure these invariants.
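
A toy sketch of the idea (the trait, types, and fields are hypothetical stand-ins), with () standing in for "row order is the index":

// Hypothetical typed-index sketch.
trait Index {}
impl Index for () {}

struct ResourceId(u64);
impl Index for ResourceId {}

struct Dataframe<I: Index = ()> {
    index: Vec<I>,
    // columns elided
}

fn main() {
    // Explicitly indexed:
    let _by_resource: Dataframe<ResourceId> = Dataframe { index: vec![ResourceId(1)] };
    // Implicit index (row order is the index):
    let _implicit: Dataframe = Dataframe { index: Vec::new() };
}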

grtlr commented Feb 07 '25

What's still to be done is the dataframe portion of this. Should sync with @timsaucer here.

emilk commented Mar 18 '25

Also some misc cleanup:

  • [ ] Chunk::to_chunk_batch should be infallible

emilk commented Apr 17 '25