# Introduce `re_sorbet`: a Chunk/Dataframe conversion crate to handle arrow metadata
## Context

We currently have two different arrow-metadata encoding schemas:
- In query API results we use: `sorbet.path`, `sorbet.semantic_family`, `sorbet.logical_type`, `sorbet.semantic_type`.
- In chunk transport we use: `rerun.entity_path`, `rerun.archetype_name`, `rerun.archetype_field_name`, and the name of the column as the component name.
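For concreteness, here is a minimal sketch (using arrow-rs) of how the same data column looks under each encoding. The field name, entity path, and all metadata values are illustrative, not taken from a real recording:

```rust
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field};

fn main() {
    // Query API results: everything lives in `sorbet.*` metadata keys.
    let query_api_field = Field::new("position", DataType::Float32, true).with_metadata(
        HashMap::from([
            ("sorbet.path".to_owned(), "/points".to_owned()),
            ("sorbet.semantic_family".to_owned(), "Points3D".to_owned()),
            ("sorbet.logical_type".to_owned(), "float32".to_owned()),
            ("sorbet.semantic_type".to_owned(), "Position3D".to_owned()),
        ]),
    );

    // Chunk transport: `rerun.*` metadata keys, and the component name is
    // smuggled in as the field name itself.
    let chunk_transport_field = Field::new("rerun.components.Position3D", DataType::Float32, true)
        .with_metadata(HashMap::from([
            ("rerun.entity_path".to_owned(), "/points".to_owned()),
            ("rerun.archetype_name".to_owned(), "rerun.archetypes.Points3D".to_owned()),
            ("rerun.archetype_field_name".to_owned(), "positions".to_owned()),
        ]));

    println!("{query_api_field:#?}\n{chunk_transport_field:#?}");
}
```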
Keeping track of which encoding is required at different points in the pipeline is hard and only adds confusion while bringing no meaningful utility.
## Proposal

We will continue to maintain two separate encodings, but we will normalize them and align them with the `rerun` names, as the more generic sorbet concepts currently lead to confusion. We will stop using the sorbet name until we have cycles to make it a more universal spec. We use `rerun` names even in the dataframe API because these are Rerun-specific APIs that we are currently exposing.

We might as well use this as an opportunity to add versioning. Both variants will include two new schema-metadata-level properties, `rerun.schema_version` and `rerun.batch_variant`, so that we can differentiate them.
### Proposed v1: `RerunChunk`-encoded data

#### Schema-level metadata

- `rerun.schema_version` = 1
- `rerun.batch_variant` = "chunk"
- `rerun.id` = the chunk ID (required)
- `rerun.entity_path` = the entity path for the whole chunk
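A minimal sketch of attaching this schema-level metadata with arrow-rs. `chunk_schema_metadata` is a hypothetical helper, and the `row_id` field is illustrative; the real crate would take `ChunkId`/`EntityPath` rather than strings:

```rust
use std::collections::HashMap;

use arrow::datatypes::{DataType, Field, Schema};

/// Hypothetical helper: build the proposed schema-level metadata for a chunk.
fn chunk_schema_metadata(chunk_id: &str, entity_path: &str) -> HashMap<String, String> {
    HashMap::from([
        ("rerun.schema_version".to_owned(), "1".to_owned()),
        ("rerun.batch_variant".to_owned(), "chunk".to_owned()),
        ("rerun.id".to_owned(), chunk_id.to_owned()),
        ("rerun.entity_path".to_owned(), entity_path.to_owned()),
    ])
}

fn main() {
    let fields = vec![Field::new("row_id", DataType::FixedSizeBinary(16), false)];
    let schema = Schema::new_with_metadata(fields, chunk_schema_metadata("abc123", "/points"));
    println!("{schema:#?}");
}
```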
#### Control column-level metadata

- `rerun.kind` = "control"
#### Index column-level metadata

- `rerun.kind` = "index" | "time" (for backwards compatibility)
- `rerun.is_sorted`
- `rerun.index_name` = if unset, we use the field name for backwards compatibility
- `rerun.dataframe_column_name` = the original column name from a converted dataframe (optional)
#### Data column-level metadata

- `rerun.kind` = "data"
- `rerun.archetype_name`
- `rerun.archetype_field_name`
- `rerun.component_name` = if unset, we use the field name for backwards compatibility
- `rerun.dataframe_column_name` = the original column name from a converted dataframe (optional)

All data columns MUST be wrapped as a `ListArray` type.
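A minimal sketch of that wrapping with arrow-rs, where each row becomes a single-element list. `wrap_in_list_array` is a hypothetical helper, not the crate's actual API:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Float64Array, ListArray};
use arrow::buffer::OffsetBuffer;
use arrow::datatypes::Field;

/// Hypothetical helper: wrap a flat array so every row becomes a one-element list.
fn wrap_in_list_array(values: ArrayRef) -> ListArray {
    let field = Arc::new(Field::new("item", values.data_type().clone(), true));
    // Offsets [0, 1, 2, …, len]: each list slot covers exactly one value.
    let offsets = OffsetBuffer::from_lengths(std::iter::repeat(1).take(values.len()));
    ListArray::new(field, offsets, values, None)
}

fn main() {
    let scalars: ArrayRef = Arc::new(Float64Array::from(vec![1.0, 2.0, 3.0]));
    let wrapped = wrap_in_list_array(scalars);
    assert_eq!(wrapped.len(), 3); // still one row per original value
}
```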
### Proposed v1: `RerunDataframe`-encoded data

On INGEST paths (`rr.send_dataframe`), we want to be generally forgiving and make a best effort to interpret an arrow payload as a dataframe, even if it's missing top-level metadata. On OUTPUT paths (dataframe query results), we should always include the full metadata.
#### Schema-level metadata

- `rerun.schema_version` = 1 (optional; if missing we assume the latest version)
- `rerun.batch_variant` = "dataframe" (optional; if missing we assume "dataframe")
- `rerun.entity_path` = (optional) defines the entity path for any column where the entity path is not set
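A minimal sketch of that forgiving interpretation on ingest; `interpret_schema_metadata` and `LATEST_SCHEMA_VERSION` are hypothetical names:

```rust
use std::collections::HashMap;

const LATEST_SCHEMA_VERSION: u64 = 1; // hypothetical constant

/// Best-effort interpretation: every schema-level key is optional on ingest
/// and falls back to a sensible default.
fn interpret_schema_metadata(
    metadata: &HashMap<String, String>,
) -> (u64, String, Option<String>) {
    let version = metadata
        .get("rerun.schema_version")
        .and_then(|v| v.parse().ok())
        .unwrap_or(LATEST_SCHEMA_VERSION);
    let variant = metadata
        .get("rerun.batch_variant")
        .cloned()
        .unwrap_or_else(|| "dataframe".to_owned());
    let default_entity_path = metadata.get("rerun.entity_path").cloned();
    (version, variant, default_entity_path)
}

fn main() {
    // Completely missing metadata still yields a usable interpretation.
    let (version, variant, path) = interpret_schema_metadata(&HashMap::new());
    assert_eq!((version, variant.as_str(), path), (1, "dataframe", None));
}
```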
#### Index column-level metadata

- `rerun.kind` = "index"
- `rerun.is_sorted`
- `rerun.index_name` = if unset, we use the field name
#### Data column-level metadata

- `rerun.kind` = "data"
- `rerun.archetype_name`
- `rerun.archetype_field_name`
- `rerun.component_name`
- `rerun.entity_path` (optional)
Ideally, data columns of mono-types in the dataframe representation should NOT need to be list-wrapped. Requiring users to structure their data this way is a significantly larger burden than simply adding metadata tags to their columns; doing the wrapping on ingest in Rerun is a significantly better experience.
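On ingest that normalization could look like the sketch below, reusing the hypothetical `wrap_in_list_array` helper from above:

```rust
use std::sync::Arc;

use arrow::array::ArrayRef;
use arrow::datatypes::DataType;

/// Accept both shapes on ingest: pass list columns through untouched and
/// wrap mono columns so the chunk store only ever sees `ListArray`s.
fn normalize_data_column(column: ArrayRef) -> ArrayRef {
    match column.data_type() {
        DataType::List(_) => column, // already chunk-shaped
        _ => Arc::new(wrap_in_list_array(column)) as ArrayRef,
    }
}
```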
## Plan

We will introduce a new standalone crate with utilities for identifying, validating, and converting between the "Dataframe" and "Chunk" representations.

The primary transformations that need to happen for v1 (see the sketch after this list):
- If more than one entity is present, split the data into separate chunks.
- Duplicate any index column into each chunk.
- For each chunk, inject a `rerun.id` for the `ChunkId`.
- For each chunk, synthesize a control column of Rerun `RowId`s.
- For any datatypes which are not `ListArray`s, introduce a single-element list wrapper.
- (Maybe?) If no index column was provided, synthesize a monotonic sequence index.
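A minimal sketch of the entity split with arrow-rs. `split_by_entity` is hypothetical, only handles the field-level metadata keys proposed above, and elides the `ChunkId`/`RowId` injection steps:

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

use arrow::array::{ArrayRef, RecordBatch};
use arrow::datatypes::{Field, Schema};

/// Hypothetical helper: split a dataframe-encoded batch into one batch per
/// entity path, duplicating every index column into each output.
fn split_by_entity(batch: &RecordBatch) -> BTreeMap<String, RecordBatch> {
    let schema = batch.schema();

    let mut index_columns: Vec<(Arc<Field>, ArrayRef)> = Vec::new();
    let mut per_entity: BTreeMap<String, Vec<(Arc<Field>, ArrayRef)>> = BTreeMap::new();

    for (field, column) in schema.fields().iter().zip(batch.columns()) {
        match field.metadata().get("rerun.kind").map(String::as_str) {
            Some("index") => index_columns.push((field.clone(), column.clone())),
            Some("data") => {
                // A real implementation would fall back to the schema-level
                // `rerun.entity_path`; here we require the field-level key.
                if let Some(entity_path) = field.metadata().get("rerun.entity_path") {
                    per_entity
                        .entry(entity_path.clone())
                        .or_default()
                        .push((field.clone(), column.clone()));
                }
            }
            _ => {} // control columns, `rerun.id`, and `RowId` synthesis elided
        }
    }

    per_entity
        .into_iter()
        .map(|(entity_path, data_columns)| {
            let (fields, columns): (Vec<_>, Vec<_>) =
                index_columns.iter().cloned().chain(data_columns).unzip();
            let schema = Arc::new(Schema::new(fields));
            let batch = RecordBatch::try_new(schema, columns).expect("row counts match");
            (entity_path, batch)
        })
        .collect()
}
```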
The following places are good candidates for using this new crate:
- https://github.com/rerun-io/rerun/blob/main/crates/store/re_chunk_store/src/dataframe.rs
- https://github.com/rerun-io/rerun/blob/main/rerun_py/rerun_sdk/rerun/dataframe.py
  - Most of this Python code should go away. Rather than using `send_columns` under the hood, we should pass the arrow-encoded dataframe directly to rust and handle this in https://github.com/rerun-io/rerun/blob/main/rerun_py/src/dataframe.rs
- https://github.com/rerun-io/rerun/blob/343da4aee536f75d1c4bfe1439f377bbe3a5c8ae/crates/store/re_grpc_client/src/lib.rs#L407
Some frameworks have restrictions on what constitutes a valid column name. For example, lance doesn't permit top-level field names to contain a `"."`.
This is a good argument for Rerun being agnostic about the choice of field name and encoding all metadata as proper metadata fields.
We discussed today how to handle mono components, i.e. the common case of single instances (e.g. scalars). We want to make this as ergonomic as possible.
- We will start supporting RecordBatches with mono-types in them, and do the list-array wrapping on the way into the chunk store, so that code in viewer-space never has to worry about mono-types.
- Future: we can also introduce a sorbet tag, e.g. `rerun.mono`, and use it to unwrap list arrays when generating query results, so that these datatypes round-trip properly from the perspective of users (a sketch follows below).
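A minimal sketch of that unwrapping, assuming the hypothetical `rerun.mono` tag has already been checked on the column's metadata:

```rust
use arrow::array::{Array, ArrayRef, ListArray};

/// Hypothetical helper: undo single-element list wrapping for a column tagged
/// `rerun.mono`. Returns `None` if any row does not hold exactly one element.
fn unwrap_mono_list(list: &ListArray) -> Option<ArrayRef> {
    let offsets = list.offsets();
    if offsets.windows(2).any(|w| w[1] - w[0] != 1) {
        return None; // not actually mono-shaped
    }
    // Account for sliced arrays: the first offset need not be zero.
    let start = offsets[0] as usize;
    Some(list.values().slice(start, list.len()))
}
```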
This is the direction I'm heading:

```rust
pub struct ChunkSchema {
    chunk_id: ChunkId,
    entity_path: EntityPath,
    columns: Vec<ComponentColumnDescriptor>,
    …
}

impl From<ChunkSchema> for ArrowSchema { … }
impl TryFrom<ArrowSchema> for ChunkSchema { … }

// Replaces `TransportChunk`
struct ChunkBatch {
    schema: ChunkSchema,
    batch: ArrowRecordBatch,
}

impl AsRef<ArrowRecordBatch> for ChunkBatch { … }
impl TryFrom<ArrowRecordBatch> for ChunkBatch { … }
```
And similar for dataframes:

```rust
pub struct DataframeSchema {
    // Each data column is guaranteed to have an entity path.
    columns: Vec<ComponentColumnDescriptor>,
    …
}

impl From<DataframeSchema> for ArrowSchema { … }
impl TryFrom<ArrowSchema> for DataframeSchema { … }

struct DataframeBatch {
    schema: DataframeSchema,
    batch: ArrowRecordBatch,
}

impl Deref for DataframeBatch { type Target = ArrowRecordBatch; … }
impl TryFrom<ArrowRecordBatch> for DataframeBatch { … }
```
My comment on #8965 might be relevant here too:
https://github.com/rerun-io/rerun/pull/8965#discussion_r1946130488

> It would be really great to have that information as actual types. Conceptually, it would be great to have `struct Dataframe<I: Index>`, where `Index`, in this case, would be something like `ResourceId`. For data frames that don't have an explicit index (the order is the index), we could default to `struct Dataframe<I: Index = ()>`.
>
> Of course `proto` does not give us generics, but maybe we could "monomorphize" the few instances where this is the case. We could then have specialized constructors + deserializers that ensure these invariants.
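A minimal sketch of what that could look like on the Rust side; `Index` and `ResourceId` are hypothetical stand-ins, as in the quoted comment:

```rust
/// Hypothetical marker trait for index columns.
trait Index {}

/// "The order is the index": dataframes without an explicit index.
impl Index for () {}

/// Hypothetical stand-in for a real index type.
struct ResourceId(u64);
impl Index for ResourceId {}

/// A dataframe whose index type is carried in the type system.
struct Dataframe<I: Index = ()> {
    index: Vec<I>,
    // … data columns elided …
}
```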
What's still to be done is the dataframe portion of this. Should sync with @timsaucer here.
Also some misc cleanup:

- [ ] `Chunk::to_chunk_batch` should be infallible