vortex icon indicating copy to clipboard operation
vortex copied to clipboard

Better data layout for nested/repeated schemas

Open a10y opened this issue 1 month ago • 9 comments

Problem: Currently, Vortex has poor support for nested data types.

Consider the following schema:

D describe events;
┌─────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │                                                                 column_type                                                                 │  null   │   key   │ default │  extra  │
│   varchar   │                                                                   varchar                                                                   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ id          │ VARCHAR                                                                                                                                     │ YES     │ NULL    │ NULL    │ NULL    │
│ type        │ VARCHAR                                                                                                                                     │ YES     │ NULL    │ NULL    │ NULL    │
│ actor       │ STRUCT(id BIGINT, login VARCHAR, display_login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)                               │ YES     │ NULL    │ NULL    │ NULL    │
│ repo        │ STRUCT(id BIGINT, "name" VARCHAR, url VARCHAR)                                                                                              │ YES     │ NULL    │ NULL    │ NULL    │
│ payload     │ STRUCT("action" VARCHAR, issue STRUCT(url VARCHAR, repository_url VARCHAR, labels_url VARCHAR, comments_url VARCHAR, events_url VARCHAR, …  │ YES     │ NULL    │ NULL    │ NULL    │
│ public      │ BOOLEAN                                                                                                                                     │ YES     │ NULL    │ NULL    │ NULL    │
│ created_at  │ TIMESTAMP                                                                                                                                   │ YES     │ NULL    │ NULL    │ NULL    │
│ org         │ STRUCT(id BIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)                                                      │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┘

Currently the way our StructStrategy works, this is written with the following layout tree:

StructLayout
|__ "payload" child (Zoned) dtype=struct{id, name,}
	|__ data: FlatLayout
	|__ zones: ZoneMap(null_count)
... (other columns)

Putting the entire Struct column inside of a FlatLayout has many downsides

  • No stats propagation. Currently we can't pushdown operations over the nested fields
  • Terrible random access performance. All nested fields get read together when the FlatLayout is deserialized. This is really bad, and the problems gets worse the larger and bushier the schema
Image

We end up having a similar problem for schemas with LIST fields, both nested or even top-level. Our layout strategies are not aware of complex types and it causes pain where users have to work around it by flattening the schema themselves.

Solution Space

RepDef

Parquet and Lance solve the this problem using repetition and definition levels. There's a lot of prior art here so I won't go into detail, but there are good links covering the basics here and here.

RepDef exists to solve the dual problem of wasted space and multiple buffer access

Weston's article makes a good case for repdef: namely, that it reduces the amount of data that needs to be scanned for a random cell access. You only need 1-2 IOs max to read a specific cell value, and then some linear amount of CPU decoding to find the N-th value and its validity bit.

What RepDef is not as good at is zero-copy deserialization. Converting back into Arrow requires you to unravel the repdef back into the hierarchical format.

Component Shredding

A more simple solution involves a layout that performs full component shredding, and holds onto metadata required to unshred the nested fields back together at scan time.

For example, the layout would be responsible for taking something like the following (JSON)

{
  "a": {
      "b": [
         { "c": true, "d": 1 },
         null,
         { "c": false, "d": 2 },
         { "c": null, "d": 3 },
      ]
   }
}      

Into the following components

ShreddedLayout
  |__ a.b.validity
  |__ a.b.offsets
  |__ a.b.c.values
  |__ a.b.c.validity
  |__ a.b.d.values

Projecting specific components would be pretty simple here, and we can do it zero-copy directly from the file, for example reading the array SELECT d FROM unnest(a.b) involves

  • Decode a.b.validity
  • Decode a.b.d.values

Random access requires extra steps, decoding potentially several buffers to access a single cell. For example, say we wanted to access the value at a.b[0].c, this would require:

  • Decode a.b.validity, check if a.b[0] is null
  • If not, decode a.b.offsets. Find offset for 0-th list
  • Decode a.b.c.validity at the 0-th offset position, to check if null
  • Decode a.b.c.values at the 0-th offset position to retrieve the value

In all, this takes 4 random IOs, on top of the initial IO to retrieve the footer

a10y avatar Oct 08 '25 15:10 a10y

There's also the associated expression pushdown we'd need to add to support things like UNNEST and the like to support projection of nested columns

a10y avatar Oct 08 '25 15:10 a10y

We have getitem, pack and merge that let you recompose and pick apart nested structs. There’s a variation on repdef described in Procella paper where they do some kind of cumulative sum.

robert3005 avatar Oct 08 '25 15:10 robert3005

I'm surprised our current strategy doesn't create nested struct layouts? Is that what you're saying @a10y ?

Ah, I think we lost that behavior at some point... Back when we have a LayoutStrategy thing take a DType, we would always unpack a struct dtype.


Also worth nothing that we can unpack into custom vortex rep-def based encodings, rather than always directly to Arrow. And often we can unpack to arrow faster since we use the view-based encodings that support random access. Meaning we can pre-allocate the output arrays and just write directly into the non-null elements.

gatesn avatar Oct 08 '25 16:10 gatesn

Would nested a nested variant type always have to be fully shredded? One of the nice things about parquet's new variant is that it can be partially shredded. We've had customers do things like put UUIDs in mapping keys (mostly by accident, e.g. unpacking a dictionary in Python) but that can result in thousands of unique keys per row.

adriangb avatar Oct 15 '25 12:10 adriangb

I think variant types are separate from what we consider nested schemas (as variants are essentially untyped, there is no nesting of the schema itself).

In that case, yes, we would support partial shredding. Since Vortex is logically typed, this also means we could support arbitrary un-shredded formats that can represent variant-typed data. For example, JSON, MsgPack, Parquet's variant binary format, CBOR, etc etc.

For nested schemas, we're really talking about strongly typed schemas that can often be very sparse in their actual data.

gatesn avatar Oct 15 '25 13:10 gatesn

Okay sorry for polluting the issue then. Do you have an issue to track variant / untyped nested data? I see it mentioned in https://github.com/vortex-data/vortex/issues/2116 but no issue linked

adriangb avatar Oct 15 '25 13:10 adriangb

How would Lists work with this unnest?

joseph-isaacs avatar Oct 16 '25 17:10 joseph-isaacs

If we just made the writer actually write Struct of Struct layouts where appropriate why would that not be preferable to this?

joseph-isaacs avatar Oct 16 '25 17:10 joseph-isaacs

Hello , I wanna know about the layout thing, Is StructLayout only response to Struct data type ? Based on my understanding, are there two types of layouts here: one related to data types, such as struct, array (list), flat, and another unrelated one, such as zonemap, dict, chunked layouts that are more focused on index information?

amorynan avatar Nov 25 '25 09:11 amorynan