
Array metadata to use protobuf

Open gatesn opened this issue 11 months ago • 7 comments

We still use FlexBuffers heavily as a hangover from the prototype.

FlexBuffers is essentially a binary form of JSON, as opposed to a schema'd format like FlatBuffers. Because field names are stored inline, our files often include hundreds of thousands of occurrences of the word "validity" and other such dumb things.

Our encoding metadata should become increasingly stable, and we should pack it as small as possible.

This probably means using packed C-like structs; e.g. a primitive array needs very few bits for its metadata.

We should then optimise the surrounding FlatBuffer to support inline metadata if it fits into a u64, and otherwise store a length and a metadata buffer.

gatesn avatar Dec 31 '24 04:12 gatesn

Here's the current Rust struct size, in bytes (not #[repr(packed)], just the current state), of all the metadatas:

ALP: 24
ALPRD: 40
Bool: 2
Chunked: 8
Constant: 56
DateTimeParts: 3
Dict: 16
FSST: 16
RunEndBool: 24
RunEnd: 24
ZigZag: 0
Extension: 0
List: 16
Null: 0
Primitive: 1
Sparse: 80
Struct: 1
VarBin: 16
VarBinView: 32
BitPacked: 24
RoaringBool: 0
Delta: 16
FoR: 24
RoaringInt: 1
ByteBool: 1

There are some obvious things that stand out:

  • ConstantMetadata is 56 bytes because of ScalarValue, which is 56 bytes because of BufferString -> ByteBuffer. SparseMetadata is 80 bytes for similar reasons.
  • A lot of metadatas have an offset field which is a usize; that's variable, and I don't think we can reasonably truncate it statically to something smaller. The best we can probably do here is a varint.

In the example uploaded in #1749, I decoded the first chunk of the file and dumped some tree-style information as JSON (to make it easier to parse).

Here's the file:

chunks.json.gz

Metadata appears to be ~1/3 of the overall size of the chunk's data:

[image: metadata size relative to the chunk's data]

Using direct structs (without packing or varints) would net us back a large chunk of that 33%:

[image: estimated sizes using direct structs]

a10y avatar Jan 03 '25 02:01 a10y

Are you saying this is a good indication of the target size of metadata? It would be worth adding metadata_bytes().len() to that table, since that's what's actually in the file itself. Hint: it's a LOT worse...

The other thing to think about is how FlatBuffers works. Storing a Vec<u8> I think always means storing a u64 length + the bytes themselves. Instead, we could have two separate fields, metadata: u64 and metadata_bytes: Option<Vec<u8>>, and read from metadata_bytes if it's present, otherwise use the compact 64 bits.
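
For illustration, a rough sketch of that read path; the ArrayNode struct and field names here are hypothetical stand-ins for the actual flatbuffer accessors:

use std::borrow::Cow;

// Hypothetical shape of the two proposed fields on an array node.
struct ArrayNode {
    metadata: u64,                   // compact inline metadata
    metadata_bytes: Option<Vec<u8>>, // only present when it doesn't fit in 8 bytes
}

impl ArrayNode {
    /// Prefer the out-of-line buffer when present, otherwise treat the
    /// inline u64 as (up to) 8 bytes of packed metadata.
    fn metadata(&self) -> Cow<'_, [u8]> {
        match &self.metadata_bytes {
            Some(bytes) => Cow::Borrowed(bytes.as_slice()),
            None => Cow::Owned(self.metadata.to_le_bytes().to_vec()),
        }
    }
}

fn main() {
    let small = ArrayNode { metadata: 0x0102, metadata_bytes: None };
    assert_eq!(small.metadata().len(), 8);

    let large = ArrayNode { metadata: 0, metadata_bytes: Some(vec![0u8; 24]) };
    assert_eq!(large.metadata().len(), 24);
}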

Constant I'm not too worried about because it's typically such a huge improvement over any other compression strategy that 56 bytes seems worth it.

gatesn avatar Jan 03 '25 10:01 gatesn

The JSON file's metadata_nbytes is array.metadata_bytes().len() collected for every array.

That total size is about 33% of the data size. My second comment was just trying to ballpark what the overhead would be if we switched to a naive binary format (like struct serialization or bincode). 9% seems a little high, but I haven't looked at Parquet's overhead.

a10y avatar Jan 03 '25 13:01 a10y

I've spent some time today poking at a few potential ways we can compact metadata.

I'm holding the ideal criteria to be these 3 things:

  1. Must be compact. Avoid self-describing formats that require embedding schemas on the wire. Schemas should be provided externally, e.g. compiled into the binary, or provided once in the file footer.
  2. Defined dynamically. Externally authored encodings must be able to participate, which precludes defining the metadata in the top-level flatbuffer directly.
  3. Evolvable. We want to allow new versions of an encoding's metadata for changes that are small enough not to break the data layout. Field renames, reordering, and adding new optional fields should all be supported; deletions or other incompatible changes will mean minting a new encoding version.

Of note, it is not a constraint that the format support zero-copy deserialization. We should prefer a packed format that forces a copy to read it if it saves us space on the wire.

Options

The top options I've found so far are:

  1. Using repr(packed) structs
  2. Flatbuffers
  3. rkyv
  4. Avro

I've gone and implemented a simple version of ListMetadata for each of these, which is defined as:

pub struct ListMetadata {
    validity: ValidityMetadata,  // u8 enum
    elements_len: u64,
    offset_ptype: PType, // u8 enum
}

I've copied this as written from the existing Vortex implementation. Note that, because the u64 field comes second, a C-like layout ends up with an additional 8 bytes of unnecessary padding. I evaluated both this field order and an optimized order that transposes the first two fields; it didn't make a significant difference to the results.
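
To make the padding concrete, here's a quick sketch (struct names are just for illustration) of the sizes under a C-like layout, which is what FlatBuffers structs and rkyv's archived types end up with; Rust's default repr is free to reorder fields:

use std::mem::size_of;

// Field order as written: the u64 forces 8-byte alignment, so there are
// 7 bytes of padding after `validity` and 7 more after `offset_ptype`.
#[repr(C)]
struct ListMetadataAsWritten {
    validity: u8,
    elements_len: u64,
    offset_ptype: u8,
}

// Transposing the first two fields leaves only 6 bytes of tail padding.
#[repr(C)]
struct ListMetadataReordered {
    elements_len: u64,
    validity: u8,
    offset_ptype: u8,
}

fn main() {
    assert_eq!(size_of::<ListMetadataAsWritten>(), 24);
    assert_eq!(size_of::<ListMetadataReordered>(), 16);
}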

repr(packed)

For every XYZMetadata, we'd generate a PackedXYZMetadata type that contains the same fields, but with a #[repr(packed)] alignment spec.

Due to the caveats with packed structs in Rust, we would only use this for encoding on the wire; on load, we'd read the values into our normally aligned structs for in-memory access.

There's no schema-evolution story per se; we'd need to implement a way to check that subsequent changes to an encoding's packed metadata definition don't cause breakage.
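
A minimal sketch of what the on-the-wire twin could look like (PackedListMetadata here is a hypothetical hand-written version of what we'd generate):

use std::mem::size_of;

// All alignment padding is dropped: 1 + 8 + 1 = 10 bytes on the wire.
#[repr(C, packed)]
struct PackedListMetadata {
    validity: u8,      // ValidityMetadata as its u8 discriminant
    elements_len: u64,
    offset_ptype: u8,  // PType as its u8 discriminant
}

fn main() {
    assert_eq!(size_of::<PackedListMetadata>(), 10);

    // On load, fields are copied out by value into the normally aligned
    // in-memory struct; taking a reference to a packed field is an error.
    let wire = PackedListMetadata { validity: 1, elements_len: 42, offset_ptype: 3 };
    let elements_len = wire.elements_len; // unaligned read via copy
    assert_eq!(elements_len, 42);
}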

FlatBuffers

FlatBuffers allows for a struct type, which is more compact than a table but has the downside of not being evolvable.

Here's how we'd implement this for ListMetadata:

enum ValidityMetadata: ubyte {
    NonNullable = 0,
    AllValid = 1,
    AllInvalid = 2,
    Array = 3,
}

enum PType: ubyte {
    U8,
    U16,
    U32,
    U64,
    I8,
    I16,
    I32,
    I64,
    F16,
    F32,
    F64
}

struct ListMetadata {
    validity: ValidityMetadata;
    elements_len: uint64;
    offset_ptype: PType;
}

This will result in a 24-byte message for every ListMetadata, as a FlatBuffers struct is laid out basically the same as repr(C) and adds alignment padding internally. If we use the optimized field ordering that transposes the first two fields, we'd end up with 16 bytes per value.

rkyv

rkyv is a zero-copy serialization format that was originally built for SWC. It effectively takes any struct and builds a new version of it that is repr(C) and explicitly aligned. The only reason I evaluated it is that it's incredibly easy to adopt for a Rust struct via a derive macro: you just #[derive(Archive)].

Because the archived type is the same as a normally aligned Rust struct, it will still be 24 bytes (or 16 with the transposed field order), just like the FlatBuffers definition.

Avro

Avro has a reduced set of supported types compared to FlatBuffers, but it uses zigzag varint encoding, so most values end up getting reduced in size.

Here is the Avro schema definition of ListMetadata:

{
    "type": "record",
    "name": "ListMetadata",
    "fields": [
        {
            "name": "validity",
            "type": {
                "type": "enum",
                "name": "MemValidityMetadata",
                "symbols": ["NonNullable", "AllValid", "AllInvalid", "Array"]
            }
        },
        {"name": "elements_len", "type": "long" }, // note that long is i64
        {
            "name": "offset_ptype",
            "type": {
                "type": "enum",
                "name": "MemPType",
                "symbols": ["U8", "U16", "U32", "U64", "I8", "I16", "I32", "I64", "F16", "F32", "F64"]
            }
        }
    ]
}

If we encode the record {validity: AllValid, elements_len: 4194303, offset_ptype: U64}, we can do so in only 6 bytes:

[2, 80, 80, 80, 4, 6]

This gives us the compactness of a packed layout plus the benefit of Avro's varint encoding.
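
For reference, a minimal sketch (not the actual Avro implementation) of the zigzag + varint scheme Avro uses for int/long values, which is where the savings come from:

// Zigzag maps signed values to unsigned so that small magnitudes (positive
// or negative) become small unsigned numbers.
fn zigzag(v: i64) -> u64 {
    ((v as u64) << 1) ^ ((v >> 63) as u64)
}

// LEB128-style varint: 7 bits per byte, high bit marks "more bytes follow".
fn write_varint(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7F) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

fn main() {
    // An elements_len of 4_194_303 takes 4 bytes instead of a fixed 8,
    // and the two u8 enum values take 1 byte each.
    let mut buf = Vec::new();
    write_varint(zigzag(1), &mut buf);         // validity: AllValid
    write_varint(zigzag(4_194_303), &mut buf); // elements_len
    write_varint(zigzag(3), &mut buf);         // offset_ptype: U64
    assert_eq!(buf.len(), 6);
}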

Another nice goodie is first-class support for enforcing forward/backward schema compatibility, which would allow us to pretty easily write some checks to make sure encoding authors only make backward-compatible changes to their metas.


BLAB: Avro seems like a pretty good option based on the criteria expressed above. The big downside is that it is Adding One More Thing ™️. I think we can do some DX work to abstract the Avro-ness away from the user, such as:

  • proc-macros to generate and check schemas
  • helpers to convert between unsigned Rust and signed Avro types (as well as the 8- and 16-bit types that Avro doesn't support by default)

Or, we could make our own encoding that does this, plus maybe a little extra like packed bitsets.

Curious for thoughts

a10y avatar Jan 06 '25 23:01 a10y

If we're considering Avro, why not protobuf?

The other thing to consider is that we shouldn't enforce the format. Not all encodings need to use the same representation. For the "stable" builtin encodings we could go for a packed format.

gatesn avatar Jan 07 '25 06:01 gatesn

Protobuf is similar in size, but it doesn't quite get all the way there since it needs to add field numbers:

[images: encoded-size comparison of protobuf vs. Avro]

Agree on not enforcing the format, but I do think it's likely that whatever choices we make for the core encodings will be picked up by downstream encoding authors.

a10y avatar Jan 07 '25 14:01 a10y

Since we changed Vortex arrays to hold metadata in-memory, we no longer need zero-copy reads from serialized metadata. This relaxes the constraints, and we should probably just use protobuf by default for most metadatas. This gives us maximum compatibility in case we ever do have an implementation in another language.
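
For illustration, here's a hypothetical prost-style sketch of what a protobuf-backed ListMetadata could look like; the field numbers, types, and the choice of prost are assumptions, not the actual Vortex definition:

use prost::Message;

// Hypothetical protobuf message; enums are carried as their integer values.
#[derive(Clone, PartialEq, Message)]
struct ListMetadata {
    #[prost(uint32, tag = "1")]
    validity: u32,
    #[prost(uint64, tag = "2")]
    elements_len: u64,
    #[prost(uint32, tag = "3")]
    offset_ptype: u32,
}

fn main() {
    let meta = ListMetadata { validity: 1, elements_len: 4_194_303, offset_ptype: 3 };
    // Each non-default field costs a one-byte tag plus a varint value,
    // which is why this lands a few bytes above the equivalent Avro encoding.
    let bytes = meta.encode_to_vec();
    println!("{} bytes: {:?}", bytes.len(), bytes);
}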

gatesn avatar Apr 14 '25 08:04 gatesn

This is done

robert3005 avatar Jun 03 '25 15:06 robert3005