Audit serde of array metadata

Open gatesn opened this issue 1 year ago • 6 comments

We currently implement naive serde using Rust serde + flexbuffers by default. Many arrays can pack their metadata much more tightly. This is an overview issue to track auditing each one:

  • [ ] Bool
  • [ ] Chunked
  • [ ] Constant
  • [ ] Datetime
  • [ ] Extension
  • [ ] Primitive
  • [ ] Sparse
  • [ ] Struct
  • [ ] VarBin
  • [ ] VarBinView
  • [ ] ALP
  • [ ] Datetime Parts
  • [ ] Dict
  • [ ] BitPacking
  • [ ] FoR
  • [ ] FSST
  • [ ] Delta
  • [ ] RoaringInt
  • [ ] RoaringBool
  • [ ] RunEnd
  • [ ] ZigZag

gatesn avatar Jun 20 '24 09:06 gatesn

I'm gonna try making the Validity metadata for Structs much smaller.

danking avatar Sep 20 '24 15:09 danking

We might eventually want to squeeze all metadata into 32 bits. We can reserve 0xffffffff to indicate that the metadata has spilled into a buffer.
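
A minimal sketch of how such a sentinel could work. The names `SPILLED`, `MetadataHeader`, `encode_inline` and `decode_header` are hypothetical and are not part of Vortex; the only detail taken from the comment above is that 0xffffffff is reserved to mean "look in a buffer".

```rust
/// Reserved header value: the metadata did not fit inline and lives in a buffer.
const SPILLED: u32 = 0xffff_ffff;

/// Hypothetical decoded view of a 32-bit metadata header.
#[derive(Debug, PartialEq, Eq)]
enum MetadataHeader {
    /// The metadata bits themselves, stored inline.
    Inline(u32),
    /// The metadata was too large; it must be read from a separate buffer.
    Spilled,
}

/// Encode inline metadata. Returns `None` when the bits collide with the
/// sentinel, in which case the caller has to spill to a buffer instead.
fn encode_inline(bits: u32) -> Option<u32> {
    if bits == SPILLED {
        None
    } else {
        Some(bits)
    }
}

/// Decode a header word into either inline bits or the spill marker.
fn decode_header(header: u32) -> MetadataHeader {
    if header == SPILLED {
        MetadataHeader::Spilled
    } else {
        MetadataHeader::Inline(header)
    }
}
```

The cost of the scheme is one unusable bit pattern: any encoding whose packed form could legitimately be all ones would need to avoid that value or always spill.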

danking avatar Sep 20 '24 16:09 danking

I think we can spare 64 bits per encoding

robert3005 avatar Sep 20 '24 16:09 robert3005

For most arrays, validity metadata is just a single bit for whether or not a validity child is defined.
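
As a rough illustration of the single-bit idea, the sketch below keeps the flag in bit 0 of a flags byte. The constant and function names are hypothetical, and the choice of bit 0 is an assumption made only for this example.

```rust
/// Hypothetical flags byte: bit 0 records whether a validity child is present.
const HAS_VALIDITY_CHILD: u8 = 0b0000_0001;

/// Set or clear the validity-child bit, leaving the other bits untouched.
fn set_validity_child(flags: u8, present: bool) -> u8 {
    if present {
        flags | HAS_VALIDITY_CHILD
    } else {
        flags & !HAS_VALIDITY_CHILD
    }
}

/// Report whether the validity-child bit is set.
fn has_validity_child(flags: u8) -> bool {
    (flags & HAS_VALIDITY_CHILD) != 0
}
```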

gatesn avatar Sep 21 '24 10:09 gatesn

RunEnd

Remove length; DType => PType (the ends have to be an integer type).

pub struct RunEndMetadata {
    validity: ValidityMetadata,
    ends_dtype: DType,
    num_runs: usize,
    offset: usize,
    length: usize,
}
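
To make the "DType => PType" and "remove length" notes concrete, here is a sketch of what a slimmer struct might look like. `EndsPType` is a stand-in for Vortex's `PType` (restricted to unsigned integers for brevity), `SlimRunEndMetadata` is a hypothetical name, validity is left out to keep the example short, and the sketch assumes the length can be recovered from the array itself.

```rust
/// Stand-in for Vortex's `PType`, limited to unsigned integer widths.
/// A fieldless `repr(u8)` enum serializes as a single byte tag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
enum EndsPType {
    U8 = 0,
    U16 = 1,
    U32 = 2,
    U64 = 3,
}

impl EndsPType {
    /// Recover the type from its one-byte tag; unknown tags yield `None`.
    fn from_tag(tag: u8) -> Option<Self> {
        match tag {
            0 => Some(Self::U8),
            1 => Some(Self::U16),
            2 => Some(Self::U32),
            3 => Some(Self::U64),
            _ => None,
        }
    }
}

/// Hypothetical slimmed-down run-end metadata: no `length` field
/// (assumed to be stored on the array) and a one-byte type tag
/// instead of a full `DType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SlimRunEndMetadata {
    ends_ptype: EndsPType,
    num_runs: usize,
    offset: usize,
}
```

The same substitution applies wherever a child is known to be primitive, which is the case for most of the "DType => PType" notes in this list.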

ALP

pub struct ALPMetadata {
    exponents: Exponents,
    encoded_dtype: DType,
    patches_dtype: Option<DType>,
}

RunEndBool

remove length, dtype => ptype.

pub struct RunEndBoolMetadata {
    start: bool,
    validity: ValidityMetadata,
    ends_dtype: DType,
    num_runs: usize,
    offset: usize,
    length: usize,
}

RoaringInt

pub struct RoaringIntMetadata {
    ptype: PType,
}

FoR

Scalar => ScalarValue, use self.dtype(). Buffer, BufferString, List should go into the Array buffer.

pub struct FoRMetadata {
    reference: Scalar,
    shift: u8,
}

Dict

DType => PType

pub struct DictMetadata {
    codes_dtype: DType,
    values_len: usize,
}

DateTimeParts

DType => PType.

pub struct DateTimePartsMetadata {
    days_dtype: DType,
    seconds_dtype: DType,
    subseconds_dtype: DType,
}

FSST

DType => PType.

pub struct FSSTMetadata {
    symbols_len: usize,
    codes_dtype: DType,
    uncompressed_lengths_dtype: DType,
}

Null

remove len.

pub struct NullMetadata {
    len: usize,
}

Primitive

pub struct PrimitiveMetadata {
    validity: ValidityMetadata,
}

VarBin

DType => PType

pub struct VarBinMetadata {
    validity: ValidityMetadata,
    offsets_dtype: DType,
    bytes_len: usize,
}

Delta

pub struct DeltaMetadata {
    validity: ValidityMetadata,
    deltas_len: usize,
    offset: usize, // must be <1024
}

RoaringBool

Remove length

pub struct RoaringBoolMetadata {
    length: usize,
}

BitPacked

Remove length.

pub struct BitPackedMetadata {
    validity: ValidityMetadata,
    bit_width: usize,
    offset: usize, // Known to be <1024
    length: usize, // Store end padding instead <1024
    has_patches: bool,
}
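
Given the bounds noted in the comments, these fields fit comfortably in a single 32-bit word. The sketch below is one possible layout, not the layout Vortex uses: `PackedBitPacked` and its bit positions are hypothetical, validity is omitted for brevity, and the 64-bit cap on `bit_width` is an assumption (the widest integer being packed).

```rust
/// Hypothetical packed form of the BitPacked metadata fields.
/// Layout, low bits first:
///   bit_width (7 bits) | offset (10 bits) | end_padding (10 bits) | has_patches (1 bit)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PackedBitPacked {
    bit_width: u8,    // assumed 0..=64
    offset: u16,      // < 1024
    end_padding: u16, // < 1024, stored instead of the length
    has_patches: bool,
}

impl PackedBitPacked {
    /// Pack into 28 bits of a `u32`; `None` if any field is out of range.
    fn pack(self) -> Option<u32> {
        if self.bit_width > 64 || self.offset >= 1024 || self.end_padding >= 1024 {
            return None;
        }
        Some(
            u32::from(self.bit_width)
                | (u32::from(self.offset) << 7)
                | (u32::from(self.end_padding) << 17)
                | (u32::from(self.has_patches) << 27),
        )
    }

    /// Inverse of `pack`: extract each field with a shift and a mask.
    fn unpack(bits: u32) -> Self {
        Self {
            bit_width: (bits & 0x7f) as u8,
            offset: ((bits >> 7) & 0x3ff) as u16,
            end_padding: ((bits >> 17) & 0x3ff) as u16,
            has_patches: ((bits >> 27) & 1) == 1,
        }
    }
}
```

Because only 28 bits are used, the packed word can never be 0xffffffff, so this layout would not collide with a reserved all-ones value.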

ByteBool

pub struct ByteBoolMetadata {
    validity: ValidityMetadata,
}

ZigZag

pub struct ZigZagMetadata;

Extension

DType => PType

pub struct ExtensionMetadata {
    storage_dtype: DType,
}

Struct

Remove length.

pub struct StructMetadata {
    length: usize,
    validity: ValidityMetadata,
}

Chunked

pub struct ChunkedMetadata {
    num_chunks: usize,
}

Sparse

remove len, DType => PType, Scalar => ScalarValue.

pub struct SparseMetadata {
    indices_dtype: DType,
    // Offset value for patch indices as a result of slicing
    indices_offset: usize,
    indices_len: usize,
    len: usize,
    fill_value: Scalar,
}

Constant

Scalar => ScalarValue, remove length.

pub struct ConstantMetadata {
    scalar: Scalar,
    length: usize,
}

Bool

Remove length.

pub struct BoolMetadata {
    validity: ValidityMetadata,
    length: usize,
    bit_offset: usize,
}

VarBinView

pub struct VarBinViewMetadata {
    validity: ValidityMetadata,
    data_lens: Vec<usize>,
}

danking avatar Sep 30 '24 19:09 danking

Three relevant PRs:

  1. https://github.com/spiraldb/vortex/pull/956
  2. https://github.com/spiraldb/vortex/pull/955
  3. https://github.com/spiraldb/vortex/pull/951

danking avatar Oct 02 '24 13:10 danking

This has been done across multiple PRs. I did an audit and all metadata looks minimal.

robert3005 avatar Oct 30 '24 11:10 robert3005