Audit serde of array metadata
We currently implement naive serde using Rust serde + flexbuffers by default. Many arrays can pack their metadata much more tightly. This is an overview issue to track auditing each one:
- [ ] Bool
- [ ] Chunked
- [ ] Constant
- [ ] Datetime
- [ ] Extension
- [ ] Primitive
- [ ] Sparse
- [ ] Struct
- [ ] VarBin
- [ ] VarBinView
- [ ] ALP
- [ ] Datetime Parts
- [ ] Dict
- [ ] BitPacking
- [ ] FoR
- [ ] FSST
- [ ] Delta
- [ ] RoaringInt
- [ ] RoaringBool
- [ ] RunEnd
- [ ] ZigZag
I'm gonna try making the Validity metadata for Structs much smaller.
We might eventually want to squeeze all metadata into 32 bits. We can reserve 0xffffffff to indicate that the metadata has spilled into a buffer.
I think we can spare 64 bits per encoding.
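A minimal sketch of how the reserved-sentinel idea could look, assuming a fixed 32-bit metadata slot per array. Only the value 0xffffffff comes from the comment above; `MetadataSlot`, `encode_slot`, and `decode_slot` are hypothetical names for illustration and not part of Vortex.

```rust
/// Reserved value meaning "metadata did not fit inline; read it from a buffer".
/// The value itself is from the discussion; the rest of this sketch is hypothetical.
pub const SPILLED: u32 = 0xffff_ffff;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetadataSlot {
    /// Metadata packed directly into the 32-bit slot.
    Inline(u32),
    /// Metadata stored out of line in a separate buffer.
    Spilled,
}

/// Encode a slot into its 32-bit form.
/// Returns `None` if an inline value collides with the reserved sentinel.
pub fn encode_slot(slot: MetadataSlot) -> Option<u32> {
    match slot {
        MetadataSlot::Inline(bits) if bits == SPILLED => None,
        MetadataSlot::Inline(bits) => Some(bits),
        MetadataSlot::Spilled => Some(SPILLED),
    }
}

/// Decode a raw 32-bit slot, treating the sentinel as "spilled".
pub fn decode_slot(raw: u32) -> MetadataSlot {
    if raw == SPILLED {
        MetadataSlot::Spilled
    } else {
        MetadataSlot::Inline(raw)
    }
}
```

The cost of the sentinel is that one inline bit pattern becomes unusable, so each encoding's packing scheme would need to guarantee it never produces all ones.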
For most arrays, validity metadata is just a single bit for whether or not a validity child is defined.
RunEnd
Remove length, DType => PType (it has to be an int).
pub struct RunEndMetadata {
validity: ValidityMetadata,
ends_dtype: DType,
num_runs: usize,
offset: usize,
length: usize,
}
ALP
pub struct ALPMetadata {
exponents: Exponents,
encoded_dtype: DType,
patches_dtype: Option<DType>,
}
RunEndBool
Remove length, DType => PType.
pub struct RunEndBoolMetadata {
start: bool,
validity: ValidityMetadata,
ends_dtype: DType,
num_runs: usize,
offset: usize,
length: usize,
}
RoaringInt
pub struct RoaringIntMetadata {
ptype: PType,
}
FoR
Scalar => ScalarValue, use self.dtype(). Buffer, BufferString, List should go into the Array buffer.
pub struct FoRMetadata {
reference: Scalar,
shift: u8,
}
Dict
DType => PType
pub struct DictMetadata {
codes_dtype: DType,
values_len: usize,
}
DateTimeParts
DType => PType.
pub struct DateTimePartsMetadata {
days_dtype: DType,
seconds_dtype: DType,
subseconds_dtype: DType,
}
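To illustrate why DType => PType shrinks the metadata, here is a rough sketch that stores the three part types as 4-bit tags in a single `u16`. It assumes the parts are always integer primitives; `IntPType`, its tag values, and the pack/unpack helpers are hypothetical stand-ins and do not reflect the actual Vortex `PType` definition.

```rust
/// Hypothetical stand-in for an integer-only primitive type tag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum IntPType {
    U8 = 0,
    U16 = 1,
    U32 = 2,
    U64 = 3,
    I8 = 4,
    I16 = 5,
    I32 = 6,
    I64 = 7,
}

impl IntPType {
    /// Map a 4-bit tag back to a type; unknown tags yield `None`.
    pub fn from_tag(tag: u8) -> Option<Self> {
        Some(match tag {
            0 => Self::U8,
            1 => Self::U16,
            2 => Self::U32,
            3 => Self::U64,
            4 => Self::I8,
            5 => Self::I16,
            6 => Self::I32,
            7 => Self::I64,
            _ => return None,
        })
    }
}

/// Pack the three part types into 12 bits of a u16 (4 bits per tag).
pub fn pack_ptypes(days: IntPType, seconds: IntPType, subseconds: IntPType) -> u16 {
    (days as u16) | (seconds as u16) << 4 | (subseconds as u16) << 8
}

/// Inverse of `pack_ptypes`; fails if any nibble is not a known tag.
pub fn unpack_ptypes(raw: u16) -> Option<(IntPType, IntPType, IntPType)> {
    Some((
        IntPType::from_tag((raw & 0xf) as u8)?,
        IntPType::from_tag(((raw >> 4) & 0xf) as u8)?,
        IntPType::from_tag(((raw >> 8) & 0xf) as u8)?,
    ))
}
```

A full `DType` can carry nullability and nested type information, none of which is needed when the child is known to be a primitive integer array.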
FSST
DType => PType.
pub struct FSSTMetadata {
symbols_len: usize,
codes_dtype: DType,
uncompressed_lengths_dtype: DType,
}
Null
remove len.
pub struct NullMetadata {
len: usize,
}
Primitive
pub struct PrimitiveMetadata {
validity: ValidityMetadata,
}
VarBin
DType => PType
pub struct VarBinMetadata {
validity: ValidityMetadata,
offsets_dtype: DType,
bytes_len: usize,
}
Delta
pub struct DeltaMetadata {
validity: ValidityMetadata,
deltas_len: usize,
offset: usize, // must be <1024
}
RoaringBool
Remove length
pub struct RoaringBoolMetadata {
length: usize,
}
BitPacked
Remove length.
pub struct BitPackedMetadata {
validity: ValidityMetadata,
bit_width: usize,
offset: usize, // Known to be <1024
length: usize, // Store end padding instead (also <1024)
has_patches: bool,
}
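As a sketch of what the tighter BitPacked layout might look like, the fields above fit in 29 bits if validity is reduced to a single "has validity child" bit, `bit_width` is capped at 64, and `length` is replaced by an end padding below 1024. The struct name, field order, and bit positions here are assumptions chosen for illustration, not the layout Vortex uses.

```rust
/// Hypothetical compact form of the BitPacked metadata.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PackedBitPackedMetadata {
    pub has_validity: bool, // 1 bit: is a validity child present?
    pub has_patches: bool,  // 1 bit
    pub bit_width: u8,      // 7 bits, assumed 0..=64
    pub offset: u16,        // 10 bits, must be < 1024
    pub end_padding: u16,   // 10 bits, must be < 1024
}

impl PackedBitPackedMetadata {
    /// Pack into the low 29 bits of a u32; `None` if a field is out of range.
    pub fn pack(&self) -> Option<u32> {
        if self.bit_width > 64 || self.offset >= 1024 || self.end_padding >= 1024 {
            return None;
        }
        Some(
            (self.has_validity as u32)
                | (self.has_patches as u32) << 1
                | (self.bit_width as u32) << 2
                | (self.offset as u32) << 9
                | (self.end_padding as u32) << 19,
        )
    }

    /// Unpack from the layout written by `pack`.
    pub fn unpack(raw: u32) -> Self {
        Self {
            has_validity: raw & 1 != 0,
            has_patches: (raw >> 1) & 1 != 0,
            bit_width: ((raw >> 2) & 0x7f) as u8,
            offset: ((raw >> 9) & 0x3ff) as u16,
            end_padding: ((raw >> 19) & 0x3ff) as u16,
        }
    }
}
```

Because the top three bits are always zero in this layout, a packed value can never equal 0xffffffff, so it would coexist with a reserved "spilled" sentinel.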
ByteBool
pub struct ByteBoolMetadata {
validity: ValidityMetadata,
}
ZigZag
pub struct ZigZagMetadata;
Extension
DType => PType
pub struct ExtensionMetadata {
storage_dtype: DType,
}
Struct
Remove length.
pub struct StructMetadata {
length: usize,
validity: ValidityMetadata,
}
Chunked
pub struct ChunkedMetadata {
num_chunks: usize,
}
Sparse
remove len, DType => PType, Scalar => ScalarValue.
pub struct SparseMetadata {
indices_dtype: DType,
// Offset value for patch indices as a result of slicing
indices_offset: usize,
indices_len: usize,
len: usize,
fill_value: Scalar,
}
Constant
Scalar => ScalarValue, remove length.
pub struct ConstantMetadata {
scalar: Scalar,
length: usize,
}
Bool
Remove length.
pub struct BoolMetadata {
validity: ValidityMetadata,
length: usize,
bit_offset: usize,
}
VarBinView
pub struct VarBinViewMetadata {
validity: ValidityMetadata,
data_lens: Vec<usize>,
}
Three relevant PRs:
- https://github.com/spiraldb/vortex/pull/956
- https://github.com/spiraldb/vortex/pull/955
- https://github.com/spiraldb/vortex/pull/951
This has been done across multiple PRs. I did an audit and all metadata now looks minimal.