Roadmap for high quality compression?

Open mwlon opened this issue 8 months ago • 13 comments

Thanks for this project. I'm wondering if catering to high compression ratio is on the roadmap, e.g. for long-term storage or network-bandwidth-limited workloads?

If so, from skimming the code, the main blocker I see is the default layout, which currently repartitions to blocks of 8192 rows prior to compression. For a high compression ratio, one would rather repartition to something like 1MB or more (say, 2^18 rows), "train" a codec on that data to produce, say, a zstd dictionary or pcodec chunk metadata (call this "the metadata"), write the metadata, and then write the 8192-row blocks with stats after it, leveraging the metadata to compress each block.

Does this make sense, or do I misunderstand the default layout?
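
Roughly the scheme I have in mind, sketched with the zstd crate (API names from memory, so they may not be exact; for pcodec the "metadata" would be chunk metadata rather than a dictionary):

use std::io;

// Train "the metadata" (here a zstd dictionary) on the whole ~1MB partition,
// write it once, then compress each 8192-row block against it.
fn compress_partition(blocks: &[Vec<u8>]) -> io::Result<(Vec<u8>, Vec<Vec<u8>>)> {
    // Train on all blocks of the partition; 16KB is an arbitrary dictionary budget.
    let dict = zstd::dict::from_samples(blocks, 16 * 1024)?;

    // Compress each block with the shared dictionary.
    let mut compressor = zstd::bulk::Compressor::with_dictionary(3, &dict)?;
    let compressed = blocks
        .iter()
        .map(|block| compressor.compress(block))
        .collect::<io::Result<Vec<_>>>()?;

    // The caller writes `dict` once, then the per-block frames (with their stats).
    Ok((dict, compressed))
}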

mwlon avatar Apr 12 '25 22:04 mwlon

The default layout repartitions to the next multiple of 8K rows above 1MB of uncompressed data. (You should be able to verify this with the 'cargo vx' command line)

But you're right that the default BtrBlocksCompressor tries to strike a balance between decompression speed and size, versus being optimised for size alone.

Edit: sorry, I see now where you got the 8k rows. Regarding stats, we do first repartition to exactly 8K rows, compute a zone map, then repartition again as per the above 1MB requirement. So the zone stats aren't tied to the physical chunking of the data.
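
In other words, something like this (just to illustrate the rule; not the actual layout code):

// Next multiple of 8_192 rows whose uncompressed size reaches ~1 MiB.
fn rows_per_block(uncompressed_bytes_per_row: usize) -> usize {
    const TARGET_BYTES: usize = 1 << 20; // 1 MiB
    const ROW_MULTIPLE: usize = 8_192;
    let min_rows = TARGET_BYTES.div_ceil(uncompressed_bytes_per_row);
    min_rows.div_ceil(ROW_MULTIPLE) * ROW_MULTIPLE
}

fn main() {
    assert_eq!(rows_per_block(8), 131_072); // u64 column: already a multiple of 8K
    assert_eq!(rows_per_block(3), 352_256); // 3 bytes/row: rounded up to the next 8K multiple
}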

gatesn avatar Apr 12 '25 23:04 gatesn

Aha, I see that now. In that case, maybe one could slap zstd or pcodec on, without any extra metadata business! Are you all planning to add anything like this in cases when the user prefers heavier compression? If not, would the addition be welcome?

mwlon avatar Apr 13 '25 10:04 mwlon

Yep, so we've left open a slot for per-segment encryption and general purpose compression in the footer that we plan to support soon: https://github.com/spiraldb/vortex/blob/develop/vortex-flatbuffers/flatbuffers/vortex-file/footer.fbs

But separately, we're digging into PCodec a little more to find out 1) whether block-level compression would give us satisfactory random access, 2) whether pcodec's strategy selection can help improve compression speed (we currently spend a lot of time computing unique counts, even though dict encoding isn't very useful for numeric data), and 3) whether we should just ship a pcodec encoding.

I wonder if you have had any thoughts on the above?

We also hope to have pre-defined use-cases that come with a good set of encodings and layouts. For example, scientific data may benefit from a different set of encodings to analytical data, since we probably care more about compression ratio than we do about random access.

gatesn avatar Apr 13 '25 14:04 gatesn

I can partially answer those:

  1. Pco decompresses around 1-4GB/s per CPU core. If that's too slow, I'd recommend an approach like what I described before: write the Pco chunk metadata for the 1MB block somewhere (you could treat this similarly to dictionary encoding's list of unique values), and then use it to decompress smaller Pco pages, so that you only need to decompress, say, 4096 values to get the one you care about. Your wrapping format just needs to supply the page start/end byte ranges for this approach (rough sketch after this list).
  2. I suspect the flow chart for numeric data types should be
match what_user_wants {
    SuperFast => ..., // use lightweight encodings; probably skip dictionary altogether since it's just numeric?
    GoodCompression => ..., // use pco
}

The one thing that might help the lightweight encodings is reusing Pco's automatic mode selection (if Pco were to expose that in the crate somehow). For that step, Pco does sample the data, but it's able to just use statistical techniques on the sample instead of actually compressing it. This would have implications for your downstream encodings and might improve their compression ratios in some cases. More details in the upcoming paper.

  3. Likely!
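
To make (1) concrete, here's roughly the shape of it. The struct, field names, and decompress_page below are all made up for illustration; the real calls would go through Pco's wrapped API:

use std::ops::Range;

// One ~1MB chunk: metadata written once, then many small compressed pages
// whose byte ranges the wrapping format records.
struct PcoChunk {
    chunk_meta: Vec<u8>,            // Pco chunk metadata ("the metadata")
    page_ranges: Vec<Range<usize>>, // start/end byte range of each page
    page_bytes: Vec<u8>,            // concatenated compressed pages, e.g. 4096 values each
    rows_per_page: usize,
}

// Stand-in for Pco's wrapped-format page decompression (not a real pco function).
fn decompress_page(_chunk_meta: &[u8], _page: &[u8]) -> Vec<i64> {
    unimplemented!()
}

impl PcoChunk {
    // Random access: decompress only the page containing `row`.
    fn get(&self, row: usize) -> i64 {
        let page = &self.page_bytes[self.page_ranges[row / self.rows_per_page].clone()];
        decompress_page(&self.chunk_meta, page)[row % self.rows_per_page]
    }
}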

mwlon avatar Apr 13 '25 15:04 mwlon

@gatesn Would you be open to PRs for pcodec and zstd encodings?

mwlon avatar Apr 30 '25 20:04 mwlon

Absolutely for PCodec. I have a sort of background goal of reducing the surface area required to implement a new encoding; it's a little messy right now, but the existing ones should give you a good jumping-off point.

For ZStd, I'd rather enable compression at the segment level. This is something we're tracking already and should be able to bring forwards. We can then configure it for metadata / data only, or for specific columns, etc.

gatesn avatar Apr 30 '25 20:04 gatesn

With tree-style encodings, why treat general purpose codecs differently? Aren't they effectively just bytes->bytes encodings?

mwlon avatar May 01 '25 12:05 mwlon

Even with a tree of arrays, we need some sort of leaf node - in our case ByteBuffers.

So while e.g. a PrimitiveArray could in theory store its data in a u8-typed ZstdArray, what would ZstdArray store its data in?

Given our premise places a lot of focus on lightweight encodings, it feels reasonable to say that general purpose compression operates at the buffer level. Or at least, we can make it easy to enable/disable that. There really isn't anything preventing you from writing a ZstdArray, though; you would just need your own arrays to store bytes in child arrays rather than buffers.
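
In toy form, the distinction is something like this (made-up types, not our actual definitions):

// Today, encodings bottom out in raw byte buffers, so general purpose
// compression most naturally applies at the buffer level. A ZstdArray-style
// encoding would instead keep its compressed bytes in a u8 child array.
type ByteBuffer = Vec<u8>;

enum Array {
    // Leaf-style encoding: data lives directly in buffers.
    Primitive { buffer: ByteBuffer },
    // Lightweight encodings compose over child arrays.
    Dict { values: Box<Array>, codes: Box<Array> },
    // Hypothetical general purpose compression as an encoding: its "data"
    // is itself another (u8) array rather than a buffer.
    Zstd { compressed: Box<Array> },
}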

gatesn avatar May 01 '25 12:05 gatesn

I'd think that there should be a single terminal/leaf encoding, and that it should be the only one directly containing a buffer. (Equivalently, one could have a dynamic choice of whether each encoding's child is a buffer or another array, but that's probably messier.) In this way, any type-compatible tree would be possible, and you wouldn't need a special "compression" feature. To me, this seems simpler and more natural.

But that's just aesthetics, and I respect any choice you make in that department. My practical concern is that this will create a difference in behavior between u8 compressions and compressions that take any other data type (like pco). They both output opaque bytes, should often be decompressed in memory, and should be treated the same by vortex. Will this be possible with the planned design? I think very often the strongest reasonable compression strategy is to apply zstd for string-like data and pco for numerical data, which might be bizarre to configure if they are treated as a compression and an encoding, respectively.

mwlon avatar May 01 '25 17:05 mwlon

I think I see where you're going...

This is because PCodec is largely just a set of transformations that make the data more amenable to general purpose compression? For example, twiddle some bits and then zstd?

Whereas all of our encodings thus far essentially have bit-packed integers as their terminal node. Which of course then gets stored in buffers that have no further compression (by default).

We went back and forth on this a little bit: whether we should distinguish between a u8 that is semantically an integer and a u8 that is part of a byte array, where we wouldn't get very far trying to apply integer compression techniques.

Does that sort of make sense? Might explain the different angles we're approaching this from. Of course, I'm very open to fixing this if something needs to change!

gatesn avatar May 01 '25 18:05 gatesn

No, it's the opposite: pco fully compresses by itself. Applying zstd afterward would almost always be a waste of (de)compression time. Pco only compresses 16, 32, and 64 bit data types. Does my concern make sense now? If not, maybe it'll be easiest to set up a video chat about this next week.
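
For reference, a standalone pco round trip looks roughly like this (roughly; see the pco docs for the exact API). There's no general purpose compressor anywhere in the pipeline:

use pco::standalone::{simple_decompress, simpler_compress};
use pco::DEFAULT_COMPRESSION_LEVEL;

fn main() {
    // Pco compresses numerical data end-to-end on its own; adding zstd on top
    // of its output would usually just burn (de)compression time.
    let nums: Vec<i64> = (0..100_000).map(|i| i * i % 999).collect();
    let compressed = simpler_compress(&nums, DEFAULT_COMPRESSION_LEVEL).unwrap();
    let recovered: Vec<i64> = simple_decompress(&compressed).unwrap();
    assert_eq!(nums, recovered);
}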

mwlon avatar May 01 '25 18:05 mwlon

From the video call: I think our conclusion was that it makes sense to have encodings for pco and zstd, and compression options for non-array data (e.g. statistics). @gatesn please let me know when your refactor is done and I'll look into adding these encodings.

mwlon avatar May 17 '25 20:05 mwlon

The refactor for arrays is done. As you'll see, the basic idea is something like:

vtable!(Zstd)

impl VTable for ZstdVTable {
   ...
}

Hopefully there are sufficient docs to help implement the required functions. Let me know if you have any questions and I can try to answer ASAP.

Worth noting two things:

  • the current compressor isn't pluggable, so you'll have to either change it manually or change the layout strategy in VortexWriteOptions in order to use the new encodings when writing files.
  • you may want to just ignore validity in the first pass, or use the Validity helper struct (copying from PrimitiveArray for example)

gatesn avatar May 17 '25 20:05 gatesn

I'm going to start on this now.

mwlon avatar May 30 '25 17:05 mwlon

I have an incomplete Zstd branch here: https://github.com/vortex-data/vortex/compare/develop...mwlon:vortex:zstd-encoding?expand=1

A couple of things we could aim for, though I'm unsure whether to do them in this initial PR (both of which will apply to Pco as well):

  • Should we support zstd dictionaries and write smaller zstd frames for faster slice + indexing access? E.g. write multiple frames of 1k values that share a single dictionary, and only decompress the required frame(s) for each lookup (roughly the read path sketched after this list).
  • Should we do caching at this point? E.g. once we need to decompress the data once, store the decompressed version in the array's memory for any future access. I'm leaning against this for now, since we can easily add it in later without breaking anything.
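
For the first point, the read path would look something like this (zstd crate API from memory, so possibly not exact):

use std::io;

// Point lookup against multiple small frames sharing one dictionary:
// only the frame containing `row` gets decompressed.
fn get_row(
    dict: &[u8],
    frames: &[Vec<u8>], // compressed frames, each holding `rows_per_frame` rows
    rows_per_frame: usize,
    row_bytes: usize,   // uncompressed bytes per row
    row: usize,
) -> io::Result<Vec<u8>> {
    let mut decompressor = zstd::bulk::Decompressor::with_dictionary(dict)?;
    let frame = decompressor.decompress(&frames[row / rows_per_frame], rows_per_frame * row_bytes)?;
    let offset = (row % rows_per_frame) * row_bytes;
    Ok(frame[offset..offset + row_bytes].to_vec())
}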

mwlon avatar Jun 03 '25 17:06 mwlon

This is a great start, thank you. Couple of notes/questions:

  • Does it only support primitive types?
  • nbytes is a bit funky honestly; I would grab uncompressed_len from the buffer directly. But if it's primitive-only, you can get away with just storing the length (in which case you also don't need serialized metadata, since you're given the array len during deserialization).
  • slice should be O(1), so you may need to hold offset/len in memory.
  • agree on the cached decompression; we're considering adding this in other places, e.g. list offsets.

I somewhat defer to you on splitting into chunks/dictionaries. We version encodings by identifier, so we could add a zstd.chunked encoding later, or we could add it in a backwards-compatible way, but we'd probably want to at least design it now, even if we punt on the implementation.

gatesn avatar Jun 03 '25 17:06 gatesn

primitive types: I think so? E.g. I think it should be possible to apply Dict encoding, then Zstd encode the values and indices (each a primitive array), but it wouldn't make sense to Zstd encode the DictArray directly. Does that mesh with your understanding, or am I missing something?

nbytes, slice, caching: Sounds good, will take this into account.

chunking: It would be nice if these were the same encoding. How backward compatible do you need it to be? If we implement the metadata now but error out when someone actually tries to decode a chunked array, I suppose that's not backward compatible? It feels like designing a solution here is 80% of the work, so I might just implement something appropriate (allow chunks of n rows).

mwlon avatar Jun 03 '25 19:06 mwlon

I've completed the PR: https://github.com/vortex-data/vortex/pull/3456

mwlon avatar Jun 04 '25 21:06 mwlon

Both Zstd and Pco are in, so I'm renaming this issue to be more specific and closing. Will start another for making them easily usable as a compression strategy.

mwlon avatar Jun 23 '25 18:06 mwlon