
Lance File Format Version 2 (technically v0.3)

Open westonpace opened this issue 1 year ago • 15 comments

I have been investigating potential changes to the Lance file format. These changes are motivated by a number of factors, but the highlights are:

  • Allow encodings to change on a per-page basis
  • Get rid of row groups because there is no way to specify an ideal row group size when working with large data
  • Allow for writing data one column at a time in addition to one row group at a time
  • Allow columns to be different sizes (this will make it possible to use Lance files in places we can't use them today, such as to help with shuffling)
  • Allow more flexibility in where metadata is stored

This change will not be a single PR. I'm creating a parent task to track the work that needs to be done.

Initially, these changes will not be enabled at all (e.g., nothing will use a v2 writer by default).

Complete implementation

  • [x] https://github.com/lancedb/lance/issues/1959
  • [x] Take schema metadata from the dataset schema, not from batches (test_filter_parsing)
  • [ ] Projection support in python (with nested projection)
  • [ ] Complete pushdown statistics integration
  • [x] #2347
  • [x] https://github.com/lancedb/lance/issues/1957
  • [x] #2593
  • [x] #2333
  • [ ] Frame of reference encoding

Switchover

  • [x] Introduce concept of "max writer version" to lance datasets
  • [x] Add writer version to v2 writer which will control what encodings are used
  • [x] Add a migration command that will allow a dataset to switch over to v2 files and/or upgrade the writer version
  • [x] Clear documentation around version capabilities and benefits
  • [x] #2394

Columnar Encodings for Random Access

Design: https://docs.google.com/document/d/19QNZq7A-797CXt8Z5pCrDEcxcRxEE8J0_sw4goqqgIY/edit?usp=sharing

  • [ ] Breakdown into list of tasks

Benchmarking

  • [x] https://github.com/lancedb/lance/pull/2383
  • [ ] Python level take scan parameterized by (# columns, batch size, data types, metadata cached)
  • [ ] The above, but with very large items
  • [ ] The above, but with filter pushdown (e.g. statistics)
  • [ ] Few columns from very many columns (using true column projection)

Low Priority

  • [ ] https://github.com/lancedb/lance/issues/1960
  • [ ] Union type
  • [ ] Add support for fixed size list as a logical encoding
  • [ ] Potentially new encodings
    • [ ] Compressed bitmap encoding
    • [ ] Per-column dictionary encoding
    • [ ] Delta encoding
  • [ ] Make it possible for users to supply their own encodings
  • [ ] Add example showing how to create a custom encoding
  • [ ] Allow specifying readahead as bytes instead of rows

westonpace avatar Feb 08 '24 13:02 westonpace

Hey @westonpace, I'm intrigued by the v2 format and I'm looking into adding support for general compression. I'd like to explore the possibility of encoding each page's buffer with zstd compression, similar to Arrow IPC's record batch body buffer compression. However, Lance's v2 format seems to offer more flexibility, as different fields may use different page sizes.

I've taken a look at the code and glanced over the current implementation. It seems that logical encoders like PrimitiveFieldEncoder are hardcoded to use the physical encoder ValueEncoder internally. I believe the "General compression" encoder would be a type of physical encoder, but I'm unsure how to integrate this new physical encoder into ValueEncoder. Do you have any guidance on this? Additionally, do you think it's the right time to pursue such an enhancement, considering that this part of the codebase is still actively being developed? Thanks for any insights you can provide.

niyue avatar May 13 '24 04:05 niyue

Hello again @niyue :)

It seems that logical encoders like PrimitiveFieldEncoder are hardcoded to use the physical encoder ValueEncoder internally

You're right that there is a piece missing at the moment. There will need to be some kind of extensible "encoding picker" API. Such a component typically calculates some basic statistics to figure out which encoding would be best to apply. For example, encodings like RLE are often only applied if there is a small range of possible values. I think we will also want some kind of mechanism for user configuration, but I'm not entirely sure what shape that will take yet (maybe field metadata). For now, I think we can probably choose whether or not to apply general compression based on an environment variable, then hook it into the configuration mechanism later, once it's been developed. So, if the environment variable is set, all value buffers will have general compression applied; if it is not set, no buffers will.
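
Something like this minimal sketch, assuming a hypothetical LANCE_GENERAL_COMPRESSION environment variable and a stand-in CompressionChoice enum (neither is an existing Lance API):

```rust
enum CompressionChoice {
    None,
    Zstd,
}

fn pick_value_compression() -> CompressionChoice {
    // If the variable is set (to anything), compress all value buffers;
    // otherwise leave them uncompressed.
    if std::env::var("LANCE_GENERAL_COMPRESSION").is_ok() {
        CompressionChoice::Zstd
    } else {
        CompressionChoice::None
    }
}
```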

Additionally, do you think it's the right time to pursue such an enhancement, considering that this part of the codebase is still actively being developed?

There will be some changes coming up. I had been planning on inviting others to help with encodings a little bit later (in maybe about a month). However, I think the actual API for physical encodings is pretty stable. If you want to make an attempt at adding compression I think it would be fine.

I believe the "General compression" encoder would be a type of physical encoder

I think you are right. The scheduler will be a bit interesting because we cannot determine the exact range to read when using general compression. So the page scheduler will simply need to load the entire range, and send the requested range to the decoder. Then, the decoder, after it applies the decompression, can select the parts that were asked for.

Longer term (can be a future PR) I would like a general compression encoding to be able to utilize a metadata buffer for a skip table. For example, even though we have an 8MB page we can compress it in 32KB chunks. We can record the number of values per chunk in the encoding description (e.g. if this is int32 we would have 8K values per chunk). For each chunk we can record the compressed size of the chunk. This would give us 256 sizes which should all be 16-bit values. We can then store this 512-byte buffer in one of the column metadata buffers. Then, during scheduling, if the user is asking for a specific row or small range of rows we can use this metadata buffer to figure out exactly which chunks we need to load, reducing the I/O for a small (512 byte) metadata cost.
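
As a rough sketch (illustrative names, not Lance's actual scheduler API), the skip table lets the scheduler turn a requested row range into a byte range within the page:

```rust
// `values_per_chunk` comes from the encoding description; `chunk_sizes` holds
// the compressed size of each 32KB chunk, read from a column metadata buffer.
fn chunks_for_rows(
    first_row: u64,
    num_rows: u64,
    values_per_chunk: u64,
    chunk_sizes: &[u16],
) -> (std::ops::Range<usize>, std::ops::Range<u64>) {
    let first_chunk = (first_row / values_per_chunk) as usize;
    let last_chunk = ((first_row + num_rows.saturating_sub(1)) / values_per_chunk) as usize;

    // Offset of the first needed chunk = sum of all compressed sizes before it.
    let start: u64 = chunk_sizes[..first_chunk].iter().map(|s| *s as u64).sum();
    let len: u64 = chunk_sizes[first_chunk..=last_chunk]
        .iter()
        .map(|s| *s as u64)
        .sum();

    // Chunks to decompress, and the byte range to schedule for I/O.
    (first_chunk..last_chunk + 1, start..start + len)
}
```

For the 8MB / 32KB example above, `chunk_sizes` would hold 256 entries, and a point lookup would usually touch a single 32KB chunk instead of the whole 8MB page.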

There are some pieces needed (the ability to store metadata buffers) that are not yet ready for this longer term feature. I will be working on pushdown filtering soon and I expect the pieces we need will get developed then. However, I wanted to share where my thinking was on this.

westonpace avatar May 13 '24 13:05 westonpace

Thanks for the great suggestions.

we can hook it into the configuration mechanism later, once it's been developed

Do we have a rough roadmap for when this might be developed? I'll follow your suggestion to start with an environment variable-controlled approach. However, in my use case, I anticipate applying general compression to specific fields only, which means we'll need some user configuration mechanism eventually.

I would like a general compression encoding to be able to utilize a metadata buffer for a skip table

even though we have an 8MB page we can compress it in 32KB chunks

This is essentially what I'd like to achieve. Initially, I thought it could be accomplished by having a different data_cache_bytes write option for different columns, resulting in pages of varying sizes. However, your suggestion of employing a chunk-within-page approach has me reconsidering. In my scenario, I aim to accelerate random access while maintaining reasonable compression. Sometimes it's challenging to determine the optimal compression method, so having the option of general compression could be beneficial.

niyue avatar May 14 '24 03:05 niyue

Do we have a rough roadmap for when this might be developed? I'll follow your suggestion to start with an environment variable-controlled approach. However, in my use case, I anticipate applying general compression to specific fields only, which means we'll need some user configuration mechanism eventually.

Currently I'm planning on adding pushdown predicates and robustness testing this month, with the hope of making Lance v2 the default for Lance datasets by the end of the month.

After that I was planning on making the encodings more extensible, so that others could start developing encodings. I think adding configuration would be part of this work. So I would estimate it should be ready by the end of June.

I aim to accelerate random access while maintaining reasonable compression.

This is our goal as well :) Since most of our queries on the inference path are vector searches, we need to do a lot of "select X rows by offset", so point lookups are important. However, we want to balance this with full scans since those are very common in the training path.

My thinking is that bitpacking, frame of reference, and delta are good first encodings. It's pretty cheap to determine if they will be beneficial and there is no effect on random access. RLE, FSST, dictionary, and general compression are the next set. These do have some effect on random access but, if the chunk sizes are small enough, hopefully it won't be too significant. I also think various sentinel encodings are important because they avoid an IOP during a point lookup.
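
As a small illustration of why bitpacking and frame of reference are cheap to evaluate, the decision can be made from a block's min/max statistics alone (a sketch with illustrative names, not Lance internals):

```rust
// Bits needed per value after frame of reference (subtracting min).
fn packed_bit_width(min: u64, max: u64) -> u32 {
    let range = max - min;
    (64 - range.leading_zeros()).max(1)
}

// Bit-packing is worthwhile if the packed width beats the original width.
fn bitpacking_helps(min: u64, max: u64, original_width: u32) -> bool {
    packed_bit_width(min, max) < original_width
}
```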

I have others that are starting to help me on this encodings work and so it will probably happen in parallel with the things I mentioned above. Bitpacking was just opened today: https://github.com/lancedb/lance/pull/2333

westonpace avatar May 14 '24 05:05 westonpace

However, in my use case, I anticipate applying general compression to specific fields only, which means we'll need some user configuration mechanism eventually.

Do you think field metadata will be a good tool for users to specify this configuration? Or do you have any other idea?

westonpace avatar May 14 '24 05:05 westonpace

Thanks for the insight.

RLE, FSST, dictionary, and general compression are the next set

Experimenting with general compression is useful in my scenario, especially since it can be applied to all types of data, whether integer, float, or string. This flexibility could make Lance a viable format for my project, even without additional encodings. Currently, we use dictionary encoding for low-cardinality fields, and I may explore incorporating it later on. I also experimented with FSST previously, as documented here, but it seems better suited to short strings and fairly specific application domains.

Do you think field metadata will be a good tool for users to specify this configuration?

Using field metadata to specify configuration seems like a useful approach. In my project, we currently use Arrow IPC with multiple record batches to store a portion of the data. We aim to support both point queries and analytical queries that involve scanning large amounts of data. Currently, we chunk a field in an IPC file into multiple record batches, dynamically calculating the chunk size based on the average size of the field. To ensure the file is self-contained, we store the chunk size in the IPC file as custom metadata, which the IPC format natively supports, so readers can access the file without additional external metadata. The Lance v2 format appears more flexible, and I'm considering leveraging it to let different fields have different chunk sizes, improving the efficiency of random access to those fields. This is particularly important since some fields are large while others are trivial in size.
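
Roughly, that looks like the following sketch using arrow-rs (the chunk_size metadata key, the 1 MiB target, and the averaging heuristic here are illustrative, not a standard):

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow::array::{ArrayRef, BinaryArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::ipc::writer::FileWriter;
use arrow::record_batch::RecordBatch;

fn write_chunked(values: &[&[u8]], out: impl std::io::Write) -> Result<(), ArrowError> {
    // Derive rows-per-batch from the average value size, targeting ~1 MiB per batch.
    let avg = values.iter().map(|v| v.len()).sum::<usize>() / values.len().max(1);
    let chunk_size = ((1_usize << 20) / avg.max(1)).max(1);

    // Record the chunk size as custom schema metadata so the file is self-contained.
    let metadata = HashMap::from([("chunk_size".to_string(), chunk_size.to_string())]);
    let field = Field::new("payload", DataType::Binary, false);
    let schema = Arc::new(Schema::new_with_metadata(vec![field], metadata));

    let mut writer = FileWriter::try_new(out, &schema)?;
    for chunk in values.chunks(chunk_size) {
        let array: ArrayRef = Arc::new(BinaryArray::from_iter_values(chunk.iter().copied()));
        writer.write(&RecordBatch::try_new(schema.clone(), vec![array])?)?;
    }
    writer.finish()
}
```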

niyue avatar May 14 '24 06:05 niyue

Regarding sentinel encoding for nulls: for the boolean datatype, I guess we can choose whatever value is not false or true. For datatypes like timestamp, Date32, Date64, Time32, Time64, Duration, and Interval, since these types use signed integers underneath and valid values are always non-negative, we can choose a negative value as the sentinel. But for other datatypes like int, uint, float, etc., how can we pick a sentinel value for them? Any insights @westonpace @niyue?

broccoliSpicy avatar May 16 '24 15:05 broccoliSpicy

for the boolean datatype, I guess we can choose whatever value is not false or true

Well boolean is difficult because we usually represent them as bits, so there's no value other than 0 or 1.

how can we pick a sentinel value for them?

I think during the encoding process we'll collect statistics for arrays, such as min, max, null count, and distinct count. These will be saved for page skipping, but also used to decide how to encode the page. An easy way to find a sentinel value would be max + 1 or min - 1, if these don't overflow. If neither works, we can either scan for an unused value or simply choose a bitmap null encoding.
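
For example, a minimal sketch of that choice for i32 data (a hypothetical helper, not an existing API):

```rust
/// Pick a sentinel for an i32 page from its min/max statistics.
/// Returns None if neither max + 1 nor min - 1 is representable, in which
/// case the writer could scan for an unused value or fall back to a bitmap
/// null encoding.
fn pick_sentinel(min: i32, max: i32) -> Option<i32> {
    max.checked_add(1).or_else(|| min.checked_sub(1))
}
```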

wjones127 avatar May 16 '24 19:05 wjones127

@westonpace

I have drafted a PR (https://github.com/lancedb/lance/pull/2368) to add support for compressing the value page buffer. Could you please review it to see if it fits well? And please let me know if a new issue should be opened for this PR.

As I am relatively new to Lance and Rust, there might be some mistakes in the PR. Please excuse any oversights. I am also uncertain if the current solution is the best fit for Lance. If it isn't, feel free to reject this PR. I am open to suggestions and willing to give it another try if we can figure out a better approach to address this issue. Thanks.

niyue avatar May 22 '24 10:05 niyue

I've cleaned up the task list, removing completed items and restructuring a bit. We have a pretty solid set of basic encodings. There are a few "completion tasks" that need to be done to round out the capabilities. At the same time, I have come up with a design for new struct/list encodings that better support random access. I plan to be working on this over the next month or two. I'd appreciate any feedback on the document: https://docs.google.com/document/d/19QNZq7A-797CXt8Z5pCrDEcxcRxEE8J0_sw4goqqgIY/edit?usp=sharing

CC @niyue / @broccoliSpicy who may be interested.

westonpace avatar Jul 18 '24 19:07 westonpace

so excited to see your ideas on struct/list encodings! @westonpace

broccoliSpicy avatar Jul 18 '24 19:07 broccoliSpicy

a few thoughts about the doc:

  1. During integer encoding, if we group integers in blocks of 1024, then when the bit width is >10 bits, it is guaranteed that we can find a sentinel through a constant * 1024 runs of XOR operations.
  2. Is there any reason we don't want to store two copies of the null bitmap? One in a separate page covering the whole column, for full-scan queries, and one in the same page as the page data, for random access. One benefit of doing so is that, during compression and decompression, we don't need to mask out the null bit in front of each value or change the data layout to accommodate it.
  3. For structs with only fixed-width fields, ideas like row groups may be applied inside the struct to accelerate queries like select struct.a + 3.14 from table.

broccoliSpicy avatar Jul 27 '24 23:07 broccoliSpicy

During integer encoding, if we group integers in blocks of 1024, then when the bit width is >10 bits, it is guaranteed that we can find a sentinel through a constant * 1024 runs of XOR operations.

Sorry, after rethinking this, I think it is not feasible using only a constant * 1024 runs of XOR operations; there might be many missing numbers.

broccoliSpicy avatar Jul 28 '24 19:07 broccoliSpicy

Sorry, after rethinking this, I think it is not feasible using only a constant * 1024 runs of XOR operations; there might be many missing numbers.

No worries. If we come up with a good algorithm at any point we can always plug it in.

Is there any reason we don't want to store two copies of the null bitmap? One in a separate page covering the whole column, for full-scan queries, and one in the same page as the page data, for random access. One benefit of doing so is that, during compression and decompression, we don't need to mask out the null bit in front of each value or change the data layout to accommodate it.

Unfortunately, I think we'd need to store two copies of the data, because even if we have the compressed null bitmap we still need to read the data, and it would have the null bit attached to it.

I do think we might not take the zipped nulls approach for integer / fp data. For example, if you have integers and you have bitpacking then, in most cases, I expect you will be able to store 1024 integers AND the compressed null bitmap for that block in less than one 4KB disk sector. So I expect zipped nulls will be most useful for larger data types and the overhead of unzipping should be fairly minor.
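
Rough arithmetic behind that expectation (illustrative; assumes the per-block null bitmap is stored uncompressed, which is the worst case):

```rust
// 1024 values bit-packed at `bit_width` bits each, plus a 1024-bit validity bitmap.
fn block_bytes(bit_width: usize) -> usize {
    (1024 * bit_width) / 8 + 1024 / 8
}
// block_bytes(20) == 2688 and block_bytes(31) == 4096, so the block fits in a
// single 4 KiB sector whenever bit-packing reduces int32 data below its full
// 32-bit width.
```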

For structs with only fixed-width fields, ideas like row groups may be applied inside the struct to accelerate queries like select struct.a + 3.14 from table.

Can you expand on this?

westonpace avatar Jul 29 '24 13:07 westonpace

Can you expand on this?

Sorry for the delayed response; I will find some time to read the doc a few more times and get back to you.

broccoliSpicy avatar Aug 01 '24 16:08 broccoliSpicy