delta-rs
Liquid clustering
Description
Use Case
To my understanding, liquid clustering would share many code paths with Z-order and would be part of optimize.
I think we only need to create a Rust UDF, similar to the Z-order one, that does Hilbert clustering.
I would need to read more about the algorithm, but it could be low-hanging fruit, since it likely shares a bunch of code paths.
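For readers unfamiliar with the idea, here is a minimal sketch of the classic 2-D Hilbert curve mapping, which turns a pair of integer coordinates into a single sort key so that rows close together in both columns end up close together in the output. This is not the delta-rs Z-order UDF and not the Databricks implementation; the names `hilbert_index` and `sort_by_hilbert` are hypothetical, and the sketch assumes the two clustering columns have already been mapped to integers in `[0, n)` with `n` a power of two.

```rust
/// Map a point (x, y) on an n x n grid to its position along the Hilbert curve.
/// Assumes `n` is a power of two and both coordinates are in `[0, n)`.
fn hilbert_index(n: u32, mut x: u32, mut y: u32) -> u64 {
    let mut d: u64 = 0;
    let mut s = n / 2;
    while s > 0 {
        // Which quadrant (at this scale) does the point fall into?
        let rx = u32::from((x & s) > 0);
        let ry = u32::from((y & s) > 0);
        d += (s as u64) * (s as u64) * (((3 * rx) ^ ry) as u64);

        // Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0 {
            if rx == 1 {
                x = n - 1 - x;
                y = n - 1 - y;
            }
            std::mem::swap(&mut x, &mut y);
        }
        s /= 2;
    }
    d
}

/// Hypothetical helper: order points by their Hilbert index,
/// the way a clustering pass might order rows before writing files.
fn sort_by_hilbert(n: u32, points: &mut [(u32, u32)]) {
    points.sort_by_key(|&(x, y)| hilbert_index(n, x, y));
}
```

A real UDF would additionally need to handle more than two columns and non-integer types, which this sketch leaves out.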
Related Issue(s)
My understanding is that liquid clustering is not low-hanging fruit; it would require significant changes to how we write data. When writing data, the hive-style convention is followed: partition values are stored in the path and are not written to the physical Parquet files. Liquid clustering discards the hive-style conventions, so we will need to accommodate that.
@Blajda ah, my bad, then I misunderstood the complexity of the design document. I thought it was similar to Z-order, as in using algorithm Y to collocate certain rows and then just writing without partitioning.
When writing data the hive-style convention is followed where partition values are stored in the path and partition values are not written to the physical parquet files. With liquid they discard the hive-style conventions so we will need to accommodate that.
We shouldn't rely on the Hive-style paths at all in our codebase. Do we? The partition values are supposed to be read from the log, not the file path. To quote the protocol (emphasis mine):
This directory format is only used to follow existing conventions and is not required by the protocol. Actual partition values for a file must be read from the transaction log.
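To illustrate the point from the protocol, here is a small sketch of looking up a partition value from an `add` action rather than parsing it out of the file path. The `AddAction` struct is a simplified, hypothetical stand-in that models only the `path` and `partitionValues` fields of the action; it is not the delta-rs type, and `partition_value` is an illustrative helper name.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical model of a Delta `add` action.
/// Only the two fields relevant to this discussion are included.
struct AddAction {
    /// File path relative to the table root; need not be hive-style.
    path: String,
    /// Partition column name -> value; `None` models a null partition value.
    partition_values: HashMap<String, Option<String>>,
}

/// Look up a partition value from the log entry, never from `add.path`.
fn partition_value<'a>(add: &'a AddAction, column: &str) -> Option<&'a str> {
    add.partition_values
        .get(column)
        .and_then(|value| value.as_deref())
}
```

With this approach a file named `part-00000-abc.snappy.parquet` at the table root can still carry `date = 2024-01-01` as a partition value, because the value comes from the log entry and not from a `date=2024-01-01/` directory.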
@wjones127 Yes, I don't recall any explicit dependency on hive-style paths. My primary concern is that tables that use liquid clustering do not allow partitions, hence it may require some changes from the writers. It might be enough to disable partitioning on the table during creation and simply apply Hilbert-curve ordering during the write.
Liquid Clustering has no proper public "specification", so the comedy option here is that we could implement this before Delta/Spark has it outside of the proprietary DBR :laughing: :clown_face:
That would be actually pretty hilarious 😂
Liquid Clustering has no proper public "specification", so the comedy option here is that we could implement this before Delta/Spark has it outside of the proprietary DBR
FWIW there is a design doc up with some details, but I don't think it's enough detail for us to implement something compatible with the existing Databricks implementation.
https://github.com/delta-io/delta/issues/1874
https://docs.google.com/document/d/1FWR3odjOw4v4-hjFy_hVaNdxHVs4WuK1asfB6M6XEMw/edit#heading=h.skpz7c7ga1wl
@wjones127 how about this commit? https://github.com/andreaschat-db/delta/commit/2f33bf680a63b8070fac91561d035e93088c4f73
I'm more interested in this on the reader side than the writer.
For this I think we need v2 checkpoint support, as this gets enabled by liquid clustering. On the data side, though, there should be no additional changes needed. Unfortunately, rewriting our checkpointing is a bit of a bigger effort.