delta-rs

Liquid clustering

Open ion-elgreco opened this issue 1 year ago • 11 comments

Description

Use Case

To my understanding, liquid clustering would share a lot of code paths with Z-order and would be part of optimize.

I think we only need to create a Rust UDF, similar to the Z-order one, that does Hilbert clustering.

I would need to do some more reading on the algorithm, but it could be low-hanging fruit, considering it likely shares a bunch of code paths.

Related Issue(s)

ion-elgreco • Jan 06 '24 21:01
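As a rough sketch of what Hilbert clustering means here: the function below maps a pair of already-normalized integer column values to a position on a 2-D Hilbert curve, and rows are then sorted by that position so that nearby values end up in the same files. The names `hilbert_index` and `cluster_rows` are hypothetical and not part of delta-rs; a real UDF would need to handle more than two columns and arbitrary column types.

```rust
/// Map a cell (x, y) on an n-by-n grid to its distance along a Hilbert curve.
/// Assumes `n` is a power of two and that x and y are both smaller than n.
fn hilbert_index(n: u32, mut x: u32, mut y: u32) -> u64 {
    let mut d: u64 = 0;
    let mut s = n / 2;
    while s > 0 {
        // Which quadrant of the current sub-square the point falls in.
        let rx = u32::from((x & s) > 0);
        let ry = u32::from((y & s) > 0);
        d += u64::from(s) * u64::from(s) * u64::from((3 * rx) ^ ry);
        // Rotate/flip the quadrant so the curve stays continuous.
        if ry == 0 {
            if rx == 1 {
                x = n - 1 - x;
                y = n - 1 - y;
            }
            std::mem::swap(&mut x, &mut y);
        }
        s /= 2;
    }
    d
}

/// Hypothetical helper: order two-column rows by their Hilbert index.
fn cluster_rows(n: u32, rows: &mut Vec<(u32, u32)>) {
    rows.sort_by_key(|&(x, y)| hilbert_index(n, x, y));
}
```

After sorting, consecutive rows are neighbours on the grid, which is the locality property that makes the ordering useful for data skipping.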

My understanding is that liquid clustering is not low-hanging fruit; it would require significant changes to how we write data. When writing data, the Hive-style convention is followed, where partition values are stored in the path and are not written to the physical Parquet files. Liquid clustering discards the Hive-style conventions, so we will need to accommodate that.

Blajda • Jan 06 '24 21:01

@Blajda ah, my bad, then I misunderstood the complexity of the design document. I thought it was similar to Z-order, as in using algorithm Y to collocate certain rows and then just writing without partitioning.

ion-elgreco • Jan 06 '24 21:01

> When writing data, the Hive-style convention is followed, where partition values are stored in the path and are not written to the physical Parquet files. Liquid clustering discards the Hive-style conventions, so we will need to accommodate that.

We shouldn't rely on the Hive-style paths at all in our codebase. Do we? The partition values are supposed to be read from the log, not the file path. To quote the protocol (emphasis mine):

> This directory format is only used to follow existing conventions and is not required by the protocol. Actual partition values for a file must be read from the transaction log.

wjones127 • Jan 06 '24 22:01
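To make the protocol point concrete, here is a minimal sketch that resolves a partition value from the log entry and never parses the file path. `AddAction` is a simplified, hypothetical stand-in for the `add` action (only `path` and `partitionValues` are modelled), not the delta-rs type.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical model of a Delta `add` action.
struct AddAction {
    /// File location; may or may not follow the Hive-style layout.
    path: String,
    /// Partition column name -> serialized value, as recorded in the log.
    partition_values: HashMap<String, String>,
}

/// Look up a partition value from the log entry, ignoring the path entirely.
fn partition_value<'a>(add: &'a AddAction, column: &str) -> Option<&'a str> {
    add.partition_values.get(column).map(String::as_str)
}
```

With this shape, a file stored at a flat path such as `part-00000.parquet` still resolves its partition values correctly, because nothing depends on `date=.../` style directories.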

@wjones127 Yes, I don't recall any explicit dependency on Hive-style paths. My primary concern is that tables that use liquid clustering do not allow partitions, so it may require some changes to the writers. It might be enough to disable partitioning on the table during creation and simply order the data along a Hilbert curve during the write.

Blajda • Jan 07 '24 00:01
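The "no partitions on clustered tables" constraint could be enforced at table creation with a check along these lines. This is only a sketch: `TableLayout`, `LayoutError`, and `validate_layout` are hypothetical names, not existing delta-rs APIs.

```rust
/// Hypothetical error for an invalid combination of layout options.
#[derive(Debug, PartialEq)]
enum LayoutError {
    PartitionedAndClustered,
}

/// Hypothetical description of how a new table lays out its data.
struct TableLayout {
    partition_columns: Vec<String>,
    clustering_columns: Vec<String>,
}

/// Reject tables that request both partitioning and clustering.
fn validate_layout(layout: &TableLayout) -> Result<(), LayoutError> {
    if !layout.clustering_columns.is_empty() && !layout.partition_columns.is_empty() {
        return Err(LayoutError::PartitionedAndClustered);
    }
    Ok(())
}
```

A writer that passes this check for a clustered table could then skip the partition-directory logic and only apply the clustering order before writing files.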

Liquid Clustering has no proper public "specification", so the comedy option here is that we could implement this before Delta/Spark has it outside of the proprietary DBR :laughing: :clown_face:

rtyler • Jan 07 '24 00:01

> Liquid Clustering has no proper public "specification", so the comedy option here is that we could implement this before Delta/Spark has it outside of the proprietary DBR :laughing: :clown_face:

That would actually be pretty hilarious 😂

ion-elgreco • Jan 07 '24 00:01

> Liquid Clustering has no proper public "specification", so the comedy option here is that we could implement this before Delta/Spark has it outside of the proprietary DBR

FWIW there is a design doc up with some details, but I don't think it's enough detail for us to implement something compatible with the existing Databricks implementation.

https://github.com/delta-io/delta/issues/1874

https://docs.google.com/document/d/1FWR3odjOw4v4-hjFy_hVaNdxHVs4WuK1asfB6M6XEMw/edit#heading=h.skpz7c7ga1wl

wjones127 • Jan 07 '24 04:01

@wjones127 how about this commit? https://github.com/andreaschat-db/delta/commit/2f33bf680a63b8070fac91561d035e93088c4f73

ion-elgreco • Jan 07 '24 11:01

I'm more interested in this on the reader side than the writer.

jabbera • Aug 07 '24 12:08

> I'm more interested in this on the reader side than the writer.

For this I think we need v2 checkpoint support, as this gets enabled by liquid clustering. On the data side, though, there should be no additional changes needed. Unfortunately, rewriting our checkpointing is a bit of a bigger effort.

roeap • Aug 07 '24 21:08
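On the reader side, the gap described above amounts to a feature check against the table's protocol action. A minimal sketch, assuming the reader features are available as plain strings and that the v2 checkpoint feature is named `v2Checkpoint` (my reading of the Delta protocol; worth verifying). `unsupported_reader_features` is a hypothetical helper, not a delta-rs function.

```rust
/// Return the reader features a table requires that this reader lacks.
/// Both lists are plain feature-name strings; the names are assumptions.
fn unsupported_reader_features<'a>(required: &[&'a str], supported: &[&str]) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|f| !supported.iter().any(|s| *s == *f))
        .collect()
}
```

A reader would refuse to open the table whenever the returned list is non-empty, which is what would happen for a clustered table until v2 checkpoints are supported.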