
Allow metastore tables in `delta` format to be non-partitioned

Open yruslan opened this issue 4 months ago • 0 comments

Background

Partitioning Delta Lake tables can make reads less efficient, especially for small tables. See https://delta.io/blog/pros-cons-hive-style-partionining/ and https://delta.io/blog/2023-06-03-delta-lake-z-order/ for details.

This feature adds a flag that makes a metastore table non-partitioned. The information date column should still be added, probably as the first column, so that Z-order or Liquid Clustering can take effect.
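The column placement described above could be sketched as follows. This is a minimal illustration in Python, not Pramen code: the function name `info_date_first` and the default column name `info_date` (taken from the example config in this issue) are assumptions.

```python
from typing import List


def info_date_first(columns: List[str], info_date_column: str = "info_date") -> List[str]:
    """Return the column order with the information date column moved to the front.

    Hypothetical helper: it only reorders names and expects the caller to have
    already added the information date column to the data.
    """
    if info_date_column not in columns:
        raise ValueError(f"column '{info_date_column}' is missing")
    # Keep the relative order of all other columns unchanged.
    others = [c for c in columns if c != info_date_column]
    return [info_date_column] + others
```

With a Spark DataFrame `df`, the resulting order could be applied before writing with something like `df.select(*info_date_first(df.columns))`.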

Feature

Allow metastore tables in `delta` format to be non-partitioned.

Example

pramen.metastore {
  tables = [
    {
      name = "table1"
      format = "delta"
      path = "s3://bucket/prefix/table1"
      partitioned = false
      zorder = [ "info_date", "id" ]
    }
  ]
}
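One way the proposed options might be resolved is sketched below in Python, purely for illustration. Everything here is an assumption rather than existing behavior: `partitioned` defaults to `true` so that current tables are unaffected, the flag is rejected for formats other than `delta`, and the default Z-order columns (`info_date`, `batchid`) are placeholder names.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping


@dataclass(frozen=True)
class DeltaLayout:
    partitioned: bool
    zorder: List[str]


def resolve_delta_layout(
    table: Mapping[str, Any],
    info_date_column: str = "info_date",  # assumed column name
    batch_id_column: str = "batchid",     # assumed column name
) -> DeltaLayout:
    """Resolve the proposed `partitioned` and `zorder` options for one table entry."""
    # Assumption: omitting the flag keeps today's partitioned layout.
    partitioned = bool(table.get("partitioned", True))
    if not partitioned and table.get("format") != "delta":
        raise ValueError("'partitioned = false' is only proposed for the delta format")

    zorder = list(table.get("zorder", []))
    if not zorder and not partitioned:
        # Default proposed in this issue: information date, then batch id.
        zorder = [info_date_column, batch_id_column]
    return DeltaLayout(partitioned=partitioned, zorder=zorder)
```

For the `table1` entry above, this would yield a non-partitioned layout with Z-order columns `info_date` and `id`. A partitioned table gets no default Z-order here because Delta Lake does not allow Z-ordering by a partition column.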

Proposed Solution

Solution Ideas

  1. Make sure the logic is the same when working with partitioned and non-partitioned delta tables
  2. Put the information date column at the beginning for non-partitioned tables so that it is part of Z-order or Liquid Clustering
  3. Make sure parallel writes to different info dates are not possible when the data is not partitioned by information date
  4. Use Z-order by the information date column and by the batchid column by default, but allow users to specify Z-order columns in the config.

yruslan, Oct 02 '24 10:10