pramen
Allow metastore tables in `delta` format to be non-partitioned
Background
Partitioning Delta Lake tables can actually make reads less efficient, especially for small tables. See https://delta.io/blog/pros-cons-hive-style-partionining/ and https://delta.io/blog/2023-06-03-delta-lake-z-order/
This feature adds a flag that makes a metastore table non-partitioned. The information date column should still be added to the data, probably as the first column, so that Z-order or Liquid Clustering can use it.
Feature
Allow metastore tables in `delta` format to be non-partitioned.
Example
pramen.metastore {
  tables = [
    {
      name = "table1"
      format = "delta"
      path = "s3://bucket/prefix/table1"
      partitioned = false
      zorder = [ "info_date", "id" ]
    }
  ]
}
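To illustrate what the two new options could translate to at write time, here is a minimal Python sketch (not Pramen code). It assumes that a non-partitioned table is overwritten one information date at a time using Delta's `replaceWhere` write option, and that `zorder` maps to an `OPTIMIZE ... ZORDER BY` statement. The function names `write_options` and `optimize_statement` are hypothetical.

```python
from datetime import date
from typing import Dict, List, Optional


def write_options(info_date_column: str, info_date: date) -> Dict[str, str]:
    """Writer options for overwriting a single information date.

    Assumption: the writer uses overwrite mode together with Delta's
    `replaceWhere` option, so only rows of the given date are replaced.
    """
    return {"replaceWhere": f"{info_date_column} = '{info_date.isoformat()}'"}


def optimize_statement(path: str, zorder: List[str]) -> Optional[str]:
    """Build an OPTIMIZE statement for a path-based Delta table.

    Returns None when no Z-order columns are given.
    """
    if not zorder:
        return None
    columns = ", ".join(zorder)
    return f"OPTIMIZE delta.`{path}` ZORDER BY ({columns})"


if __name__ == "__main__":
    print(write_options("info_date", date(2024, 1, 31)))
    print(optimize_statement("s3://bucket/prefix/table1", ["info_date", "id"]))
```

With the example config above, the second call would produce the statement to run after each write; whether to run it after every write or on a schedule is an open design choice.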
Proposed Solution
Solution Ideas
- Make sure the logic is the same when working with partitioned and non-partitioned delta tables
- Put the information date column at the beginning for non-partitioned tables so that it is part of Z-order or Liquid Clustering
- Make sure parallel writes to different information dates are not possible when the data is not partitioned by information date
- Use Z-order by the information date column and the batch ID column by default, but allow users to specify the Z-order columns in the config.
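The column-ordering and default Z-order ideas above can be sketched as follows. This is a Python illustration under stated assumptions, not the actual implementation: the column names `info_date` and `batch_id` and the helper names `resolve_zorder` and `order_columns` are hypothetical, and the real names would come from Pramen's configuration.

```python
from typing import List, Optional, Sequence

# Hypothetical column names used only for this illustration.
INFO_DATE_COLUMN = "info_date"
BATCH_ID_COLUMN = "batch_id"


def resolve_zorder(configured: Optional[Sequence[str]]) -> List[str]:
    """Return the user-specified Z-order columns, or the defaults.

    A missing or empty `zorder` setting falls back to the information
    date column followed by the batch ID column.
    """
    if configured:
        return list(configured)
    return [INFO_DATE_COLUMN, BATCH_ID_COLUMN]


def order_columns(columns: Sequence[str],
                  info_date_column: str = INFO_DATE_COLUMN) -> List[str]:
    """Move the information date column to the front of the schema.

    The column is added if it is not present yet; the relative order of
    all other columns is preserved.
    """
    rest = [c for c in columns if c != info_date_column]
    return [info_date_column] + rest


if __name__ == "__main__":
    print(resolve_zorder(None))
    print(resolve_zorder(["info_date", "id"]))
    print(order_columns(["id", "name", "info_date"]))
```

Keeping the default resolution in one place would let partitioned and non-partitioned tables share the rest of the write logic, in line with the first idea in the list.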