Roadmap 2022 H1 (discussion)
This is the proposed Delta Lake 2022 H1 roadmap discussion thread. Below are the initially proposed items for the roadmap to be completed by June 2022. We will also be sending out a survey (we will update this issue with the survey) to get more feedback from the Delta Lake community!
Performance Optimizations
Based on the overwhelming feedback from the Delta Users Slack, Google Groups, Community AMAs (on Delta Lake YouTube), Delta Lake 2021H2 survey, and 2021H2 roadmap, we propose the following Delta Lake performance enhancements in the next two quarters.
Issue | Description | Target CY2022 |
---|---|---|
927 | OPTIMIZE (file compaction): Table optimize is an operation to rearrange the data and/or metadata to speed up queries and/or reduce the metadata size | Released in 1.2 |
923 | File skipping using columns stats: This is a performance optimization that aims at speeding up queries that contain filters (WHERE clauses) on non-partitionBy columns. | Released in 1.2 |
931 | Automatic data skipping using generated columns: Enhance generated columns to include automatic data skipping | Released in 1.2 |
1134 | OPTIMIZE ZORDER: Data clustering via multi-column locality-preserving space-filling curves with offline sorting. | Q3/Q4 |
| | MERGE Performance Improvements: We will be providing a project improvement plan (PIP) document shortly on the proposed design for discussion. | Q2/Q3 |
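For intuition on the Z-ordering item above: Z-ordering sorts rows by a value that interleaves the bits of several columns, so rows that are close in every dimension tend to land in the same files. This toy Python sketch (purely illustrative, not Delta's actual implementation) shows the interleaving for two integer columns:

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-value.

    Sorting rows by this value keeps nearby (x, y) pairs close together,
    which is the locality property OPTIMIZE ZORDER exploits for skipping.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return z

# Sorting by the Z-value clusters points that are close in both dimensions.
points = [(3, 1), (0, 0), (2, 2), (1, 3)]
ordered = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
```

Files written in this order carry tight min/max stats on both columns, which is what makes the data skipping effective.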
Schema Operations
For this year, our focus will be on column mapping.
Issue | Description | Target CY2022 |
---|---|---|
958 | Support for renaming column: Rename column with ALTER TABLE | Released in 1.2 |
957 | Support for arbitrary column names: Support characters in column names not allowed by Parquet | Released in 1.2 |
| | Support for dropping columns: Drop column with ALTER TABLE | Q2 |
348 | Support for dynamic partition overwrite: Currently you can overwrite using the replaceWhere option, but in various scenarios it is more convenient to specify the partitions to overwrite. | Q2 |
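To make the dynamic partition overwrite request (issue 348) concrete, here is a toy Python model of the intended semantics; the helper function and sample data are hypothetical, not Delta's API:

```python
def dynamic_partition_overwrite(table: dict, incoming: dict) -> dict:
    """Toy model of dynamic partition overwrite semantics.

    `table` and `incoming` map a partition value to its rows. Only the
    partitions that appear in the incoming batch are replaced; all other
    partitions are left untouched (unlike a full overwrite, which would
    drop them, or replaceWhere, which needs an explicit predicate).
    """
    result = dict(table)
    result.update(incoming)
    return result

existing = {"2022-01-01": ["a", "b"], "2022-01-02": ["c"]}
batch = {"2022-01-02": ["c2"]}
merged = dynamic_partition_overwrite(existing, batch)
# "2022-01-01" is kept; "2022-01-02" is replaced wholesale.
```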
Integrations
Extending from the recent releases of PrestoDB, Hive 3, and Delta Sink for Apache Flink Streams API, we have additional integrations planned.
Issue | Description | Target CY2022 |
---|---|---|
112 | Delta Source for Apache Pulsar: Build a Pulsar/Delta reader leveraging Delta Standalone. Join us via the Delta Users Slack #connector-pulsar channel. | Q2 |
238 | Flink Sink on Table API: Build a Flink/Delta sink (i.e., Flink writes to Delta Lake) using the Apache Flink Table API. Join us via the Delta Users Slack #flink-delta-connector channel and we have bi-weekly meetings on Tuesdays. | Q2/Q3 |
110 | Delta Source for Apache Flink: Build a Flink/Delta source (i.e., Flink reads from Delta Lake) leveraging Delta Standalone. Join us via the Delta Users Slack #flink-delta-connector channel and we have bi-weekly meetings on Tuesdays. | Q2/Q3 |
82 | Delta Source for Trino: Joint Delta Lake and Trino community collaboration on the following PRs: 10987, 10300. This is a community effort and all are welcome! Join us via the Delta Users Slack #trino channel; we have bi-weekly meetings on Thursdays. | Released |
| | Delta Source for BigQuery: Allows BigQuery to natively read Delta Lake tables. | Q2/Q3 |
523, 566 | Delta Rust Writer: Extending Delta Rust API to write to Delta Lake. | Q2/Q3 |
| | Hive/Delta writer: Extending Hive to write to Delta Lake | Q3 |
Operations Enhancements
Two very popular requests are planned for this half: Table Restore and S3 multi-cluster writes.
Issue | Description | Target CY2022 |
---|---|---|
903, 863 | Table Restore: Rollback to a previous version of a Delta table using Python, Scala, and/or SQL APIs. | Released in 1.2 |
41 | S3 multi-cluster writes: Allows multiple clusters/drivers/JVMs to concurrently write to S3 using DynamoDB as the lock store. Please refer to this PIP: [2021-12-22] Delta OSS S3 Multi-Cluster Writes | Released in 1.2 |
747 | delta.io.Guide: Enhance the Delta Lake documentation by creating a new guide (PIP will follow soon) | Q2/Q3 |
| | Iceberg to Delta Converter: Ability to convert an Iceberg table to a Delta table without a rewrite. | Q3 |
| | Table Cloning: Clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow: deep clones copy over the data from the source; shallow clones do not. | Q3 |
1105 | Change Data Feed: The Delta change data feed represents row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records “change events” for all the data written into the table. | Q2 |
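For the Change Data Feed row above, here is a toy sketch of the kind of row-level change events a change feed exposes, modeled as a diff of two table versions keyed by primary key. The function and sample data are illustrative only; Delta's actual CDF records the events at write time rather than diffing versions after the fact:

```python
def change_events(before: dict, after: dict) -> list:
    """Toy diff of two table versions keyed by primary key, emitting
    row-level change events of the kind a change data feed records."""
    events = []
    for key in sorted(before.keys() | after.keys()):
        if key not in after:
            events.append(("delete", key, before[key]))
        elif key not in before:
            events.append(("insert", key, after[key]))
        elif before[key] != after[key]:
            # Updates carry both the old and the new image of the row.
            events.append(("update_preimage", key, before[key]))
            events.append(("update_postimage", key, after[key]))
    return events

v1 = {1: "alice", 2: "bob"}
v2 = {2: "bobby", 3: "carol"}
events = change_events(v1, v2)
```

Downstream consumers (e.g., incremental ETL or audit jobs) read these events instead of rescanning full table snapshots.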
Updates
- 2022-05-18: Include Issue 348 for the dynamic partition overwrite feature request
- 2022-05-03: Updated tables with Delta Lake 1.2 release.
- 2022-03-08: Based on community feedback, we are also prioritizing Hive/Delta writer, clones, and CDF
If there are other issues that should be considered within this roadmap, let's have a discussion here or via the Delta Users Slack #deltalake-oss channel.
Coincidentally good timing, together with The Onehouse announcement yesterday!!!
Auto optimize would be a great addition too. From the Databricks docs it looks like there are two types:
- Optimize writes: does some kind of smart repartition before writing. Is this possible in current Spark? Or does this involve something special in the Databricks Spark fork/Delta engine? The tricky part to me seems to be not grouping too much data into a single partition and getting a single file per partition, while dealing with potentially skewed data.
- Auto compaction: this seems straightforward once OPTIMIZE is implemented. My main question: is this (or should it be) a two-commit process (commit the original files, then trigger a compaction and commit the compaction result), or does it do the compaction before committing at all?
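On the auto compaction question: one plausible design (a sketch only, not necessarily how any engine implements it) is a separate follow-up commit that greedily bin-packs small files into larger rewrite groups. The planning step might look like this; the thresholds and function name are made up for illustration:

```python
def plan_compaction(file_sizes: list, small_threshold: int,
                    target_size: int) -> list:
    """Toy compaction planner: greedily bin-pack files smaller than
    `small_threshold` bytes into groups of roughly `target_size` bytes.
    Each group would be rewritten as one larger file in a follow-up
    commit, leaving files at or above the threshold alone."""
    small = sorted(s for s in file_sizes if s < small_threshold)
    bins, current, current_size = [], [], 0
    for size in small:
        if current and current_size + size > target_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins
```

A two-commit flow keeps the original write fast and lets the compaction commit fail or retry independently, at the cost of briefly exposing the small files.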
Created an item for Big Query connector: https://github.com/delta-io/connectors/issues/282
It would be great if CDF was open sourced soon. I'm really interested in this feature!
@novemberdude agree, add cdf please!
Thanks for your feedback @novemberdude and @Shadlezzz - we’ll definitely take this into consideration!
Would love to see a built-in solution for implementing a retention policy / archiving delta data on append-only tables - this would be a huge help for my team!
> It would be great if CDF was open sourced soon. I'm really interested in this feature!
Based on your feedback from here and Slack, we've added CDF, Cloning, and Hive/Delta writer to the roadmap. HTH!
> Would love to see a built-in solution for implementing a retention policy / archiving delta data on append-only tables - this would be a huge help for my team!
Hi @sliu4 - there are a number of possible solutions to what you're facing. Could you provide the context here, and/or don't hesitate to chime in on the Delta Users Slack? This may be worthy of us developing a PIP (project improvement plan) so we can get more feedback on the design.
hi @dennyglee - we have several very large prod level delta tables that we would like to gradually archive/glacierize and then we also have some delta data that we just want to delete outright after a certain period of time.
I know from speaking to our Databricks reps that there are solutions we can implement ourselves. For the first case, we can set up a view on the prod level data with a filter based on our glaciering schedule and advise users against querying the s3 path directly. For the second case, we can set up an automated job to do a DELETE and VACUUM.
Implementing these solutions is possible but currently has to be done on a case by case basis. Ideally we'd like a built in feature that can address this in a systematic way. We had hoped to use lifecycle policies and apply them across multiple buckets, but we know this doesn't play nicely with delta and the transaction log.
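As a sketch of what the automated DELETE + VACUUM job described above could generate (the table and column names are placeholders, and note that VACUUM only physically removes deleted files once they age out of the table's retention window):

```python
from datetime import date, timedelta

def retention_statements(table: str, ts_column: str, keep_days: int,
                         today: date) -> list:
    """Build the DELETE + VACUUM statements an automated retention job
    could run against an append-only Delta table. Table and column
    names are caller-supplied placeholders."""
    cutoff = today - timedelta(days=keep_days)
    return [
        # Logically remove rows older than the retention window.
        f"DELETE FROM {table} WHERE {ts_column} < '{cutoff.isoformat()}'",
        # Physically remove the deleted files once they fall outside
        # the table's VACUUM retention period.
        f"VACUUM {table}",
    ]

stmts = retention_statements("events", "event_date", 90, date(2022, 6, 1))
```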
> hi @dennyglee - we have several very large prod level delta tables that we would like to gradually archive/glacierize and then we also have some delta data that we just want to delete outright after a certain period of time. [...] Ideally we'd like a built in feature that can address this in a systematic way.
Hi @sliu4 - this is super interesting, and while the devil is in the details, the concept of a lifecycle policy may in fact work well in the context of Delta's transaction log, since we could use the transaction log to determine what to DELETE/VACUUM, etc. It implies that the lifecycle policy itself would categorize different tables at different policy granularity levels (e.g., HBI, MBI, LBI, or GDPR compliance), read the Delta transaction log, and then initiate the process within that context to ensure transactional consistency when running the lifecycle policy. Adding to this, you could probably utilize user metadata as a way to track a single lifecycle policy across multiple tables, and/or create a policy table that includes the table/version numbers for the associated application of the policy. It would be worth diving into more - if you're up for it, ping me on Slack and we can find a time to dive in further, eh?! HTH! Denny
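A rough sketch of the policy-categorization idea described above (the class names and retention windows are hypothetical; a real implementation would also read each table's transaction log to pick exact versions/files to act on):

```python
from datetime import date, timedelta

# Hypothetical policy classes mapped to retention windows, following the
# HBI/MBI/LBI example.
RETENTION_DAYS = {"HBI": 30, "MBI": 180, "LBI": 365}

def policy_cutoffs(table_classes: dict, today: date) -> dict:
    """For each registered table, compute the date before which data
    should be archived or deleted under its policy class."""
    return {
        table: today - timedelta(days=RETENTION_DAYS[cls])
        for table, cls in table_classes.items()
    }

cutoffs = policy_cutoffs({"clickstream": "LBI", "payments": "HBI"},
                         date(2022, 6, 1))
```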
Following.
@dennyglee I see that you have `CLONE` support as a Q3 priority. I'm wondering if that's still on track? And is the goal to fully migrate the closed-source Databricks implementation into this code base?
Hey @hoffrocket - let me get back to you on the timing of this - thanks!
> @dennyglee I see that you have `CLONE` support as a Q3 priority. I'm wondering if that's still on track? And is the goal to fully migrate the closed-source Databricks implementation into this code base?
This feature is very important for data DR.
Not to pile on, but also very keen on the `CLONE` functionality :-). It's the last blocker for us to build out our full DR plan, and it would really be great to keep this one on track for Q3 if possible.
We're trying our best @p2bauer - we're going to send out the proposed priorities in the next week or so for all of us to review and help prioritize. Thanks!
Closing this issue as we're working on #1307