Roadmap 2022 H1 (discussion)
This is the proposed Delta Lake 2022 H1 roadmap discussion thread. Below are the initially proposed items for the roadmap to be completed by June 2022. We will also be sending out a survey (we will update this issue with the survey) to get more feedback from the Delta Lake community!
Performance Optimizations
Based on the overwhelming feedback from the Delta Users Slack, Google Groups, Community AMAs (on Delta Lake YouTube), Delta Lake 2021H2 survey, and 2021H2 roadmap, we propose the following Delta Lake performance enhancements in the next two quarters.
Issue | Description | Target CY2022 |
---|---|---|
927 | OPTIMIZE (file compaction): Table optimize is an operation to rearrange the data and/or metadata to speed up queries and/or reduce the metadata size | Released in 1.2 |
923 | File skipping using columns stats: This is a performance optimization that aims at speeding up queries that contain filters (WHERE clauses) on non-partitionBy columns. | Released in 1.2 |
931 | Automatic data skipping using generated columns: Enhance generated columns to include automatic data skipping | Released in 1.2 |
1134 | OPTIMIZE ZORDER: Data clustering via multi-column locality-preserving space-filling curves with offline sorting. | Q3/Q4 |
| | MERGE Performance Improvements: We will be providing a project improvement plan (PIP) document shortly on the proposed design for discussion. | Q2/Q3 |
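For intuition on the Z-ordering item above: Z-ordering sorts rows by a value that interleaves the bits of several columns, so rows that are close in every dimension tend to land in the same files. This toy Python sketch (purely illustrative, not Delta's actual implementation) shows the interleaving for two integer columns:

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-value.

    Sorting rows by this value keeps nearby (x, y) pairs close together,
    which is the locality property OPTIMIZE ZORDER exploits for skipping.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return z

# Sorting by the Z-value clusters points that are close in both dimensions.
points = [(3, 1), (0, 0), (2, 2), (1, 3)]
ordered = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
```

Files written in this order carry tight min/max stats on both columns, which is what makes the data skipping effective.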
Schema Operations
For this year, our focus will be on column mapping.
Issue | Description | Target CY2022 |
---|---|---|
958 | Support for renaming column: Rename column with ALTER TABLE | Released in 1.2 |
957 | Support for arbitrary column names: Support characters in column names not allowed by Parquet | Released in 1.2 |
| | Support for dropping columns: Drop column with ALTER TABLE | Q2 |
348 | Support for dynamic partition overwrite: Currently you can overwrite using the replaceWhere option, but in various scenarios it is more convenient to specify the partitions to overwrite. | Q2 |
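To make the dynamic partition overwrite request (issue 348) concrete, here is a toy Python model of the intended semantics; the helper function and sample data are hypothetical, not Delta's API:

```python
def dynamic_partition_overwrite(table: dict, incoming: dict) -> dict:
    """Toy model of dynamic partition overwrite semantics.

    `table` and `incoming` map a partition value to its rows. Only the
    partitions that appear in the incoming batch are replaced; all other
    partitions are left untouched (unlike a full overwrite, which would
    drop them, or replaceWhere, which needs an explicit predicate).
    """
    result = dict(table)
    result.update(incoming)
    return result

existing = {"2022-01-01": ["a", "b"], "2022-01-02": ["c"]}
batch = {"2022-01-02": ["c2"]}
merged = dynamic_partition_overwrite(existing, batch)
# "2022-01-01" is kept; "2022-01-02" is replaced wholesale.
```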
Integrations
Extending from the recent releases of PrestoDB, Hive 3, and Delta Sink for Apache Flink Streams API, we have additional integrations planned.
Issue | Description | Target CY2022 |
---|---|---|
112 | Delta Source for Apache Pulsar: Build a Pulsar/Delta reader leveraging Delta Standalone. Join us via the Delta Users Slack #connector-pulsar channel. | Q2 |
238 | Flink Sink on Table API: Build a Flink/Delta sink (i.e., Flink writes to Delta Lake) using the Apache Flink Table API. Join us via the Delta Users Slack #flink-delta-connector channel and we have bi-weekly meetings on Tuesdays. | Q2/Q3 |
110 | Delta Source for Apache Flink: Build a Flink/Delta source (i.e., Flink reads from Delta Lake) leveraging Delta Standalone. Join us via the Delta Users Slack #flink-delta-connector channel and we have bi-weekly meetings on Tuesdays. | Q2/Q3 |
82 | Delta Source for Trino: Joint Delta Lake and Trino community collaboration on the following PRs: 10987, 10300. This is a community effort and all are welcome! Join us via the Delta Users Slack #trino channel; we have bi-weekly meetings on Thursdays. | Released |
| | Delta Source for BigQuery: Allows BigQuery to natively read Delta Lake tables. | Q2/Q3 |
523, 566 | Delta Rust Writer: Extending Delta Rust API to write to Delta Lake. | Q2/Q3 |
| | Hive/Delta writer: Extending Hive to write to Delta Lake | Q3 |
Operations Enhancements
Two very popular requests are planned for this half: Table Restore and S3 multi-cluster writes.
Issue | Description | Target CY2022 |
---|---|---|
903, 863 | Table Restore: Rollback to a previous version of a Delta table using Python, Scala, and/or SQL APIs. | Released in 1.2 |
41 | S3 multi-cluster writes: Allows multiple clusters/drivers/JVMs to concurrently write to S3 using DynamoDB as the lock store. Please refer to this PIP: [2021-12-22] Delta OSS S3 Multi-Cluster Writes | Released in 1.2 |
747 | delta.io.Guide: Enhance the Delta Lake documentation by creating a new guide (PIP will follow soon) | Q2/Q3 |
| | Iceberg to Delta Converter: Ability to convert an Iceberg table to a Delta table without a rewrite. | Q3 |
| | Table Cloning: Clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow: deep clones copy over the data from the source; shallow clones do not. | Q3 |
1105 | Change Data Feed: The Delta change data feed represents row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records “change events” for all the data written into the table. | Q2 |
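For the Change Data Feed row above, here is a toy sketch of the kind of row-level change events a change feed exposes, modeled as a diff of two table versions keyed by primary key. The function and sample data are illustrative only; Delta's actual CDF records the events at write time rather than diffing versions after the fact:

```python
def change_events(before: dict, after: dict) -> list:
    """Toy diff of two table versions keyed by primary key, emitting
    row-level change events of the kind a change data feed records."""
    events = []
    for key in sorted(before.keys() | after.keys()):
        if key not in after:
            events.append(("delete", key, before[key]))
        elif key not in before:
            events.append(("insert", key, after[key]))
        elif before[key] != after[key]:
            # Updates carry both the old and the new image of the row.
            events.append(("update_preimage", key, before[key]))
            events.append(("update_postimage", key, after[key]))
    return events

v1 = {1: "alice", 2: "bob"}
v2 = {2: "bobby", 3: "carol"}
events = change_events(v1, v2)
```

Downstream consumers (e.g., incremental ETL or audit jobs) read these events instead of rescanning full table snapshots.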
Updates
- 2022-05-18: Include Issue 348 for the dynamic partition overwrite feature request
- 2022-05-03: Updated tables with Delta Lake 1.2 release.
- 2022-03-08: Based on community feedback, we are also prioritizing Hive/Delta writer, clones, and CDF
If there are other issues that should be considered within this roadmap, let's have a discussion here or via the Delta Users Slack #deltalake-oss channel.
Coincidentally good timing, together with The Onehouse announcement yesterday!!!
Auto optimize would be a great addition too. From the Databricks docs it looks like there are two types:
- Optimize writes: does some kind of smart repartition before writing. Is this possible in current Spark? Or does this involve something special in the Databricks Spark fork/Delta engine? The tricky part to me seems to be not grouping too much data into a single partition and getting a single file per partition, while dealing with potentially skewed data.
- Auto compaction: this seems straightforward once OPTIMIZE is implemented. My main question: is this (or should it be) a two-commit process (commit the original files, then trigger a compaction and commit the compaction result), or does it do the compaction before committing at all?
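On the auto compaction question: one plausible design (a sketch only, not necessarily how any engine implements it) is a separate follow-up commit that greedily bin-packs small files into larger rewrite groups. The planning step might look like this; the thresholds and function name are made up for illustration:

```python
def plan_compaction(file_sizes: list, small_threshold: int,
                    target_size: int) -> list:
    """Toy compaction planner: greedily bin-pack files smaller than
    `small_threshold` bytes into groups of roughly `target_size` bytes.
    Each group would be rewritten as one larger file in a follow-up
    commit, leaving files at or above the threshold alone."""
    small = sorted(s for s in file_sizes if s < small_threshold)
    bins, current, current_size = [], [], 0
    for size in small:
        if current and current_size + size > target_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins
```

A two-commit flow keeps the original write fast and lets the compaction commit fail or retry independently, at the cost of briefly exposing the small files.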
Created an item for Big Query connector: https://github.com/delta-io/connectors/issues/282
It would be great if CDF was open sourced soon. I'm really interested in this feature!
@novemberdude agree, add cdf please!
Thanks for your feedback @novemberdude and @Shadlezzz - we’ll definitely take this into consideration!
Would love to see a built-in solution for implementing a retention policy / archiving delta data on append-only tables - this would be a huge help for my team!
> It would be great if CDF was open sourced soon. I'm really interested in this feature!
Based on your feedback from here and Slack, we've added CDF, Cloning, and Hive/Delta writer to the roadmap. HTH!
> Would love to see a built-in solution for implementing a retention policy / archiving delta data on append-only tables - this would be a huge help for my team!
Hi @sliu4 - there are a number of possible solutions to what you're facing. Could you provide the context here, and/or don't hesitate to chime in on the Delta Users Slack? This may be worthy of us developing a PIP (project improvement plan) so we can get more feedback on the design.
hi @dennyglee - we have several very large prod level delta tables that we would like to gradually archive/glacierize and then we also have some delta data that we just want to delete outright after a certain period of time.
I know from speaking to our Databricks reps that there are solutions we can implement ourselves. For the first case, we can set up a view on the prod level data with a filter based on our glaciering schedule and advise users against querying the s3 path directly. For the second case, we can set up an automated job to do a DELETE and VACUUM.
Implementing these solutions is possible but currently has to be done on a case by case basis. Ideally we'd like a built in feature that can address this in a systematic way. We had hoped to use lifecycle policies and apply them across multiple buckets, but we know this doesn't play nicely with delta and the transaction log.
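As a sketch of what the automated DELETE + VACUUM job described above could generate (the table and column names are placeholders, and note that VACUUM only physically removes deleted files once they age out of the table's retention window):

```python
from datetime import date, timedelta

def retention_statements(table: str, ts_column: str, keep_days: int,
                         today: date) -> list:
    """Build the DELETE + VACUUM statements an automated retention job
    could run against an append-only Delta table. Table and column
    names are caller-supplied placeholders."""
    cutoff = today - timedelta(days=keep_days)
    return [
        # Logically remove rows older than the retention window.
        f"DELETE FROM {table} WHERE {ts_column} < '{cutoff.isoformat()}'",
        # Physically remove the deleted files once they fall outside
        # the table's VACUUM retention period.
        f"VACUUM {table}",
    ]

stmts = retention_statements("events", "event_date", 90, date(2022, 6, 1))
```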
> hi @dennyglee - we have several very large prod level delta tables that we would like to gradually archive/glacierize and then we also have some delta data that we just want to delete outright after a certain period of time. [...] Ideally we'd like a built in feature that can address this in a systematic way.
Hi @sliu4 - this is super interesting, and while the devil is in the details, the concept of a lifecycle policy may in fact work well in the context of Delta's transaction log, since we could use the transaction log to determine what to DELETE/VACUUM, etc. It implies that the lifecycle policy itself would categorize different tables at different policy granularity levels (e.g., HBI, MBI, LBI, or GDPR compliance), read the Delta transaction log, and then initiate the process within that context to ensure transactional consistency when running the lifecycle policy. Adding to this, you could probably utilize user metadata as a way to track a single lifecycle policy across multiple tables, and/or create a policy table that includes the table/version numbers for the associated application of the policy. It would be worth diving into more - if you're up for it, ping me on Slack and we can find a time to dive in further, eh?! HTH! Denny
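A rough sketch of the policy-categorization idea described above (the class names and retention windows are hypothetical; a real implementation would also read each table's transaction log to pick exact versions/files to act on):

```python
from datetime import date, timedelta

# Hypothetical policy classes mapped to retention windows, following the
# HBI/MBI/LBI example.
RETENTION_DAYS = {"HBI": 30, "MBI": 180, "LBI": 365}

def policy_cutoffs(table_classes: dict, today: date) -> dict:
    """For each registered table, compute the date before which data
    should be archived or deleted under its policy class."""
    return {
        table: today - timedelta(days=RETENTION_DAYS[cls])
        for table, cls in table_classes.items()
    }

cutoffs = policy_cutoffs({"clickstream": "LBI", "payments": "HBI"},
                         date(2022, 6, 1))
```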
Following.
@dennyglee I see that you have `CLONE` support as a Q3 priority. I'm wondering if that's still on track? And is the goal to fully migrate the closed-source Databricks implementation into this code base?
Hey @hoffrocket - let me get back to you on the timing of this - thanks!
> @dennyglee I see that you have `CLONE` support as a Q3 priority. I'm wondering if that's still on track? And is the goal to fully migrate the closed-source Databricks implementation into this code base?
This feature is very important for data DR.
Not to pile on, but also very keen on the `CLONE` functionality :-). It's the last blocker for us to build out our full DR plan, and it would really be great to keep this one on track for Q3 if possible.
We're trying our best @p2bauer - we're going to send out the proposed priorities in the next week or so for all of us to review and help prioritize. Thanks!
Closing this issue as we're working on #1307