Delta Lake connector
- [x] https://github.com/trinodb/trino/pull/10897
- [x] https://github.com/trinodb/trino/issues/11297
- [x] https://github.com/trinodb/trino/issues/11299
- [x] https://github.com/trinodb/trino/issues/11300
- [x] https://github.com/trinodb/trino/issues/11325
- [x] https://github.com/trinodb/trino/issues/11369
- [ ] https://github.com/trinodb/trino/issues/12004
- [ ] https://github.com/trinodb/trino/issues/12005
- [x] https://github.com/trinodb/trino/issues/12007
- [ ] https://github.com/trinodb/trino/issues/12008
- [x] https://github.com/trinodb/trino/issues/12009
- [ ] https://github.com/trinodb/trino/issues/12011
- [x] https://github.com/trinodb/trino/issues/12012
- [x] https://github.com/trinodb/trino/issues/12013
- [x] https://github.com/trinodb/trino/issues/12014
- [ ] https://github.com/trinodb/trino/issues/12018
- [ ] https://github.com/trinodb/trino/issues/12028
- [ ] https://github.com/trinodb/trino/issues/12029
- [x] https://github.com/trinodb/trino/issues/12030
- [x] https://github.com/trinodb/trino/issues/12031
- [ ] https://github.com/trinodb/trino/issues/12032
- [x] https://github.com/trinodb/trino/issues/12033
- [x] https://github.com/trinodb/trino/issues/12034
- [ ] https://github.com/trinodb/trino/issues/12040
- [x] https://github.com/trinodb/trino/issues/12041
- [x] https://github.com/trinodb/trino/issues/13169
- [x] https://github.com/trinodb/trino/issues/13538
- [x] https://github.com/trinodb/trino/issues/15894
- [ ] https://github.com/trinodb/trino/issues/16985
cc @jirassimok @alexjo2144
Based on TODOs in the code, I created the following issues related to the Delta Lake connector:
- Enforce planned frequency for checkpoints #12004
- Write Delta Lake "operationMetrics" Transaction Log Field #12005
- Add Timestamp predicate pushdown to Parquet in Delta Lake #12007
- Harden AWS transaction locking mechanism to support long running I/O operations #12008
- Move TestDeltaLakeAdlsConnectorSmokeTest.testDropSchemaExternalFiles to base class #12009
- Add tests for SelectedPortWaitStrategy #12010
- DeltaLake cleanupFailedWrite should happen in a background thread #12011
- DeltaLake support delta.hide-non-delta-lake-tables with file metastore #12012
- DeltaLake add parameters == null check to DefaultGlueMetastoreTableFilterProvider.isDeltaLakeTable #12013
- DeltaLake support delta.hide-non-delta-lake-tables with thrift metastore #12014
- DeltaLake Consider removing directories during vacuum #12018
- DeltaLake validate if schema didn't diverge in CheckpointBuilder #12028
- DeltaLake investigate if transaction entry in checkpoint builder is correct #12029
- DeltaLake extract buildSchemaProperties from CheckpointWriter #12030
- DeltaLake determine stats format in checkpoint based on the table configuration #12031
- DeltaLake rework the way cacheMetadataEntries are processed #12032
- DeltaLake add refreshing of log expiration time in S3TransactionLogSynchronizer #12033
- DeltaLake use AccessTrackingFilesystem together with InterfaceTestUtils #12034
- Refactor DeltaLakePageSourceProvider to use file system passed via TableSnapshot #12040
- Add tests for SelectedPortWaitStrategy and move it to trino-testing #12041
@homar thanks! I moved the above list into the issue description. Feel free to remove the checkboxes from your comment (or the whole list).
Do I get it right that vanilla Databricks is not yet supported? This connector requires a thrift scheme in the Hive metastore connection string (`IllegalArgumentException: metastoreUri scheme must be thrift`), and AFAIK Databricks only exposes a JDBC connection string (e.g. `jdbc:spark://adb-123456789.5.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/123456789/0427-122644-45iadnd;AuthMech=3;UID=token;PWD=<personal-access-token>`). You can only use Thrift if you set up a custom metastore for Databricks.
> You can only use Thrift if you set up a custom metastore for Databricks.

Yes. Or, use Glue.
> AFAIK Databricks only exposes a JDBC connection string (e.g. `jdbc:spark://adb-123456789.5.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/123456789/0427-122644-45iadnd;AuthMech=3;UID=token;PWD=<personal-access-token>`).

We have no plans to connect to the Databricks runtime using Databricks JDBC. That would kill most of the benefits of this connector.
Here are the Databricks docs for setting up an external HMS or Glue: https://docs.databricks.com/data/metastores/index.html. Both of those options are supported.
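For anyone configuring this, the catalog definition would look roughly like the sketch below. The file name, catalog name, metastore host, and Glue region are placeholders, and the connector name spelling has changed across Trino releases, so check the docs for your version:

```properties
# etc/catalog/delta.properties -- Delta Lake catalog backed by an external Hive metastore
# (older Trino releases spell the connector name delta-lake)
connector.name=delta_lake
hive.metastore.uri=thrift://metastore.example.com:9083

# Alternative: back the catalog with AWS Glue instead of a thrift HMS
# hive.metastore=glue
# hive.metastore.glue.region=us-east-1
```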
@alexjo2144 Thanks. I was researching whether we can use dbt with various sources all through Trino (inspired by this video), and it seems that Databricks is doable as well, although integrating directly through the dbt-databricks plugin is more straightforward. For future generations: using Databricks through the dbt-trino plugin requires setting up and maintaining your own Hive metastore instance and creating a global init script that points each cluster's config at that metastore (see the sketch below). Also, DBFS is not supported with this method.
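For completeness, the global init script would look roughly like this. The conf file path, the `[driver]` block syntax, and the property name are my recollection of the Databricks external-metastore docs, so treat all of them as assumptions to verify, and replace the thrift URI with your own metastore:

```bash
#!/bin/bash
# Hypothetical global init script: point every Databricks cluster at the same
# external Hive metastore that the Trino Delta Lake catalog uses.
cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
[driver] {
  # the spark.hadoop prefix propagates the setting to the metastore client
  "spark.hadoop.hive.metastore.uris" = "thrift://metastore.example.com:9083"
}
EOF
```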
I would think these should change things: https://www.databricks.com/blog/extending-databricks-unity-catalog-open-apache-hive-metastore-api, and Trino versions 440 and above seem to have support for integrating with the Databricks HMS API: https://trino.io/docs/current/object-storage/metastores.html#thrift-metastore-configuration-properties. Does that look promising?