feat: add deltaOps set metadata operation

Open HawaiianSpork opened this issue 10 months ago • 5 comments

Description

Allow explicitly changing the metadata of a Delta table. This enables simple schema migrations such as changing the metadata of a column or adding new nullable columns. The code does not currently check that the table would still be readable after the metadata change. The setMetadata operation is similar to mergeSchema but does not require a write at the same time, so it can be run and tested as part of a deployment instead of on the next write of data.

Note: you used to be able to do this by calling DeltaOps::create again with overwrite on an existing table, but since that was recently fixed to delete old data, this operation restores the original behavior.

HawaiianSpork avatar May 02 '24 03:05 HawaiianSpork

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

github-actions[bot] avatar May 02 '24 03:05 github-actions[bot]

Unfortunately it isn't that simple. If you do it like this, you could put the table in an invalid state, because the metadata contains schema, partitionColumns and configuration. Each one of them needs many checks before it can be changed.

For the configuration part I have 2 PRs open: #2264 #2075

For partitionColumns, you can't change that; at this point we don't allow evolving the partition columns of a table. As for schema evolution or other schema changes, that all needs to go into operations such as ALTER TABLE DROP COLUMN and ALTER TABLE ADD COLUMN.
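
To make the kind of checking discussed here concrete, below is a minimal sketch of a compatibility check between old and new metadata. It is not delta-rs code: `Field`, `Metadata`, and `check_compatible` are hypothetical names, the schema is flattened to top-level columns with string type names, and `configuration` is carried along but not validated. The rules it assumes are the ones mentioned in this thread: partition columns may not change, existing columns must keep their type, and added columns must be nullable.

```rust
// Hypothetical, simplified model of table metadata; these are NOT delta-rs
// types. Real metadata also has nested types, ids, and more to validate.
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String, // e.g. "long" or "string"
    nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)] // `configuration` is not validated in this sketch
struct Metadata {
    schema: Vec<Field>,
    partition_columns: Vec<String>,
    configuration: HashMap<String, String>,
}

/// Returns Ok(()) if `new` could replace `old` under the simplified rules
/// assumed in this sketch, or a message naming the first violation.
fn check_compatible(old: &Metadata, new: &Metadata) -> Result<(), String> {
    // Assumption from the thread: partition columns cannot evolve.
    if old.partition_columns != new.partition_columns {
        return Err("partition columns cannot be changed".to_string());
    }

    // Every existing column must survive with the same type and must not
    // become stricter about nulls than the data already written.
    for old_field in &old.schema {
        match new.schema.iter().find(|f| f.name == old_field.name) {
            None => return Err(format!("column `{}` was dropped", old_field.name)),
            Some(f) if f.data_type != old_field.data_type => {
                return Err(format!("column `{}` changed type", old_field.name));
            }
            Some(f) if old_field.nullable && !f.nullable => {
                return Err(format!("column `{}` became non-nullable", old_field.name));
            }
            Some(_) => {}
        }
    }

    // Added columns must be nullable, because existing data files have no
    // values for them.
    for new_field in &new.schema {
        let is_new = !old.schema.iter().any(|f| f.name == new_field.name);
        if is_new && !new_field.nullable {
            return Err(format!("new column `{}` must be nullable", new_field.name));
        }
    }
    Ok(())
}
```

Under these assumptions, adding a nullable column passes, while adding a non-nullable column or altering the partition columns is rejected.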

ion-elgreco avatar May 02 '24 06:05 ion-elgreco

Thank you @ion-elgreco, I was not aware that you had added support for setting table properties with #2264. If this operation added more checks that the old and new metadata are compatible, would that be acceptable? An ADD COLUMN feature would be great, but it is missing the ability to modify existing columns (to add nested fields to structs), which I would like to use.

HawaiianSpork avatar May 06 '24 20:05 HawaiianSpork

@HawaiianSpork I don't see why you wouldn't be able to add a nested field to a struct column with ADD COLUMN.

I think it's still safe, since you are only adding something. But it's probably good to verify what happens when you read two Parquet files with partially different struct schemas.
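
As an illustration of why a purely additive change to a struct can be treated as safe, here is a minimal sketch of a recursive schema merge. It is not the delta-rs schema evolution code: `DataType`, `Field`, and `merge_fields` are hypothetical names, and the model assumes that a field missing from an older file can be read back as null, which is why new nested fields are required to be nullable.

```rust
// Hypothetical, simplified schema model; NOT the delta-rs or Arrow types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Primitive(String),
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

/// Merges `incoming` fields into `existing`, recursing into struct columns.
/// New fields are appended (top level or nested); type changes are rejected.
fn merge_fields(existing: &[Field], incoming: &[Field]) -> Result<Vec<Field>, String> {
    let mut merged = existing.to_vec();
    for inc in incoming {
        let pos = merged.iter().position(|f| f.name == inc.name);
        match pos {
            // Brand-new field: assumed safe only if older files can
            // surface it as null.
            None => {
                if !inc.nullable {
                    return Err(format!("new field `{}` must be nullable", inc.name));
                }
                merged.push(inc.clone());
            }
            // Existing field: recurse into structs, otherwise require the
            // type to be unchanged.
            Some(i) => {
                let updated = match (&merged[i].data_type, &inc.data_type) {
                    (DataType::Struct(a), DataType::Struct(b)) => {
                        DataType::Struct(merge_fields(a, b)?)
                    }
                    (a, b) if a == b => a.clone(),
                    _ => return Err(format!("field `{}` changed type", inc.name)),
                };
                merged[i].data_type = updated;
            }
        }
    }
    Ok(merged)
}
```

In this model, merging an `address` struct that has only `city` with one that has `city` and a nullable `zip` yields a struct with both fields, while replacing the struct with a primitive column is an error. Whether real Parquet readers handle the mixed files correctly is exactly what would still need to be verified.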

ion-elgreco avatar May 06 '24 21:05 ion-elgreco

Good point, I had assumed ADD COLUMN only worked on top-level columns, but at least in the Spark world nested columns are supported. So I guess I have to add ADD COLUMN support to delta-rs...

HawaiianSpork avatar May 07 '24 02:05 HawaiianSpork

@HawaiianSpork FYI, I am adding an add column operation here: https://github.com/delta-io/delta-rs/pull/2562. It will support nested columns as well, since we leverage the schema evolution code.

So I will close this one.

ion-elgreco avatar Jun 04 '24 10:06 ion-elgreco