delta-rs
Identity Columns
Description
Feature definition: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#identity-columns
take
I'll use this as a design/planning/progress board.
Extra points not explicitly mentioned in the Delta protocol definition, from the Databricks documentation (https://learn.microsoft.com/en-us/azure/databricks/delta/generated-columns#--use-identity-columns-in-delta-lake):
- An Identity column must be defined at table creation time; it cannot be added to an existing table
- Declaring an identity column on a Delta table disables concurrent transactions on the table
- The Delta-io implementation of GENERATED columns is incomplete/flaky at the moment (I was unable to make it run due to some obscure errors; in fact, the SQL and DeltaTable API methods show different errors despite using the same Spark instance), so the behavior not explicitly defined in the protocol was tested on Databricks
Proposed implementation steps:
Part 1
- [x] ~Define a structure to hold an Identity column properties~
- [ ] Extend the table metadata to store Identity column related properties
- [ ] Allow adding an Identity column definition to a schema on table create
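A minimal sketch of what the Part 1 structure could look like. The `IdentityColumn` type and its methods are hypothetical names for illustration, not existing delta-rs API. The metadata keys are the ones listed in the protocol's Identity Columns section; values are rendered as strings here for simplicity, whereas the real schema metadata stores JSON values.

```rust
use std::collections::BTreeMap;

/// Hypothetical holder for Identity column properties (not delta-rs API).
#[derive(Debug, Clone, PartialEq)]
pub struct IdentityColumn {
    /// First value to generate (`delta.identity.start`).
    pub start: i64,
    /// Increment between generated values (`delta.identity.step`).
    pub step: i64,
    /// Last recorded mark, absent until a value has been generated.
    pub high_water_mark: Option<i64>,
    /// Whether explicitly inserted values are allowed.
    pub allow_explicit_insert: bool,
}

impl IdentityColumn {
    /// Build a definition at table create time; a step of 0 is rejected.
    pub fn new(start: i64, step: i64, allow_explicit_insert: bool) -> Result<Self, String> {
        if step == 0 {
            return Err("identity step must not be 0".to_string());
        }
        Ok(Self { start, step, high_water_mark: None, allow_explicit_insert })
    }

    /// Render the properties as column metadata entries.
    /// Assumption: string values are a simplification of the JSON metadata.
    pub fn to_metadata(&self) -> BTreeMap<String, String> {
        let mut m = BTreeMap::new();
        m.insert("delta.identity.start".to_string(), self.start.to_string());
        m.insert("delta.identity.step".to_string(), self.step.to_string());
        m.insert(
            "delta.identity.allowExplicitInsert".to_string(),
            self.allow_explicit_insert.to_string(),
        );
        // The high water mark is only written once a value has been generated.
        if let Some(hwm) = self.high_water_mark {
            m.insert("delta.identity.highWaterMark".to_string(), hwm.to_string());
        }
        m
    }
}
```

Keeping `high_water_mark` as an `Option` makes the "nothing generated yet" state explicit instead of encoding it with a sentinel value.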
Part 2
- [ ] Implement only the `generatedAlwaysAs` option, which does not allow the generated identity values to be overwritten
- [ ] On write, generate the Identity column values. Fail if generated values overflow
- [ ] Extend the write commit to record `delta.identity.highWaterMark`
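The Part 2 write path could be sketched roughly as follows. `generate_identity_values` is a hypothetical helper, not delta-rs API. It assumes the first generated value equals the start value and that the high water mark is simply the last value generated; neither assumption is verified against the Spark implementation here.

```rust
/// Hypothetical helper (not delta-rs API): generate `count` identity values
/// and return them with the new high water mark to record in the commit.
pub fn generate_identity_values(
    start: i64,
    step: i64,
    high_water_mark: Option<i64>,
    count: usize,
) -> Result<(Vec<i64>, Option<i64>), String> {
    if step == 0 {
        return Err("identity step must not be 0".to_string());
    }
    let mut values = Vec::with_capacity(count);
    let mut last = high_water_mark;
    for _ in 0..count {
        let next = match last {
            // Assumption: with no mark recorded yet, start from `start`.
            None => start,
            // checked_add returns None on overflow, which fails the write.
            Some(prev) => prev
                .checked_add(step)
                .ok_or_else(|| "identity value overflow".to_string())?,
        };
        values.push(next);
        last = Some(next);
    }
    Ok((values, last))
}
```

Returning the updated mark alongside the values lets the caller write both the data and the metadata change in the same commit.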
Part 3
- [ ] Implement the `generatedByDefaultAs` option, which allows the generated identity values to be overwritten by explicitly specifying the Identity values. The protocol does not define the exact behavior, so this is how it works in Databricks:
  - The specified Identity value is not validated against the current `delta.identity.highWaterMark` or the IDENTITY column definition rule, e.g. it can be any value of INT type, including a value smaller than `START WITH` (e.g. negative), an existing value, or a future value that would be generated if not explicitly specified
  - `delta.identity.highWaterMark` is not changed when an explicit value is used
- [ ] Detect and fail concurrent writes on tables with an Identity column
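The Databricks behavior described for `generatedByDefaultAs` could be sketched as below. `fill_identity_values` is a hypothetical helper, not delta-rs API, and it carries the same assumptions about the start value and the high water mark as any Part 2 generator would: explicit values pass through unvalidated and leave the mark untouched, and only missing values are generated.

```rust
/// Hypothetical helper (not delta-rs API): each row carries either an
/// explicit identity value (`Some`) or a request to generate one (`None`).
pub fn fill_identity_values(
    start: i64,
    step: i64,
    high_water_mark: Option<i64>,
    rows: &[Option<i64>],
) -> Result<(Vec<i64>, Option<i64>), String> {
    if step == 0 {
        return Err("identity step must not be 0".to_string());
    }
    let mut values = Vec::with_capacity(rows.len());
    let mut mark = high_water_mark;
    for row in rows {
        match row {
            // Explicit value: no validation, and the mark is not advanced.
            Some(v) => values.push(*v),
            // Missing value: generate the next one and advance the mark.
            None => {
                let next = match mark {
                    None => start,
                    Some(prev) => prev
                        .checked_add(step)
                        .ok_or_else(|| "identity value overflow".to_string())?,
                };
                values.push(next);
                mark = Some(next);
            }
        }
    }
    Ok((values, mark))
}
```

Because explicit values bypass the mark, this sketch can produce duplicates: an explicit value may collide with one that was or will be generated, which matches the behavior noted above.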
On point 5, my understanding is you indeed have to check for duplicates during an append or merge, since the column values need to be unique.
As for the other points, it is probably easy to check what spark-delta does.
I've done some research and updated the execution plan above. The Delta-io implementation of GENERATED columns is incomplete/flaky at the moment (I was unable to make it run due to some obscure errors; in fact, the SQL and DeltaTable API methods show different errors despite using the same Spark instance), so the behavior not explicitly defined in the protocol was tested on Databricks.