delta-rs

Identity Columns

Open r3stl355 opened this issue 1 year ago • 4 comments

Description

Feature definition: https://github.com/delta-io/delta/blob/master/PROTOCOL.md#identity-columns

r3stl355 avatar Dec 15 '23 13:12 r3stl355

take

r3stl355 avatar Dec 15 '23 15:12 r3stl355

I'll use this as a design/planning/progress board.

Extra points not explicitly mentioned in the Delta protocol definition, taken from the Databricks documentation (https://learn.microsoft.com/en-us/azure/databricks/delta/generated-columns#--use-identity-columns-in-delta-lake):

  • An Identity column must be defined at table creation time; it cannot be added to an existing table
  • Declaring an identity column on a Delta table disables concurrent transactions on the table
  • The delta-io implementation of GENERATED columns is incomplete/flaky at the moment (I was unable to make it run due to some obscure errors; SQL and DeltaTable API methods actually show different errors despite using the same Spark instance), so any behavior/outcome not explicitly defined in the protocol was tested on Databricks

Proposed implementation steps:

Part 1

  • [x] ~Define a structure to hold an Identity column's properties~
  • [ ] Extend the table metadata to store Identity column related properties
  • [ ] Allow adding an Identity column definition to a schema on table create
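
As a rough illustration of the Part 1 items, here is a minimal Python sketch of a structure holding the Identity properties and serializing them into column field metadata. The `delta.identity.*` keys come from the protocol definition linked above; the class name, field names, defaults, and the zero-step check are assumptions made for this sketch and are not the delta-rs API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IdentityColumn:
    """Hypothetical holder for the identity properties of one column."""

    start: int = 1  # assumed default, mirrors START WITH 1
    step: int = 1  # assumed default, mirrors INCREMENT BY 1
    high_water_mark: Optional[int] = None  # unset until a value has been generated
    allow_explicit_insert: bool = False  # False models generatedAlwaysAs

    def __post_init__(self) -> None:
        # Assumption: a zero step is rejected because it could never
        # produce distinct values.
        if self.step == 0:
            raise ValueError("identity step must not be zero")

    def to_field_metadata(self) -> dict:
        """Serialize to the delta.identity.* keys of the column's field metadata."""
        meta = {
            "delta.identity.start": self.start,
            "delta.identity.step": self.step,
            "delta.identity.allowExplicitInsert": self.allow_explicit_insert,
        }
        # The high-water mark is only written once a value has been generated.
        if self.high_water_mark is not None:
            meta["delta.identity.highWaterMark"] = self.high_water_mark
        return meta
```

For example, `IdentityColumn(start=10, step=5).to_field_metadata()` yields the three definition keys and omits `delta.identity.highWaterMark` until a write has generated a value.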

Part 2

  • [ ] Implement only the generatedAlwaysAs option, which does not allow generated identity values to be overridden on insert
  • [ ] On write, generate the Identity column values. Fail if generated values overflow
  • [ ] Extend the write commit to record delta.identity.highWaterMark
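
The Part 2 write path could look roughly like the sketch below: derive the next value from the high-water mark, generate one value per appended row, fail on overflow, and return the new high-water mark for the commit to record. The function name and the signed 64-bit bounds are assumptions for this sketch. It is not the delta-rs implementation.

```python
from typing import List, Optional, Tuple

# Assumption: identity values must fit a signed 64-bit integer.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def generate_identity_values(
    start: int, step: int, high_water_mark: Optional[int], count: int
) -> Tuple[List[int], Optional[int]]:
    """Return (values, new_high_water_mark) for `count` appended rows."""
    if step == 0:
        raise ValueError("identity step must not be zero")
    if count < 0:
        raise ValueError("count must not be negative")
    if count == 0:
        return [], high_water_mark  # nothing generated, mark unchanged

    # The first write starts at `start`; later writes continue from the mark.
    first = start if high_water_mark is None else high_water_mark + step
    last = first + (count - 1) * step

    # The sequence is monotonic, so checking both ends covers every value.
    for bound in (first, last):
        if not INT64_MIN <= bound <= INT64_MAX:
            raise OverflowError("generated identity value overflows 64-bit range")

    values = [first + i * step for i in range(count)]
    return values, last
```

With a negative step the returned mark is the lowest generated value, so the same "mark plus step" rule continues the sequence on the next write.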

Part 3

  • [ ] Implement the generatedByDefaultAs option, allowing generated identity values to be overridden by explicitly specifying the Identity values. The protocol does not define the exact behavior, so this is how it works in Databricks:
    • The specified Identity value is not validated against the current delta.identity.highWaterMark or the IDENTITY column definition rule, e.g. it can be any value of INT type, including a value smaller than START WITH (e.g. negative), an existing value, or a future value that would be generated when none is explicitly specified
    • delta.identity.highWaterMark is not changed when an explicit value is used
  • [ ] Detect and fail concurrent writes on tables with an Identity column
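
To make the generatedByDefaultAs behavior described above concrete, here is a small Python sketch. Explicit values are passed through without validation and leave the high-water mark untouched, while missing values are generated from the mark. The function name and the use of `None` for "no explicit value" within a single batch are assumptions for this sketch; overflow checks are left out for brevity.

```python
from typing import List, Optional, Tuple


def fill_identity_values(
    start: int,
    step: int,
    high_water_mark: Optional[int],
    provided: List[Optional[int]],
) -> Tuple[List[int], Optional[int]]:
    """Fill None entries with generated values; keep explicit values as given."""
    next_value = start if high_water_mark is None else high_water_mark + step
    out: List[int] = []
    for value in provided:
        if value is not None:
            # Explicit value: not validated, and the high-water mark
            # is deliberately left unchanged.
            out.append(value)
        else:
            out.append(next_value)
            high_water_mark = next_value  # only generated values move the mark
            next_value += step
    return out, high_water_mark
```

Calling `fill_identity_values(1, 1, 2, [None, -7, 3, None])` returns `([3, -7, 3, 4], 4)`: the explicit `3` duplicates a generated value, which shows why uniqueness is not guaranteed under this behavior.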

r3stl355 avatar Dec 17 '23 22:12 r3stl355

On point 5, my understanding is you indeed have to check for duplicates during an append or merge, since the column values need to be unique.

Also, for the other points, it is probably easy to check what spark-delta does.

ion-elgreco avatar Dec 17 '23 22:12 ion-elgreco

I've done some research and updated the execution plan above. The delta-io implementation of GENERATED columns is incomplete/flaky at the moment (I was unable to make it run due to some obscure errors; SQL and DeltaTable API methods actually show different errors despite using the same Spark instance), so any behavior/outcome not explicitly defined in the protocol was tested on Databricks.

r3stl355 avatar Jan 14 '24 16:01 r3stl355