
Rename confusing "primary_key" on Incremental class

Open rudolfix opened this issue 6 months ago • 1 comments

Background

primary_key on Incremental tells the class which columns to use for boundary deduplication. Because those columns should identify rows uniquely, they are very often the same columns used as the resource's primary key. Still, the function is very different: setting primary_key on Incremental does not affect the resource settings.
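
To make "boundary deduplication" concrete, here is a minimal sketch in plain Python. It is not dlt's implementation: `row_hash`, `filter_new_rows`, and the `state` layout are hypothetical names, and the sketch assumes that rows whose cursor value equals the last stored value are compared by a hash of the key columns.

```python
import hashlib
import json


def row_hash(row, key_columns):
    """Hash the key columns of a row; with no key columns, hash the whole row."""
    payload = {c: row[c] for c in key_columns} if key_columns else row
    return hashlib.sha1(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def filter_new_rows(rows, cursor, state, key_columns=()):
    """Keep rows at or after state['last_value'], dropping boundary duplicates."""
    last_value = state.get("last_value")
    seen = set(state.get("boundary_hashes", []))
    kept = []
    for row in rows:
        value = row[cursor]
        if last_value is not None and value < last_value:
            continue  # strictly older than the boundary: loaded in an earlier run
        if value == last_value and row_hash(row, key_columns) in seen:
            continue  # same cursor value and same key: a boundary duplicate
        kept.append(row)
    if kept:
        new_last = max(row[cursor] for row in kept)
        hashes = {row_hash(r, key_columns) for r in kept if r[cursor] == new_last}
        if new_last == last_value:
            hashes |= seen  # the boundary did not move, so keep earlier hashes too
        state["last_value"] = new_last
        state["boundary_hashes"] = sorted(hashes)
    return kept


state = {}
first = filter_new_rows(
    [{"id": 1, "updated_at": 5}, {"id": 2, "updated_at": 7}],
    "updated_at", state, key_columns=("id",),
)
# The second run sees id=2 again at the boundary value 7, plus a new row id=3.
second = filter_new_rows(
    [{"id": 2, "updated_at": 7}, {"id": 3, "updated_at": 7}],
    "updated_at", state, key_columns=("id",),
)
```

Here the key columns only decide which rows at the boundary count as "the same row". Nothing in the sketch touches a resource's primary key, which is the distinction this issue is about.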

Tasks

    • [ ] Add a new field to the Incremental class, e.g. dedup_key, and use it internally, in examples, and in docs. See how primary_key is currently implemented.
    • [ ] Issue a deprecation warning, e.g. in __init__, property setters, etc. (Incremental is also a configspec!)
    • [ ] Propagate primary_key to dedup_key when set, and vice versa! (We must stay backward compatible, so reading the dedup key via the primary_key property must still work!)
    • [ ] Test (3): both when using the initializer and when setting the property. Make sure that merge and parse_native_value work.

luckily there's very little usage in our code base for this prop
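
A rough sketch of how the tasks above could fit together, using a stand-in class rather than the real Incremental. Everything here is an assumption for illustration: `IncrementalSketch` is hypothetical, `dedup_key` is only the name proposed in this issue, and whether reading the old property should also warn is left open (the sketch keeps reads silent).

```python
import warnings
from typing import Optional, Sequence, Union

Key = Union[str, Sequence[str]]


class IncrementalSketch:
    """Hypothetical stand-in for Incremental; not dlt's actual class."""

    def __init__(
        self,
        cursor_path: str,
        dedup_key: Optional[Key] = None,
        primary_key: Optional[Key] = None,
    ) -> None:
        self.cursor_path = cursor_path
        self._dedup_key: Optional[Key] = dedup_key
        if primary_key is not None:
            if dedup_key is not None and dedup_key != primary_key:
                raise ValueError("pass either dedup_key or primary_key, not both")
            self.primary_key = primary_key  # goes through the deprecated setter

    @property
    def dedup_key(self) -> Optional[Key]:
        return self._dedup_key

    @dedup_key.setter
    def dedup_key(self, value: Optional[Key]) -> None:
        self._dedup_key = value

    @property
    def primary_key(self) -> Optional[Key]:
        # Deprecated alias: reading still works for backward compatibility.
        return self._dedup_key

    @primary_key.setter
    def primary_key(self, value: Optional[Key]) -> None:
        warnings.warn(
            "primary_key on Incremental is deprecated, use dedup_key",
            DeprecationWarning,
            stacklevel=2,
        )
        self._dedup_key = value


# Old-style construction still works but emits a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    inc = IncrementalSketch("updated_at", primary_key="id")
```

Because both names read and write the same private attribute, propagation in either direction comes for free. A real change would also have to cover the configspec side, which this sketch ignores.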

rudolfix, Jun 12 '25 08:06

I agree that this is an important improvement. I believe incremental_key is clearer than dedup_key.

AFAIU, you have 3 types of keys:

  • primary_key: one or more columns that uniquely identify an entity.
  • incremental_key (currently primary_key on Incremental class): one or more columns used for incremental loading. The tuple (primary_key, incremental_key) uniquely identifies a record / row in a table. Incremental loading implies that you should never have duplicates for (primary_key, incremental_key).
  • merge_key: I don't really understand what it's for. Is it a primary key with a PK constraint set on the destination?

Examples

You have a social media platform where users can like posts. Your event feed table might have

event_id | user_id | event_type | post_id
1        | zilto   | liked      | 10
2        | zilto   | unliked    | 10
3        | zilto   | liked      | 22
4        | zilto   | liked      | 10

  • event_id is the incremental_key. Typically, such keys are non-null and unique on their own, but the hard requirement is uniqueness on (incremental_key, primary_key).
  • If I only want to store the "current state of user-post interactions", I could set primary_key=(user_id, post_id).
  • After these events, I would get only two rows: (3, zilto, liked, 22) and (4, zilto, liked, 10).
  • The incremental_key is necessary to disambiguate (1, zilto, liked, 10) and (4, zilto, liked, 10). Those are two real events, not duplicates.
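
For what it's worth, the "current state" reading of this example can be checked with a few lines of plain Python. This is not how dlt performs a merge. `current_state` is a hypothetical helper, and the sketch assumes the entity key is (user_id, post_id) with event_id used only to order events.

```python
events = [
    {"event_id": 1, "user_id": "zilto", "event_type": "liked", "post_id": 10},
    {"event_id": 2, "user_id": "zilto", "event_type": "unliked", "post_id": 10},
    {"event_id": 3, "user_id": "zilto", "event_type": "liked", "post_id": 22},
    {"event_id": 4, "user_id": "zilto", "event_type": "liked", "post_id": 10},
]


def current_state(rows, primary_key, order_by):
    """Keep only the latest row per primary key, ordered by the given column."""
    latest = {}
    for row in rows:
        key = tuple(row[c] for c in primary_key)
        if key not in latest or row[order_by] > latest[key][order_by]:
            latest[key] = row
    return sorted(latest.values(), key=lambda r: r[order_by])


state = current_state(events, primary_key=("user_id", "post_id"), order_by="event_id")
# state holds the rows for event_id 3 and event_id 4
```

Events 1 and 4 share the same (user_id, post_id) and the same event_type, so without event_id there would be no way to tell them apart or to know which one is the latest.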

zilto, Sep 30 '25 16:09