dlt
Rename confusing "primary_key" on Incremental class
Background
`primary_key` on `Incremental` tells the class which columns it should use for boundary deduplication. Since those columns should identify rows uniquely, they are very often the same columns that the resource uses as its primary key. Still, the function is very different: setting `primary_key` on `Incremental` does not affect the resource settings.
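To make "boundary deduplication" concrete, here is a minimal sketch of the idea, not dlt's actual implementation. It assumes that a source may return rows whose cursor value equals the last stored value again on the next run, and that such rows are skipped by remembering their key columns. The function `filter_new_rows` and all of its parameters are hypothetical names chosen for illustration.

```python
from typing import Any, Dict, List, Optional, Sequence, Set, Tuple

Row = Dict[str, Any]
Key = Tuple[Any, ...]


def filter_new_rows(
    rows: Sequence[Row],
    cursor_column: str,
    dedup_columns: Sequence[str],
    last_value: Optional[Any],
    seen_keys: Set[Key],
) -> Tuple[List[Row], Optional[Any], Set[Key]]:
    """Hypothetical helper: drop rows already loaded in a previous run.

    Returns the fresh rows plus the updated (last_value, seen_keys) state.
    """
    fresh: List[Row] = []
    for row in rows:
        value = row[cursor_column]
        key = tuple(row[c] for c in dedup_columns)
        if last_value is not None:
            if value < last_value:
                continue  # strictly older than the boundary
            if value == last_value and key in seen_keys:
                continue  # on the boundary and already loaded
        fresh.append(row)

    if not fresh:
        return fresh, last_value, seen_keys

    # Only rows sitting on the new boundary need to be remembered.
    new_last = max(row[cursor_column] for row in fresh)
    new_seen = {
        tuple(row[c] for c in dedup_columns)
        for row in fresh
        if row[cursor_column] == new_last
    }
    if new_last == last_value:
        new_seen |= seen_keys
    return fresh, new_last, new_seen


# First run: nothing stored yet.
run1 = [{"id": 1, "updated_at": 10}, {"id": 2, "updated_at": 20}]
_, last, seen = filter_new_rows(run1, "updated_at", ["id"], None, set())

# Second run: the source returns the boundary row (id=2) again.
run2 = [
    {"id": 2, "updated_at": 20},
    {"id": 3, "updated_at": 20},
    {"id": 4, "updated_at": 30},
]
fresh2, last2, seen2 = filter_new_rows(run2, "updated_at", ["id"], last, seen)
```

In this sketch the key columns are used only to recognise rows on the boundary; nothing here touches table-level primary key settings, which is the distinction the issue is drawing.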
Tasks
- [ ] Add a new field to `Incremental` class: ie. `dedup_key` and use it internally, in examples and docs. See how `primary_key` is currently implemented
- [ ] issue deprecation warning ie. in `__init__`, property setters etc. (`Incremental` is also configspec!)
- [ ] propagate `primary_key` to `dedup_key` when set and vice versa! (we must be backward-compat so reading dedup key via `primary_key` prop must still work!)
- [ ] test (3): when using initializer and when setting the property. Make sure that `merge` works and `parse_native_value`
luckily there's very little usage in our code base for this prop
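
The rename-with-backward-compatibility pattern described in the tasks could look roughly like the sketch below. `IncrementalSketch` is a hypothetical stand-in, not the real `Incremental` class, and it ignores the configspec machinery entirely. It assumes that setting the legacy name should warn while reading it stays silent; the issue does not specify that choice.

```python
import warnings
from typing import Optional, Sequence, Union

KeyHint = Union[str, Sequence[str]]


class IncrementalSketch:
    """Hypothetical stand-in for Incremental, for illustration only."""

    def __init__(
        self,
        cursor_path: str,
        dedup_key: Optional[KeyHint] = None,
        primary_key: Optional[KeyHint] = None,  # deprecated alias
    ) -> None:
        self.cursor_path = cursor_path
        self._dedup_key: Optional[KeyHint] = dedup_key
        if primary_key is not None:
            if dedup_key is not None:
                raise ValueError("Pass either dedup_key or primary_key, not both")
            # Go through the setter so the legacy argument warns as well.
            self.primary_key = primary_key

    @property
    def dedup_key(self) -> Optional[KeyHint]:
        return self._dedup_key

    @dedup_key.setter
    def dedup_key(self, value: Optional[KeyHint]) -> None:
        self._dedup_key = value

    @property
    def primary_key(self) -> Optional[KeyHint]:
        # Reading via the old name still works and returns the same value.
        return self._dedup_key

    @primary_key.setter
    def primary_key(self, value: Optional[KeyHint]) -> None:
        warnings.warn(
            "primary_key on Incremental is deprecated, use dedup_key instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self._dedup_key = value


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = IncrementalSketch("updated_at", primary_key="id")

modern = IncrementalSketch("updated_at", dedup_key=("id", "region"))
```

Both names are backed by a single private field, so propagation in either direction is automatic and the two values cannot drift apart.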
I agree that this is an important improvement. I believe incremental_key is clearer than dedup_key.
AFAIU, you have 3 types of keys:
- `primary_key`: one or more columns that uniquely identify an entity.
- `incremental_key` (currently `primary_key` on Incremental class): one or more columns used for incremental loading. The tuple `(primary_key, incremental_key)` uniquely identifies a record / row in a table. Incremental loading implies that you should never have duplicates for `(primary_key, incremental_key)`
- `merge_key`: I don't really understand what they're for. It's a primary key with setting a `PK` constraint on the destination?
Examples
You have a social media platform where users can like posts. Your event feed table might have
event_id | user_id | event_type | post_id
(1, zilto, liked, 10)
(2, zilto, unliked, 10)
(3, zilto, liked, 22)
(4, zilto, liked, 10)
- `event_id` is the `incremental_key`. Typically, those are non-null and unique. At a minimum, rows must be unique on `(incremental_key, primary_key)`.
- If I want to only store the "current state of user-post interactions", I could set `primary_key=(user_id, post_id)`
- After these events, I would get only two rows: `(3, zilto, liked, 22)` and `(4, zilto, liked, 10)`
- The `incremental_key` is necessary to disambiguate `(1, zilto, liked, 10)` and `(4, zilto, liked, 10)`. Those are two real events, not duplicates
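
A small sketch of how the event feed above collapses to the "current state" may help. It is plain Python rather than dlt code, and it assumes the entity key is `(user_id, post_id)` and that a higher `event_id` means a later event. The function name `current_state` is made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# (event_id, user_id, event_type, post_id)
Event = Tuple[int, str, str, int]

events: List[Event] = [
    (1, "zilto", "liked", 10),
    (2, "zilto", "unliked", 10),
    (3, "zilto", "liked", 22),
    (4, "zilto", "liked", 10),
]


def current_state(rows: Iterable[Event]) -> List[Event]:
    """Keep only the latest event per (user_id, post_id)."""
    latest: Dict[Tuple[str, int], Event] = {}
    # Process in event_id order so later events overwrite earlier ones.
    for row in sorted(rows, key=lambda r: r[0]):
        _event_id, user_id, _event_type, post_id = row
        latest[(user_id, post_id)] = row
    return sorted(latest.values(), key=lambda r: r[0])


state = current_state(events)
print(state)
# [(3, 'zilto', 'liked', 22), (4, 'zilto', 'liked', 10)]
```

Events 1 and 4 differ only in `event_id`, so without that column they would look like duplicates even though they are two separate events.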