airbyte
Normalization creates duplicate rows in the properties table (Source Hubspot)
Environment
- Airbyte version: 0.40.18
- OS Version / Instance: Ubuntu on AWS EC2
- Deployment: Docker
- Source Connector and version: Hubspot (v0.2.2)
- Destination Connector and version: Snowflake (v0.4.38)
- Step where error happened: Sync Job (Incremental | Deduped + history)
Current Behavior
Running the sync for the company stream, I get no duplicates in companies and companies_scd, but I do get duplicate rows for multiple records in companies_properties. I've also tried resetting the data.
Expected Behavior
Every record in companies should have just one record in companies_properties
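For anyone who wants to confirm the same symptom, here is a minimal sketch of the check, assuming the rows of companies_properties have already been fetched into a list of dicts. The column name _airbyte_companies_hashid is an assumption about how the child table points back to its parent row, and the sample rows are made up for illustration.

```python
from collections import Counter


def find_duplicate_parents(rows, parent_key="_airbyte_companies_hashid"):
    """Return the parent ids that own more than one row in the child table."""
    counts = Counter(row[parent_key] for row in rows)
    return sorted(pid for pid, n in counts.items() if n > 1)


# Hypothetical rows standing in for a query result from companies_properties.
sample_rows = [
    {"_airbyte_companies_hashid": "a1", "name": "Acme"},
    {"_airbyte_companies_hashid": "a1", "name": "Acme"},
    {"_airbyte_companies_hashid": "b2", "name": "Globex"},
]

duplicates = find_duplicate_parents(sample_rows)
print(duplicates)  # parent ids that violate the one-row-per-company expectation
```

An empty result would mean every company has exactly one properties row, which is the expected behavior described above.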
Logs
Nothing is failing. e7d621ab_9b7c_45c6_b54f_b69ccfa777de_logs_1744_txt.txt
Steps to Reproduce
- Create a connection from Hubspot to Snowflake
- Set companies to incremental | deduped + history
Are you willing to submit a PR?
Tried to look into the connector's code myself but I couldn't understand which part is related.
Thanks for the issue @mrhallak, can you post the logs?
@sajarin updated the issue above
This is most likely caused by a bug in normalization
Related OC issue that has been open for several months: https://github.com/airbytehq/oncall/issues/878
Another related issue: https://github.com/airbytehq/airbyte/issues/9465
Hey @grishick the first link doesn't work, can you please repost it?
@mrhallak sorry, that's a link to an internal repo
@sajarin @grishick This also happens with other sources (e.g., Square)
Related slack conversation: https://airbytehq.slack.com/archives/C021JANJ6TY/p1675644634175249
{
  "id": "abc-123", // this is set up as the primary key in our source
  "line_items": [
    {
      "uid": "cde-234", // can we set a nested value as a primary key as well?
      // other fields
    }
  ],
  // other data
}
We have id configured as the primary key for the orders stream, so the update takes place correctly. But since we do not have uid configured for the nested data, i.e. line_items, it creates redundant line_items rows whenever orders get updated during an incremental sync.
The workaround for now is to periodically run cleanup jobs on the orphaned line_items.
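A rough sketch of what such a cleanup could look like, written in plain Python rather than as a warehouse query. It assumes each nested row carries the uid from the payload above plus an _airbyte_emitted_at timestamp in ISO 8601 form (so string comparison orders them correctly); the function name, the quantity field and the sample values are hypothetical.

```python
def latest_per_key(rows, key="uid", ordering="_airbyte_emitted_at"):
    """Keep only the most recently emitted row for each key value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # Replace the stored row only when this one was emitted later.
        if current is None or row[ordering] > current[ordering]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical line_items rows: "cde-234" was written twice because its
# parent order was updated between two incremental syncs.
line_items = [
    {"uid": "cde-234", "quantity": 1, "_airbyte_emitted_at": "2023-02-01T10:00:00Z"},
    {"uid": "cde-234", "quantity": 2, "_airbyte_emitted_at": "2023-02-05T10:00:00Z"},
    {"uid": "fgh-567", "quantity": 5, "_airbyte_emitted_at": "2023-02-01T10:00:00Z"},
]

cleaned = latest_per_key(line_items)
print(cleaned)
```

This only illustrates the idea of keeping one row per nested key; it is not how normalization itself handles nested tables.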
It would be great if we could identify the primary keys of nested data in the source itself, similar to what we have for streams.
This PR has the fix: https://github.com/airbytehq/airbyte/pull/22381