Normalization creates duplicate rows in the properties table (Source Hubspot)

Open mrhallak opened this issue 2 years ago • 7 comments

Environment

  • Airbyte version: 0.40.18
  • OS Version / Instance: Ubuntu on AWS EC2
  • Deployment: Docker
  • Source Connector and version: Hubspot (v0.2.2)
  • Destination Connector and version: Snowflake (v0.4.38)
  • Step where error happened: Sync Job (Incremental | Deduped + history)

Current Behavior

Running the sync for the company stream, I get no duplicates in companies and companies_scd but I do get duplicate rows for multiple records in companies_properties. I've also tried resetting the data.

Expected Behavior

Every record in companies should have just one record in companies_properties
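
One way to confirm the symptom is to count child rows per parent record. The sketch below is a minimal, hypothetical check over rows already fetched from companies_properties. The column name `_airbyte_companies_hashid` is an assumption about how normalization links the nested table to its parent; substitute whatever parent-key column your schema actually has.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Mapping

# Assumed name of the column linking a properties row to its parent company.
PARENT_KEY = "_airbyte_companies_hashid"


def find_duplicated_parents(
    rows: Iterable[Mapping[str, Hashable]],
    parent_key: str = PARENT_KEY,
) -> Dict[Hashable, int]:
    """Return {parent_id: row_count} for every parent with more than one child row."""
    counts = Counter(row[parent_key] for row in rows)
    return {parent: n for parent, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [
        {PARENT_KEY: "a1", "name": "Acme"},
        {PARENT_KEY: "a1", "name": "Acme"},
        {PARENT_KEY: "b2", "name": "Globex"},
    ]
    print(find_duplicated_parents(sample))
```

An empty result means every parent has at most one properties row, which is the expected behavior described above.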

Logs

Nothing is failing. e7d621ab_9b7c_45c6_b54f_b69ccfa777de_logs_1744_txt.txt

Steps to Reproduce

  1. Create a connection from Hubspot to Snowflake
  2. Set companies to incremental | deduped + history

Are you willing to submit a PR?

I tried to look into the connector's code myself, but I couldn't work out which part is relevant.

mrhallak avatar Nov 09 '22 10:11 mrhallak

Thanks for the issue @mrhallak, can you post the logs?

sajarin avatar Nov 09 '22 15:11 sajarin

@sajarin updated the issue above

mrhallak avatar Nov 09 '22 16:11 mrhallak

This is most likely caused by a bug in normalization

grishick avatar Dec 30 '22 19:12 grishick

Related OC issue that has been open for several months: https://github.com/airbytehq/oncall/issues/878

grishick avatar Dec 30 '22 19:12 grishick

Another related issue: https://github.com/airbytehq/airbyte/issues/9465

grishick avatar Dec 30 '22 20:12 grishick

Hey @grishick the first link doesn't work, can you please repost it?

mrhallak avatar Jan 27 '23 15:01 mrhallak

@mrhallak sorry, that's a link to an internal repo

grishick avatar Jan 27 '23 18:01 grishick

@sajarin @grishick This also happens with other sources (e.g., Square).

Related slack conversation: https://airbytehq.slack.com/archives/C021JANJ6TY/p1675644634175249

{
   "id": "abc-123", // this is set up as the primary key in our source
   "line_items": [
      {
         "uid": "cde-234", // can we set nested value as primary key as well?
         // other fields 
      }
   ],
   // other data
}

We have id configured as the primary key for the orders stream, so updates to orders are applied correctly. But since uid is not configured as a key for the nested data (line_items), redundant line_items rows are created whenever an order is updated during an incremental sync.

The current workaround is to periodically run cleanup jobs on the orphaned line_items.

It would be great if we could declare primary keys for nested data in the source itself, similar to what we already have for streams.
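
The cleanup workaround described above could be sketched roughly as follows. This is only an illustration under stated assumptions: line_items rows are represented as plain dicts, `uid` is the nested key from the example payload, and `emitted_at` is a hypothetical column standing in for whatever per-row timestamp your destination records. It keeps the newest row per uid and returns the rest as orphans to delete.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def split_latest_and_orphans(
    rows: Iterable[Row],
    key: str = "uid",          # nested key from the example payload
    ts: str = "emitted_at",    # hypothetical timestamp column (assumption)
) -> Tuple[List[Row], List[Row]]:
    """Keep the newest row per key; everything older is returned as an orphan."""
    latest: Dict[Any, Row] = {}
    orphans: List[Row] = []
    for row in rows:
        current = latest.get(row[key])
        if current is None:
            latest[row[key]] = row
        elif row[ts] > current[ts]:
            # A newer version arrived: the previously kept row becomes an orphan.
            orphans.append(current)
            latest[row[key]] = row
        else:
            orphans.append(row)
    return list(latest.values()), orphans


if __name__ == "__main__":
    # Made-up rows: the same line item emitted by two successive syncs.
    rows = [
        {"uid": "cde-234", "emitted_at": 1, "quantity": 1},
        {"uid": "cde-234", "emitted_at": 2, "quantity": 3},
    ]
    keep, orphans = split_latest_and_orphans(rows)
    print("keep:", keep)
    print("delete:", orphans)
```

In a warehouse, the same idea would typically be expressed as a delete of every row that is not the latest for its uid; the Python version here only shows the selection logic.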

sabbiu avatar Feb 07 '23 00:02 sabbiu

This PR has the fix: https://github.com/airbytehq/airbyte/pull/22381

grishick avatar Feb 07 '23 01:02 grishick