airbyte icon indicating copy to clipboard operation
airbyte copied to clipboard

✨ [source-hubspot] Implement RFR for contacts_form_submissions, contact_list_memberships, contact_merged_audit

Open brianjlai opened this issue 9 months ago • 4 comments

Closes https://github.com/airbytehq/airbyte-internal-issues/issues/7047

What

There are 5 Hubspot streams that only support full refresh. I've implemented 3 of them:

  • contacts_form_submissions
  • contact_list_memberships
  • contact_merged_audit

I did not implement the others because:

  • email_subscriptions: This does not support any form of pagination and is therefore not RFR compatible
  • marketing_emails: As per the docs, this supports filtering on updated or created and should just be converted to incremental which is better.

How

Bumped the version to 0.87.0 which is the first version to have the RFR Python interfaces

To support RFR as per the design, the streams in question need to implement a version of read_records() which reads the prior page from the stream_slice, reads one page at a time, and updates state w/ the next_page_token.

The Hubspot streams being implemented use a form of cursor pagination where the response contains a pointer to the next contact (specified by the vid). And that vid is passed to the next request using the vidOffset parameter.

All of the full refresh streams derive from the same concept of fetching Contacts records and extracting a specific attribute list and returning them as records so I consolidated the implementation in the ContactsAllBase class.

I also added some mock server tests for each.

Review guide

  1. source_hubspot/streams.py
  2. honestly the rest is just tests

User Impact

Theoretically none, I did refactor contact_merge_audit to use the same flow as the others which was just cleaner, but now we should start seeing checkpoint messages emitted per-page during syncs

Can this PR be safely reverted and rolled back?

No db or config migrations

  • [x] YES 💚
  • [ ] NO ❌

brianjlai avatar May 07 '24 21:05 brianjlai

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
airbyte-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback May 24, 2024 9:39pm

vercel[bot] avatar May 07 '24 21:05 vercel[bot]

Live tests ✅ against a connection: https://github.com/airbytehq/airbyte/actions/runs/9024913362/job/24799667426

Also tested in my own personal workspace for the streams mentioned

brianjlai avatar May 10 '24 01:05 brianjlai

I'll check the PR tomorrow but note that contact_list_memberships will be move to semi-incremental next week

maxi297 avatar May 10 '24 03:05 maxi297

@maxi297 sorry for the delay, switched back to low-code RFR, but back now. to answer some of your questions

Duplication of the logic: do you think we can do something about this? I moved the common logic into one place that both page and pages uses

What happens with contacts_list_memberships that will be semi-incremental? As per our discussion on Tuesday, semi-incremental and RFR won't work with each other for the moment and we don't currently have a strong need for them to right now. We can always revisit this assumption, but it would be a CDK change

can we move most of the logic to the CDK? In theory, as long as next_page_token is exposed, we should be fine. I think the problem in the HubSpot case is that it re-implements read_records but as long as it calls for _read_page, we should be able to just have the change in the CDK. The reasoning is that it feels like it would build more leverage but maybe this is just an impression that needs to be confirmed

I am currently investigating this as an option and seeing how we can get this "for free" for more connectors. My initial thought is that we can't for Hubspot because while it does implement HttpStream.next_page_token, it doesn't actually use the default HttpStream.read_records() which is where in the CDK I would try to add the automatic RFR logic. I will investigate this more as I try to vet the hypothesis. If we can get for free I can always clean out Hubspot, but I'd like to try to get RFR out for Hubspot. The RFR for free implementation if feasible would still be a few weeks out

Otherwise I made most of the other PR suggestions and incorporated your latest set of PRs which adjusted my mock server test implementations to be more in line w/ the review comments

brianjlai avatar May 23 '24 18:05 brianjlai

hubspot_live_test

passing live test run

brianjlai avatar May 25 '24 00:05 brianjlai