airbyte
airbyte copied to clipboard
✨ [source-hubspot] Implement RFR for contacts_form_submissions, contact_list_memberships, contact_merged_audit
Closes https://github.com/airbytehq/airbyte-internal-issues/issues/7047
What
There are 5 Hubspot streams that only support full refresh. I've implemented 3 of them:
- contacts_form_submissions
- contact_list_memberships
- contact_merged_audit
I did not implement the others because:
-
email_subscriptions
: This does not support any form of pagination and is therefore not RFR compatible -
marketing_emails
: As per the docs, this supports filtering onupdated
orcreated
and should just be converted to incremental which is better.
How
Bumped the version to 0.87.0
which is the first version to have the RFR Python interfaces
To support RFR as per the design, the streams in question need to implement a version of read_records()
which reads the prior page from the stream_slice
, reads one page at a time, and updates state w/ the next_page_token
.
The Hubspot streams being implemented use a form of cursor pagination where the response contains a pointer to the next contact (specified by the vid
). And that vid
is passed to the next request using the vidOffset
parameter.
All of the full refresh streams derive from the same concept of fetching Contacts records and extracting a specific attribute list and returning them as records so I consolidated the implementation in the ContactsAllBase
class.
I also added some mock server tests for each.
Review guide
-
source_hubspot/streams.py
- honestly the rest is just tests
User Impact
Theoretically none, I did refactor contact_merge_audit
to use the same flow as the others which was just cleaner, but now we should start seeing checkpoint messages emitted per-page during syncs
Can this PR be safely reverted and rolled back?
No db or config migrations
- [x] YES 💚
- [ ] NO ❌
The latest updates on your projects. Learn more about Vercel for Git ↗︎
Name | Status | Preview | Comments | Updated (UTC) |
---|---|---|---|---|
airbyte-docs | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | May 24, 2024 9:39pm |
Live tests ✅ against a connection: https://github.com/airbytehq/airbyte/actions/runs/9024913362/job/24799667426
Also tested in my own personal workspace for the streams mentioned
I'll check the PR tomorrow but note that contact_list_memberships
will be move to semi-incremental next week
@maxi297 sorry for the delay, switched back to low-code RFR, but back now. to answer some of your questions
Duplication of the logic: do you think we can do something about this? I moved the common logic into one place that both page and pages uses
What happens with contacts_list_memberships that will be semi-incremental? As per our discussion on Tuesday, semi-incremental and RFR won't work with each other for the moment and we don't currently have a strong need for them to right now. We can always revisit this assumption, but it would be a CDK change
can we move most of the logic to the CDK? In theory, as long as next_page_token is exposed, we should be fine. I think the problem in the HubSpot case is that it re-implements read_records but as long as it calls for _read_page, we should be able to just have the change in the CDK. The reasoning is that it feels like it would build more leverage but maybe this is just an impression that needs to be confirmed
I am currently investigating this as an option and seeing how we can get this "for free" for more connectors. My initial thought is that we can't for Hubspot because while it does implement HttpStream.next_page_token
, it doesn't actually use the default HttpStream.read_records()
which is where in the CDK I would try to add the automatic RFR logic. I will investigate this more as I try to vet the hypothesis. If we can get for free I can always clean out Hubspot, but I'd like to try to get RFR out for Hubspot. The RFR for free implementation if feasible would still be a few weeks out
Otherwise I made most of the other PR suggestions and incorporated your latest set of PRs which adjusted my mock server test implementations to be more in line w/ the review comments
passing live test run