[source-google-ads]: Concurrent CDK and performance fixes
This PR introduces following changes:
- Refactors
SourceGoogleAdsto use Concurrent CDK #36852
The latest updates on your projects. Learn more about Vercel for Git ↗︎
1 Skipped Deployment
| Name | Status | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| airbyte-docs | ⬜️ Ignored (Inspect) | Visit Preview | Jul 15, 2024 10:35am |
@natikgadzhi can extensibility help reviewing this change?
Sure. @nurikk, thank you for putting this together! Concurrent CDK is new, and we're careful about it. And google analytics is a popular source, so we want to get this right.
Ideally, I would love to get a review from @lazebnyi, @clnoll, and maybe @bazarnov? Let's use the live testing tool on this and see if concurrent approach works well, is fast, and doesn't drop records.
@nurikk thanks for submitting this! In your PR description, should "2. Fixes connection validation loop" link to a different issue? I want to make sure I have the full context.
@nurikk thanks for submitting this! In your PR description, should "2. Fixes connection validation loop" link to a different issue? I want to make sure I have the full context.
oh yes, thanks for spotting! I've updated description to link the correct issue
upd: Fixes connection validation loop was already merged into master, so I remove this from checklist https://github.com/airbytehq/airbyte/pull/36891/files
Hi @clnoll, thanks for review! I've addressed all your comments. Now this PR is only about concurrent CDK
@katmarkham this seems like a really good work, but we're out of capacity to push this over the finish line. It's prob up to you and @maxi297 to assign to and complete.
@natikgadzhi I'll check this this week
Here are some thoughts on the current situation.
Summary
Not all streams can be easily moved to concurrent.
IncrementalGoogleAdsStream
I think nurikk did a pretty good job of migrating this to concurrency. For now, I'll highlight some points. I can QA a bit more tomorrow to identify edge cases.
State by parent id
We currently don't support state by parent. In this case, the parents are the customers. This was addressed by nurikk here and here. We are moving to a model where the state converter should do very little (just the conversion of the types from state to cursor value and vice-versa) but for now we should be fine with this implementation.
Checkpoint interval
This feature is not supported in the Concurrent CDK but I'm fine with removing this to support concurrency. This would be better if we were to have partitioned state but this would be a breaking change though. This hasn't been addressed as part of this PR but I wouldn't make this a blocker. We can evaluate empirically in prod if this is generating too much flakiness.
IncrementalEventsStream
Slices are not json serializable
The slices contain sets for updated_ids and deleted_ids which are not serializable.
We can either:
- use a custom encoder to support sets in slices in the CDK:
class SetEncoder(json.JSONEncoder):
def default(self, obj):
if isinstance(obj, set):
return list(obj)
return json.JSONEncoder.default(self, obj)
... and use it in StreamPartition.__hash__
- change the slices to have lists instead of sets
State multiple formats
Ultimately, I think this is the point that makes if very hard to migrate to concurrency. I don't have a good grasp of the possible state format but based on this, the state can have two levels of parent i.e. parent stream name and customer_id or be flat if the state is not by customer id. I don't have enough knowledge to know if both are actually used and if it is the case, explain why.
Proposition
To unblock this PR
Remove IncrementalEventsStreams from concurrency
Those streams are a bit more complex and I don't have a good grasp of the state to propose a solution just yet. It shouldn't prevent us from adding IncrementalGoogleAdsStream as concurrent streams which nurikk would benefit from immediately for IncrementalCustomQuery.
Update to the newest CDK version
It is important to update the version of the airbyte_cdk because if there is an issue in prod that requires us to update the CDK we would be blocked as there were breaking changes since the last release ; we added the slice generation logic in the ConcurrentCursor and pushed some logic from the state adapters to the cursor. For this specific PR, it means providing end_provider to the cursor even though it won't be used yet and rename the method convert_to_sequential_state to convert_to_state_message. We don't need to block the PR for not using the slice generation from the Concurrent CDK.
Long term
Generalize the logic
In order for other sources to leverage this, we should add the parent cursor mechanism to the Concurrent CDK. This is a very common pattern that will block the migration of multiple connectors for concurrency.
Improve resiliency
The newest CDK version allows for generating partitioned state which would compensate for the removal of the checkpoint interval. The change should be straighforward and we should be able to do this by enabling partitioned state using the feature flag
This will be a breaking change as the format of the state will be different. However, this change is backward compatible so we would not need the opt-in mechanism.
Poking @girarda to see if this makes sense on the Extensibility and API Sources side.
@maxi297 I agree with everything you said except I think it should be possible to migrate to partitioned state without a breaking change - although we might still want to introduce a breaking change to stop supporting the multiple state formats you mention. This can be done separately from this PR so no need to block on this.
Adding notes from our grooming:
- There might be some risk about having a low rate limit for Google Ads. We should make sure we support users that have different rate limits
- @maxi297 to validate if we hit the rate limit today: The rate limit is 15.000 requests in a 24 hours sliding window based on this and this. Upgrading to standard access removes this limit
- @maxi297 to validate client side rate limiting for gRPC/Google Ads: the client side rate limiting of the Concurrent CDK can't be re-used there as they rely on the
requestslibrary and source-google-ads is using gRPC. We can probably handle this as transient issues and retry periodically
- Just to clarify: the breaking change is not on the source level. We just need to update the CDK so that devs don't have to deal with surprise changes
Hi @maxi297 ! I've made required changes to unblock this PR. Please help me with review