airbyte icon indicating copy to clipboard operation
airbyte copied to clipboard

[source-google-ads]: Concurrent CDK and performance fixes

Open nurikk opened this issue 1 year ago • 6 comments

This PR introduces following changes:

  1. Refactors SourceGoogleAds to use Concurrent CDK #36852

nurikk avatar Apr 05 '24 08:04 nurikk

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Jul 15, 2024 10:35am

vercel[bot] avatar Apr 05 '24 08:04 vercel[bot]

@natikgadzhi can extensibility help reviewing this change?

marcosmarxm avatar Apr 08 '24 17:04 marcosmarxm

Sure. @nurikk, thank you for putting this together! Concurrent CDK is new, and we're careful about it. And google analytics is a popular source, so we want to get this right.

Ideally, I would love to get a review from @lazebnyi, @clnoll, and maybe @bazarnov? Let's use the live testing tool on this and see if concurrent approach works well, is fast, and doesn't drop records.

natikgadzhi avatar Apr 08 '24 19:04 natikgadzhi

@nurikk thanks for submitting this! In your PR description, should "2. Fixes connection validation loop" link to a different issue? I want to make sure I have the full context.

clnoll avatar Apr 12 '24 11:04 clnoll

@nurikk thanks for submitting this! In your PR description, should "2. Fixes connection validation loop" link to a different issue? I want to make sure I have the full context.

oh yes, thanks for spotting! I've updated description to link the correct issue

upd: Fixes connection validation loop was already merged into master, so I remove this from checklist https://github.com/airbytehq/airbyte/pull/36891/files

nurikk avatar Apr 22 '24 13:04 nurikk

Hi @clnoll, thanks for review! I've addressed all your comments. Now this PR is only about concurrent CDK

nurikk avatar Apr 30 '24 09:04 nurikk

@katmarkham this seems like a really good work, but we're out of capacity to push this over the finish line. It's prob up to you and @maxi297 to assign to and complete.

natikgadzhi avatar May 22 '24 15:05 natikgadzhi

@natikgadzhi I'll check this this week

maxi297 avatar May 22 '24 15:05 maxi297

Here are some thoughts on the current situation.

Summary

Not all streams can be easily moved to concurrent.

IncrementalGoogleAdsStream

I think nurikk did a pretty good job of migrating this to concurrency. For now, I'll highlight some points. I can QA a bit more tomorrow to identify edge cases.

State by parent id

We currently don't support state by parent. In this case, the parents are the customers. This was addressed by nurikk here and here. We are moving to a model where the state converter should do very little (just the conversion of the types from state to cursor value and vice-versa) but for now we should be fine with this implementation.

Checkpoint interval

This feature is not supported in the Concurrent CDK but I'm fine with removing this to support concurrency. This would be better if we were to have partitioned state but this would be a breaking change though. This hasn't been addressed as part of this PR but I wouldn't make this a blocker. We can evaluate empirically in prod if this is generating too much flakiness.

IncrementalEventsStream

Slices are not json serializable

The slices contain sets for updated_ids and deleted_ids which are not serializable.

We can either:

  • use a custom encoder to support sets in slices in the CDK:
class SetEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, set):
            return list(obj)
        return json.JSONEncoder.default(self, obj)

... and use it in StreamPartition.__hash__

  • change the slices to have lists instead of sets

State multiple formats

Ultimately, I think this is the point that makes if very hard to migrate to concurrency. I don't have a good grasp of the possible state format but based on this, the state can have two levels of parent i.e. parent stream name and customer_id or be flat if the state is not by customer id. I don't have enough knowledge to know if both are actually used and if it is the case, explain why.

Proposition

To unblock this PR

Remove IncrementalEventsStreams from concurrency

Those streams are a bit more complex and I don't have a good grasp of the state to propose a solution just yet. It shouldn't prevent us from adding IncrementalGoogleAdsStream as concurrent streams which nurikk would benefit from immediately for IncrementalCustomQuery.

Update to the newest CDK version

It is important to update the version of the airbyte_cdk because if there is an issue in prod that requires us to update the CDK we would be blocked as there were breaking changes since the last release ; we added the slice generation logic in the ConcurrentCursor and pushed some logic from the state adapters to the cursor. For this specific PR, it means providing end_provider to the cursor even though it won't be used yet and rename the method convert_to_sequential_state to convert_to_state_message. We don't need to block the PR for not using the slice generation from the Concurrent CDK.

Long term

Generalize the logic

In order for other sources to leverage this, we should add the parent cursor mechanism to the Concurrent CDK. This is a very common pattern that will block the migration of multiple connectors for concurrency.

Improve resiliency

The newest CDK version allows for generating partitioned state which would compensate for the removal of the checkpoint interval. The change should be straighforward and we should be able to do this by enabling partitioned state using the feature flag

This will be a breaking change as the format of the state will be different. However, this change is backward compatible so we would not need the opt-in mechanism.

Poking @girarda to see if this makes sense on the Extensibility and API Sources side.

maxi297 avatar May 22 '24 20:05 maxi297

@maxi297 I agree with everything you said except I think it should be possible to migrate to partitioned state without a breaking change - although we might still want to introduce a breaking change to stop supporting the multiple state formats you mention. This can be done separately from this PR so no need to block on this.

girarda avatar May 22 '24 21:05 girarda

Adding notes from our grooming:

  • There might be some risk about having a low rate limit for Google Ads. We should make sure we support users that have different rate limits
    • @maxi297 to validate if we hit the rate limit today: The rate limit is 15.000 requests in a 24 hours sliding window based on this and this. Upgrading to standard access removes this limit
    • @maxi297 to validate client side rate limiting for gRPC/Google Ads: the client side rate limiting of the Concurrent CDK can't be re-used there as they rely on the requests library and source-google-ads is using gRPC. We can probably handle this as transient issues and retry periodically
  • Just to clarify: the breaking change is not on the source level. We just need to update the CDK so that devs don't have to deal with surprise changes

maxi297 avatar May 28 '24 16:05 maxi297

Hi @maxi297 ! I've made required changes to unblock this PR. Please help me with review

nurikk avatar Jul 15 '24 10:07 nurikk