beam icon indicating copy to clipboard operation
beam copied to clipboard

sdks/python: enrich data with CloudSQL

Open mohamedawnallah opened this issue 9 months ago • 37 comments

Description

  • [x] Add CloudSQLEnrichmentHandler
  • [x] Unit Test
  • [x] Integration Test
  • [x] Update Enrichment Docs

Closes #30773.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • [x] Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • [ ] Update CHANGES.md with noteworthy changes.
  • [ ] If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels Python tests Java tests Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

mohamedawnallah avatar Mar 24 '25 05:03 mohamedawnallah

Hi @claudevdm,

I think the PR is now ready for review. I’ve run all test cases locally, including the integration tests, and they pass since I’m using a PostgreSQL Cloud SQL managed instance in my Google Cloud for testing.

However, the integration tests currently fail in the CI pipeline because there’s no Cloud SQL testing instance to connect to. It seems like this may require additional setup or administration.

How am I supposed to move this PR forward?

Thanks!

mohamedawnallah avatar Mar 25 '25 23:03 mohamedawnallah

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

github-actions[bot] avatar Mar 26 '25 00:03 github-actions[bot]

assign set of reviewers

mohamedawnallah avatar Mar 26 '25 00:03 mohamedawnallah

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @liferoad for label python. R: @damccorm for label website.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

github-actions[bot] avatar Mar 26 '25 00:03 github-actions[bot]

Thank you for your contribution to the Cloud SQL Enrichment Handler. I'd like to share some thoughts about potential improvements:

  • Query Flexibility: We should consider expanding the interface to allow users to:

    1. Select specific columns rather than always returning all columns
    2. Apply multiple filtering conditions (e.g., filtering on both customer_id AND region)
    3. Support more complex query conditions beyond simple equality checks
  • SQLAlchemy Integration: Given that we're supporting three different database types (MySQL, PostgreSQL, and SQL Server), leveraging SQLAlchemy would:

    1. Provide database-agnostic query building
    2. Handle dialect-specific syntax differences automatically
    3. Offer better protection against SQL injection vulnerabilities
  • Batch Processing Support: For performance optimization, we should implement batched queries similar to the BigQuery handler https://github.com/apache/beam/blob/6426f72a59c00ea1561250af9a3594340a5ee373/sdks/python/apache_beam/transforms/enrichment_handlers/bigquery.py#L180

Could @liferoad and @svetakvsundhar share their thoughts on the user interface design and SQLAlchemy integration?

claudevdm avatar Mar 26 '25 14:03 claudevdm

Reminder, please take a look at this pr: @liferoad @damccorm

github-actions[bot] avatar Apr 03 '25 12:04 github-actions[bot]

waiting on author

damccorm avatar Apr 03 '25 13:04 damccorm

Installed distributions Trying to resolve the latest version from remote

It seems there are issues with the CI. Close and reopen this PR to re-trigger it.

mohamedawnallah avatar Apr 13 '25 01:04 mohamedawnallah

It looks like all CI tests passed. I would love to receive any follow-up feedback :)

cc: @claudevdm, @damccorm

mohamedawnallah avatar Apr 14 '25 10:04 mohamedawnallah

Reminder, please take a look at this pr: @liferoad @damccorm

github-actions[bot] avatar Apr 21 '25 12:04 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @tvalentyn for label python. R: @melap for label website.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions[bot] avatar Apr 23 '25 12:04 github-actions[bot]

It looks like all CI tests passed. I would love to receive any follow-up feedback :)

cc: @claudevdm, @damccorm

Sorry for the delay, I will take a look this week.

claudevdm avatar Apr 23 '25 12:04 claudevdm

Reminder, please take a look at this pr: @tvalentyn @melap

github-actions[bot] avatar May 01 '25 12:05 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @liferoad for label python. R: @damccorm for label website.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions[bot] avatar May 05 '25 12:05 github-actions[bot]

waiting on author

damccorm avatar May 06 '25 13:05 damccorm

waiting on author

Hey @damccorm, I have already addressed @claudevdm feedback. I think it is ready for review :)

mohamedawnallah avatar May 06 '25 15:05 mohamedawnallah

waiting on author

Hey @damccorm, I have already addressed @claudevdm feedback. I think it is ready for review :)

I am going to try set up a cloudsql instance so we can write some integration tests.

claudevdm avatar May 06 '25 15:05 claudevdm

I am going to try set up a cloudsql instance so we can write some integration tests.

@claudevdm Thank you for your guidance 🙏

I’m curious if it’s necessary to use a Cloud SQL instance for our current integration test cases. At the moment, I’ve been using the PostgreSQL Testcontainer for integration testing. My understanding is that our goal is to ensure the enrichment handler works with Cloud SQL or any similar SQL adapter. If the integration tests pass with the PostgreSQL container, I would expect them to pass with Cloud SQL as well—unless there’s an issue specific to Cloud SQL, in which case we would actually be testing Cloud SQL itself.

Could you please let me know if I’m missing something or if you’d like me to approach this differently?

mohamedawnallah avatar May 11 '25 15:05 mohamedawnallah

Reminder, please take a look at this pr: @liferoad @damccorm

github-actions[bot] avatar May 19 '25 12:05 github-actions[bot]

Run Python_ML PreCommit 3.10

mohamedawnallah avatar May 22 '25 01:05 mohamedawnallah

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @tvalentyn for label python. R: @melap for label website.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions[bot] avatar May 26 '25 12:05 github-actions[bot]

Hey @claudevdm, Have you got time to review this PR? I have addressed your feedback and I think it is almost there :)

mohamedawnallah avatar May 26 '25 16:05 mohamedawnallah

Run Portable_Python PreCommit 3.9

mohamedawnallah avatar May 28 '25 22:05 mohamedawnallah

ERROR: Failed building wheel for pymssql

I've addressed claudevdm's last feedback and it works locally. Though I have some troubles installing pymssql adapter for microsoft sql server in the docker containers for testing. I will try to get to it EOD tomorrow. This is may be a related issue I need to investigate more https://github.com/pymssql/pymssql/issues/826.

PS: Sorry for force-push here As far as I could remember there was kind of merge conflict and I did rebasing which updated the timestamps of the git commits (I could have done merging instead). I wanna also mention that I haven't updated the content of previous commits. Just addressed the third round of feedback in the last new commit sdks/python: address claudevdm feedback (3).

mohamedawnallah avatar May 29 '25 13:05 mohamedawnallah

A bunch of tests are failing with

cls = <class 'sqlalchemy.dialects.mysql.pymysql.MySQLDialect_pymysql'>

    @classmethod
    def import_dbapi(cls):
>       return __import__("pymysql")
E       ModuleNotFoundError: No module named 'pymysql'

claudevdm avatar Jun 03 '25 16:06 claudevdm

Run Python PreCommit 3.10

mohamedawnallah avatar Jun 05 '25 20:06 mohamedawnallah

The test test_retry_on_throttling failed. This test likely verifies the retry mechanism on throttling events from Google Cloud Storage (GCS) operations.

Run Python PreCommit 3.9

mohamedawnallah avatar Jun 05 '25 20:06 mohamedawnallah

The CI tests currently fail with [gw1] [ 21%] FAILED apache_beam/io/gcp/gcsio_retry_test.py::TestGCSIORetry::test_retry_on_throttling. Closing and opening PR to trigger them all instead of triggering them manually

mohamedawnallah avatar Jun 05 '25 21:06 mohamedawnallah

=================================== FAILURES =================================== ___________________ TestGCSIORetry.test_retry_on_throttling ____________________ [gw4] linux -- Python 3.9.22 /runner/_work/beam/beam/sdks/python/test-suites/tox/py39/build/srcs/sdks/python/target/.tox-py39-cloud/py39-cloud/bin/python

~~I've fixed the test cases related to the pymysql and pymssql. Now there are other test cases failing that don't seem related to my changes.~~

~~The test_retry_on_throttling fails, which suggests that It isn't properly handling scenarios where GCS temporarily rejects requests because It has exceeded the allowed request rate.~~

~~I have run those test cases multiple times to make sure they are not flaky test cases, but they are still failing consistently.~~

~~How am I supposed to move this PR forward?~~

~~cc: @claudevdm~~

mohamedawnallah avatar Jun 05 '25 23:06 mohamedawnallah

All CI tests have finally passed 😁 I think the PR is now ready for review 🙂

mohamedawnallah avatar Jun 06 '25 19:06 mohamedawnallah