sdks/python: enrich data with CloudSQL
Description
- [x] Add
CloudSQLEnrichmentHandler - [x] Unit Test
- [x] Integration Test
- [x] Update Enrichment Docs
Closes #30773.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- [x] Mention the appropriate issue in your description (for example:
addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead. - [ ] Update
CHANGES.mdwith noteworthy changes. - [ ] If this contribution is large, please file an Apache Individual Contributor License Agreement.
See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.
Hi @claudevdm,
I think the PR is now ready for review. I’ve run all test cases locally, including the integration tests, and they pass since I’m using a PostgreSQL Cloud SQL managed instance in my Google Cloud for testing.
However, the integration tests currently fail in the CI pipeline because there’s no Cloud SQL testing instance to connect to. It seems like this may require additional setup or administration.
How am I supposed to move this PR forward?
Thanks!
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers
assign set of reviewers
Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:
R: @liferoad for label python. R: @damccorm for label website.
Available commands:
stop reviewer notifications- opt out of the automated review toolingremind me after tests pass- tag the comment author after tests passwaiting on author- shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
The PR bot will only process comments in the main thread (not review comments).
Thank you for your contribution to the Cloud SQL Enrichment Handler. I'd like to share some thoughts about potential improvements:
-
Query Flexibility: We should consider expanding the interface to allow users to:
- Select specific columns rather than always returning all columns
- Apply multiple filtering conditions (e.g., filtering on both customer_id AND region)
- Support more complex query conditions beyond simple equality checks
-
SQLAlchemy Integration: Given that we're supporting three different database types (MySQL, PostgreSQL, and SQL Server), leveraging SQLAlchemy would:
- Provide database-agnostic query building
- Handle dialect-specific syntax differences automatically
- Offer better protection against SQL injection vulnerabilities
-
Batch Processing Support: For performance optimization, we should implement batched queries similar to the BigQuery handler https://github.com/apache/beam/blob/6426f72a59c00ea1561250af9a3594340a5ee373/sdks/python/apache_beam/transforms/enrichment_handlers/bigquery.py#L180
Could @liferoad and @svetakvsundhar share their thoughts on the user interface design and SQLAlchemy integration?
Reminder, please take a look at this pr: @liferoad @damccorm
waiting on author
Installed distributions Trying to resolve the latest version from remote
It seems there are issues with the CI. Close and reopen this PR to re-trigger it.
It looks like all CI tests passed. I would love to receive any follow-up feedback :)
cc: @claudevdm, @damccorm
Reminder, please take a look at this pr: @liferoad @damccorm
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:
R: @tvalentyn for label python. R: @melap for label website.
Available commands:
stop reviewer notifications- opt out of the automated review toolingremind me after tests pass- tag the comment author after tests passwaiting on author- shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
It looks like all CI tests passed. I would love to receive any follow-up feedback :)
cc: @claudevdm, @damccorm
Sorry for the delay, I will take a look this week.
Reminder, please take a look at this pr: @tvalentyn @melap
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:
R: @liferoad for label python. R: @damccorm for label website.
Available commands:
stop reviewer notifications- opt out of the automated review toolingremind me after tests pass- tag the comment author after tests passwaiting on author- shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
waiting on author
waiting on author
Hey @damccorm, I have already addressed @claudevdm feedback. I think it is ready for review :)
waiting on author
Hey @damccorm, I have already addressed @claudevdm feedback. I think it is ready for review :)
I am going to try set up a cloudsql instance so we can write some integration tests.
I am going to try set up a cloudsql instance so we can write some integration tests.
@claudevdm Thank you for your guidance 🙏
I’m curious if it’s necessary to use a Cloud SQL instance for our current integration test cases. At the moment, I’ve been using the PostgreSQL Testcontainer for integration testing. My understanding is that our goal is to ensure the enrichment handler works with Cloud SQL or any similar SQL adapter. If the integration tests pass with the PostgreSQL container, I would expect them to pass with Cloud SQL as well—unless there’s an issue specific to Cloud SQL, in which case we would actually be testing Cloud SQL itself.
Could you please let me know if I’m missing something or if you’d like me to approach this differently?
Reminder, please take a look at this pr: @liferoad @damccorm
Run Python_ML PreCommit 3.10
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:
R: @tvalentyn for label python. R: @melap for label website.
Available commands:
stop reviewer notifications- opt out of the automated review toolingremind me after tests pass- tag the comment author after tests passwaiting on author- shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)
Hey @claudevdm, Have you got time to review this PR? I have addressed your feedback and I think it is almost there :)
Run Portable_Python PreCommit 3.9
ERROR: Failed building wheel for pymssql
I've addressed claudevdm's last feedback and it works locally. Though I have some troubles installing pymssql adapter for microsoft sql server in the docker containers for testing. I will try to get to it EOD tomorrow. This is may be a related issue I need to investigate more https://github.com/pymssql/pymssql/issues/826.
PS:
Sorry for force-push here As far as I could remember there was kind of merge conflict and I did rebasing which updated the timestamps of the git commits (I could have done merging instead). I wanna also mention that I haven't updated the content of previous commits. Just addressed the third round of feedback in the last new commit sdks/python: address claudevdm feedback (3).
A bunch of tests are failing with
cls = <class 'sqlalchemy.dialects.mysql.pymysql.MySQLDialect_pymysql'>
@classmethod
def import_dbapi(cls):
> return __import__("pymysql")
E ModuleNotFoundError: No module named 'pymysql'
Run Python PreCommit 3.10
The test test_retry_on_throttling failed. This test likely verifies the retry mechanism on throttling events from Google Cloud Storage (GCS) operations.
Run Python PreCommit 3.9
The CI tests currently fail with [gw1] [ 21%] FAILED apache_beam/io/gcp/gcsio_retry_test.py::TestGCSIORetry::test_retry_on_throttling. Closing and opening PR to trigger them all instead of triggering them manually
=================================== FAILURES =================================== ___________________ TestGCSIORetry.test_retry_on_throttling ____________________ [gw4] linux -- Python 3.9.22 /runner/_work/beam/beam/sdks/python/test-suites/tox/py39/build/srcs/sdks/python/target/.tox-py39-cloud/py39-cloud/bin/python
~~I've fixed the test cases related to the pymysql and pymssql. Now there are other test cases failing that don't seem related to my changes.~~
~~The test_retry_on_throttling fails, which suggests that It isn't properly handling scenarios where GCS temporarily rejects requests because It has exceeded the allowed request rate.~~
~~I have run those test cases multiple times to make sure they are not flaky test cases, but they are still failing consistently.~~
~~How am I supposed to move this PR forward?~~
~~cc: @claudevdm~~
All CI tests have finally passed 😁 I think the PR is now ready for review 🙂