kedro-plugins
feat(datasets): Implement `spark.GBQQueryDataset` for reading data from BigQuery as a spark dataframe using SQL query
Description
- The current implementation of `spark.SparkDataset` has only rudimentary support for reading data from BigQuery using a SQL query.
- Adding this functionality to `spark.SparkDataset` may not comply with `kedro_datasets` design principles, as it would require making `filepath` an optional argument.
- Further, like `pandas.GBQQueryDataset`, `spark.GBQQueryDataset` is a read-only dataset, so a separate implementation is better suited to maintaining the overall design of the datasets.
Development notes
To test the dataset manually:
- Create a GCP project (`<project_id>`)
- Create a test dataset inside BigQuery as `<project_id>.<test_dataset>`
- Create a test materialization dataset inside BigQuery as `<project_id>.<test_mat_dataset>`
- Create a test table inside the test dataset as `<project_id>.<test_dataset>.<test_table>`
- Create a service account with the required permissions
- Download the service account credentials JSON key
Then load the table with the dataset:
>>> from kedro_datasets.spark import GBQQueryDataset
>>>
>>> # Define your SQL query
>>> sql = "SELECT * FROM `<project_id>.<test_dataset>.<test_table>`"
>>>
>>> # Initialize dataset
>>> dataset = GBQQueryDataset(
...     sql=sql,
...     materialization_dataset="your_dataset",
...     materialization_project="your_project",  # optional
...     credentials=dict(file="/path/to/your/credentials.json"),
... )
>>>
>>> # Load data
>>> df = dataset.load()
>>>
>>> # Example output
>>> df.show()
Checklist
- [x] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
- [ ] Updated the documentation to reflect the code changes
- [ ] Updated `jsonschema/kedro-catalog-X.XX.json` if necessary
- [ ] Added a description of this change in the relevant `RELEASE.md` file
- [x] Added tests to cover my changes
- [ ] Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)
So this is a great first start - the credentials resolution looks complicated, but also well thought out.
I think we'd need to see some tests for this to go in as is; you can take some inspiration from the pandas equivalent.
That being said, we could look at contributing this to the experimental part of [kedro-datasets-experimental](https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/CONTRIBUTING.md#experimental-datasets), which has a lower threshold for admission.
Thanks @datajoely, @astrojuanlu
Credentials Handling
Yes, the credentials handling is a bit different from the rest of the datasets. Also, I do think that spark.SparkDataset does not support setting GCP service account credentials; rather, it always relies on setting credentials for the parsed filesystem (fsspec).
Earlier, I had been reading BigQuery tables using spark.SparkDataset with the following methods of auth:
my_dataset:
  type: spark.SparkDataset
  file_format: bigquery
  filepath: /tmp/my_table.parquet  # Just because it's a non-optional arg of type `str`
  load_args:
    table: "<my_project_id>.<my_dataset>.<my_table>"
- Work on a GCP resource which already has the required authorization via an assigned service account
- OR, download a service account JSON key and set the env var:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
- OR, set the following as Spark conf:
spark.hadoop.google.cloud.auth.service.account.enable: true
spark.hadoop.google.cloud.auth.service.account.json.keyfile: /path/to/credentials.json
With the above dataset, I wanted to allow passing credentials directly to the dataset. But it seems we may have to standardize this a little across the other kedro datasets for the GCP case.
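As an illustration only, here is a minimal sketch (not the PR's actual code) of how a kedro-style credentials dict could be mapped onto spark-bigquery-connector read options, so that neither `GOOGLE_APPLICATION_CREDENTIALS` nor Hadoop conf is needed. The `credentialsFile` and `credentials` (base64-encoded key) option names come from the connector; the `file`/`json` dict keys are assumptions borrowed from the usage example above.

```python
import base64


def bq_auth_options(credentials: dict) -> dict:
    """Map a credentials dict onto spark-bigquery-connector read options.

    Sketch only: the merged dataset may use different keys or logic.
    """
    if "file" in credentials:
        # Path to a service-account JSON key file available to Spark
        return {"credentialsFile": credentials["file"]}
    if "json" in credentials:
        # Raw key contents; the connector expects them base64-encoded
        return {"credentials": base64.b64encode(credentials["json"].encode()).decode()}
    return {}
```

Each resulting option would then be applied via `reader.option(key, value)` before calling `.load()`.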
Implementing tests
And let me take a look at how tests can be implemented for this. Initial thoughts: since this doesn't involve a BigQuery client, the mocking approach used in pandas.GBQQueryDataset may not be relevant here.
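One possible direction, sketched under assumptions rather than taken from the PR: since no BigQuery client is involved, a test can replace the SparkSession and only assert the options handed to the connector. The patch target assumes the dataset obtains its session via `SparkSession.builder.getOrCreate()`; if it goes through a helper, patch that helper instead.

```python
from unittest.mock import MagicMock

from pyspark.sql import SparkSession

from kedro_datasets.spark import GBQQueryDataset


def test_load_uses_bigquery_connector(mocker):
    # Replace the SparkSession builder so no real Spark or BigQuery work happens.
    fake_spark = MagicMock()
    mocker.patch.object(
        SparkSession,
        "builder",
        MagicMock(getOrCreate=MagicMock(return_value=fake_spark)),
    )

    dataset = GBQQueryDataset(
        sql="SELECT 1",
        materialization_dataset="test_mat_dataset",
    )
    dataset.load()

    # Only assert what was passed to Spark, not any BigQuery behaviour.
    fake_spark.read.format.assert_called_once_with("bigquery")
```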
Moving to experimental
For moving this to experimental, let me know and I'll lift and shift it to the kedro_datasets_experimental namespace :)
Added some tests.
Related to credentials, one thing that isn't implemented here is reading the SQL query from a text file, which may be placed anywhere, so reading the SQL text file itself may require credentials. This is already implemented in pandas.GBQQueryDataset. I haven't implemented a filepath argument here in spark.GBQQueryDataset.
So this could necessitate providing two types of credentials? Or could just one suffice, used to authenticate with BigQuery as well as with any storage backend?
So, I have now implemented passing separate credentials for BigQuery and the filesystem (since the SQL file may be placed on any filesystem, which may require specific auth).
I feel the credentials handling is a bit unstandardized across datasets; however, I have done what suits this case best.
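Purely to illustrate the two-credential idea, a hypothetical instantiation is sketched below; the keyword names (`fs_credentials` in particular) and the `filepath` support are assumptions and may not match the merged implementation.

```python
from kedro_datasets.spark import GBQQueryDataset

# Hypothetical: `filepath` points at a SQL file on GCS, read with filesystem
# credentials, while the query itself runs against BigQuery with separate
# service-account credentials. Argument names are illustrative only.
dataset = GBQQueryDataset(
    filepath="gs://my-bucket/queries/report.sql",
    materialization_dataset="test_mat_dataset",
    credentials={"file": "/path/to/bq-credentials.json"},       # BigQuery auth
    fs_credentials={"token": "/path/to/gcs-credentials.json"},  # fsspec auth
)
```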
Found a few kedro discussions on credentials:
- https://github.com/kedro-org/kedro/wiki/Credentials-in-kedro
- https://github.com/kedro-org/kedro/issues/1646
Part of what makes it tricky is:
- Making sure the Python API and YAML API are compatible, and the interface doesn't burden the user with an overly specific schema
- Authentication and authorization is a BIG topic with a lot of nuances. I think it's best for kedro to leave the heavy lifting of credentials management to third-party client libraries, and have as little custom code for this as possible.
- Anything more custom should be implemented by the user (but kedro may have guidance on it as it matures more)
- The "where do we keep it, and how do we pass it" dilemma. Oftentimes it isn't clear where credentials should be managed in kedro (the `local` config environment, as an environment variable, etc.)
Happy to contribute to standardizing credentials handling in kedro as we move forward. However I do realise, this would require quite a bit of research (and user surveys) to perfect.
Hi @abhi8893, thank you for this contribution! I haven't used BigQuery myself, but generally the code looks fine to me. There are still some failing builds, which will need to pass before we can put this up for a vote. Otherwise, like @datajoely suggested, this could be contributed as experimental, meaning the build checks are less strict.
@merelcht Finally got around to picking this back up again. Now the test coverage is 100% and all tests pass. If the kedro team takes a vote to put this into experimental, then I will shift it there 🙂
This PR has gone through review, thanks to @ankatiyar and @ravi-kumar-pilla, so now calling for a TSC vote to have this dataset added to the core datasets: @kedro-org/kedro-developers
My main question is, what about this dataset is specific to BigQuery? From the implementation there is no BigQuery-related import.
From my understanding, `SparkDataset` today supports `bigquery` as an engine but it doesn't use the `sql` interface. So is this more like a generic Spark SQL dataset?
Hey @noklam , thanks for the review.
This dataset is still very specific to Bigquery, and is not a generic spark SQL dataset. Reading from other SQL warehouses like Azure Synapse, AWS Redshift, Snowflake will require different mechanisms for reading + authorization.
spark.SparkDataset has rudimentary support for reading data from BigQuery using a SQL query, as follows:
my_spark_bq_dataset:
  type: spark.SparkDataset
  filepath: None  # This is a required argument hence needs to be provided
  load_args:
    query: SELECT * FROM `my_project.my_dataset.my_table`
Prior to spark-bigquery-connector version 0.43.0, `materializationDataset` also needed to be set, which spark.SparkDataset can't handle natively but spark.GBQQueryDataset can.
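For context, a bare PySpark sketch of such a query read is shown below; the option names are spark-bigquery-connector options, and the 0.43.0 threshold is the one quoted above, not independently verified here.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Query reads go through view materialization; before connector 0.43.0 the
# materializationDataset option had to be supplied explicitly.
df = (
    spark.read.format("bigquery")
    .option("viewsEnabled", "true")
    .option("materializationDataset", "test_mat_dataset")
    .option("query", "SELECT * FROM `my_project.my_dataset.my_table`")
    .load()
)
```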
In summary, these are the pros for spark.GBQQueryDataset vs spark.SparkDataset:
- ✅ Support for providing the `sql` query directly as well as through a file (`filepath`) placed on any filesystem
- ✅ Separate handling of filesystem and BigQuery credentials
- ✅ Native bigquery connector support
If this is accepted, it would also make sense to create spark.GBQTableDataset, in line with the design of pandas.GBQQueryDataset and pandas.GBQTableDataset.
lmk your thoughts on this! 🙁
@abhi8893
Thanks for the explanation. I am sorry for the late reply as I missed the GitHub notification.
This makes sense to me and is a strong enough argument to make GBQ standalone for now. We may need to think about this harder in the future, as I don't want to end up in a situation where we have two datasets per connector.
Thank you again for your contribution!
Hi everyone! Looks like this PR has several approvals, what do we need to get it over the finish line?
@astrojuanlu this is an addition to the core datasets, so it needs 1/2 TSC approval.
Merging as we have 10/20 approvals.