Add service account impersonation support for BigQueryMetastoreCatalog
Description
This PR adds service account impersonation support to BigQueryMetastoreCatalog, enabling identity separation between cluster operations and data access.
Problem
BigQueryMetastoreCatalog only supports Application Default Credentials (ADC) and has no mechanism for service account impersonation. This prevents:
- Implementing least-privilege security (cluster operations vs data access)
- Running multi-tenant workloads on shared clusters
- Creating proper audit trails per service account
Solution
This PR introduces a pluggable factory pattern for BigQuery client creation, with impersonation support built on Google's ImpersonatedCredentials API.
Key changes:
- Created BigQueryClientFactory interface with DefaultBigQueryClientFactory (ADC) and ImpersonatedBigQueryClientFactory (impersonation)
- Added impersonation properties to GCPProperties: service account, delegates, lifetime, scopes
- Updated BigQueryMetastoreCatalog to use factory pattern
- Propagated impersonation settings to GCS operations via PrefixedStorage
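The factory selection described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ClientFactory`, `DefaultFactory`, `ImpersonatedFactory`, and `describe()` are hypothetical stand-ins for the real `BigQueryClientFactory` types, and the actual interface may have a different shape. Only the property keys are taken from the Configuration section.

```java
import java.util.Map;

final class FactoryLoadingSketch {
  // Property key as listed in the Configuration section of this PR.
  static final String FACTORY_PROPERTY = "gcp.bigquery.client.factory";

  /** Hypothetical stand-in for BigQueryClientFactory; the real interface may differ. */
  public interface ClientFactory {
    void initialize(Map<String, String> properties);

    String describe();
  }

  /** Hypothetical default: stands in for the ADC-based factory. */
  public static class DefaultFactory implements ClientFactory {
    @Override
    public void initialize(Map<String, String> properties) {}

    @Override
    public String describe() {
      return "adc";
    }
  }

  /** Hypothetical impersonating factory: only records the target account here. */
  public static class ImpersonatedFactory implements ClientFactory {
    private String target;

    @Override
    public void initialize(Map<String, String> properties) {
      target = properties.get("gcp.impersonate.service-account");
      if (target == null) {
        throw new IllegalArgumentException("gcp.impersonate.service-account is required");
      }
    }

    @Override
    public String describe() {
      return "impersonate:" + target;
    }
  }

  /** Loads the configured factory by class name, falling back to the default when unset. */
  static ClientFactory load(Map<String, String> properties) {
    String impl = properties.get(FACTORY_PROPERTY);
    ClientFactory factory;
    if (impl == null) {
      factory = new DefaultFactory();
    } else {
      try {
        factory =
            Class.forName(impl)
                .asSubclass(ClientFactory.class)
                .getDeclaredConstructor()
                .newInstance();
      } catch (ReflectiveOperationException | ClassCastException e) {
        throw new IllegalArgumentException("Cannot load client factory: " + impl, e);
      }
    }
    factory.initialize(properties);
    return factory;
  }
}
```

Falling back to the default when the property is absent is what keeps existing catalogs on ADC without any configuration change.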
Configuration
Minimal:
gcp.bigquery.client.factory=org.apache.iceberg.gcp.bigquery.ImpersonatedBigQueryClientFactory
gcp.impersonate.service-account=data-sa@project.iam.gserviceaccount.com
Full:
gcp.bigquery.client.factory=org.apache.iceberg.gcp.bigquery.ImpersonatedBigQueryClientFactory
gcp.impersonate.service-account=data-sa@project.iam.gserviceaccount.com
gcp.impersonate.delegates=admin-sa@project.iam.gserviceaccount.com
gcp.impersonate.lifetime-seconds=3600
gcp.impersonate.scopes=bigquery,devstorage.read_only
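For illustration, the properties above could be read into a typed settings object along these lines. This is a sketch under stated assumptions, not the parsing logic in GCPProperties: the class name `ImpersonationSettingsSketch` is hypothetical, and the defaults (3600 seconds, the `cloud-platform` scope) as well as the expansion of short scope names to full `https://www.googleapis.com/auth/...` URLs are assumptions made for the example.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ImpersonationSettingsSketch {
  // Assumption: short scope names are expanded with the standard Google OAuth scope prefix.
  static final String SCOPE_PREFIX = "https://www.googleapis.com/auth/";

  final String serviceAccount;
  final List<String> delegates;
  final Duration lifetime;
  final List<String> scopes;

  private ImpersonationSettingsSketch(
      String serviceAccount, List<String> delegates, Duration lifetime, List<String> scopes) {
    this.serviceAccount = serviceAccount;
    this.delegates = delegates;
    this.lifetime = lifetime;
    this.scopes = scopes;
  }

  /** Parses the gcp.impersonate.* keys; defaults here are assumptions, not the PR's values. */
  static ImpersonationSettingsSketch parse(Map<String, String> props) {
    String sa = props.get("gcp.impersonate.service-account");
    if (sa == null || sa.trim().isEmpty()) {
      throw new IllegalArgumentException("gcp.impersonate.service-account is required");
    }
    List<String> delegates = splitCsv(props.get("gcp.impersonate.delegates"));
    long seconds =
        Long.parseLong(props.getOrDefault("gcp.impersonate.lifetime-seconds", "3600").trim());
    List<String> scopes = new ArrayList<>();
    for (String scope : splitCsv(props.getOrDefault("gcp.impersonate.scopes", "cloud-platform"))) {
      // Leave full URLs untouched; expand short names such as "bigquery".
      scopes.add(scope.startsWith("https://") ? scope : SCOPE_PREFIX + scope);
    }
    return new ImpersonationSettingsSketch(
        sa.trim(), delegates, Duration.ofSeconds(seconds), scopes);
  }

  /** Splits a comma-separated value, trimming entries and dropping empty ones. */
  private static List<String> splitCsv(String value) {
    List<String> result = new ArrayList<>();
    if (value == null) {
      return result;
    }
    for (String part : value.split(",")) {
      String trimmed = part.trim();
      if (!trimmed.isEmpty()) {
        result.add(trimmed);
      }
    }
    return result;
  }
}
```

With the minimal configuration, only the service account is set, so the delegate list is empty and the assumed defaults apply to lifetime and scopes.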
Testing
Added unit tests:
- TestDefaultBigQueryClientFactory
- TestImpersonatedBigQueryClientFactory
- TestBigQueryCatalog
- TestGCPProperties
- TestPrefixedStorage
Backward Compatibility
Fully backward compatible: catalogs without factory configuration continue to use ADC exactly as before.
Closes #14446
@talatuyarer can you review this one please?
Hello @talatuyarer, @ebyhr, @nastra, I would really appreciate it if you could take a look when you have some time.
@joyhaldar Thanks for working on this! I think service account impersonation is also relevant outside of the BigQueryMetastoreCatalog context, basically for every workload running on GCP, so I would just make sure it's generic enough to be used regardless of the catalog type.
Hi @yogevyuval,
Thanks for the feedback! I really appreciate it.
I wrote this to follow current patterns, for example AssumeRoleAwsClientFactory also only works with AWS catalogs if I am not wrong (please correct me if I am). I also think that users can handle impersonation at the application level for other catalog types if needed.
I personally think this would be best addressed in a follow-up PR to keep the scope focused, but I'm happy to try and expand this PR to support any catalog now if you and the other reviewers think that's a good idea.
Please let me know what you think.
Thanks, Joy
What I meant is a situation where a lakehouse is hosted on GCP with a self-managed catalog, such as Polaris or Hive Metastore, while the files are still stored in GCS. That is where impersonation can be really useful even when BigQuery is not in use.
Is the service account impersonation support for the catalog, fileio, or both?
I see there's already a GoogleAuthManager class for handling auth and Google credentials. It uses GoogleCredentials.fromStream, which already supports ImpersonatedCredentials.
Could we reuse the GoogleAuthManager to abstract away the auth details?
Thank you for the comment, Kevin.
The impersonation supports both BigQuery and GCS FileIO.
Regarding GoogleAuthManager, I was under the impression that it's designed for REST Catalog authentication, while BigQueryMetastoreCatalog uses GoogleCredentials directly with GCP client libraries.
Please let me know if I'm misunderstanding your suggestion.