Add service account impersonation support for BigQueryMetastoreCatalog

Open joyhaldar opened this issue 2 months ago • 7 comments

Description: This PR adds service account impersonation support to BigQueryMetastoreCatalog, enabling identity separation between cluster operations and data access.

Problem

BigQueryMetastoreCatalog only supports Application Default Credentials with no mechanism for service account impersonation. This prevents:

  • Implementing least-privilege security (cluster operations vs data access)
  • Running multi-tenant workloads on shared clusters
  • Creating proper audit trails per service account

Solution

Introduces a pluggable factory pattern for BigQuery client creation with impersonation support using Google's ImpersonatedCredentials API.

Key changes:

  • Created BigQueryClientFactory interface with DefaultBigQueryClientFactory (ADC) and ImpersonatedBigQueryClientFactory (impersonation)
  • Added impersonation properties to GCPProperties: service account, delegates, lifetime, scopes
  • Updated BigQueryMetastoreCatalog to use factory pattern
  • Propagated impersonation settings to GCS operations via PrefixedStorage
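As a rough illustration of the factory pattern described above, here is a minimal sketch using only the JDK. All type names in it (ClientFactory, DefaultClientFactory, ImpersonatedClientFactory, ClientFactories) are hypothetical stand-ins, not the classes added by this PR, and the factories return a description string instead of a real BigQuery client. Only the two property keys come from the configuration in this description; loading the class by reflection is an assumption about how such a property would typically be resolved.

```java
import java.util.Map;

// Hypothetical stand-in for BigQueryClientFactory. The real interface would
// return a BigQuery client; this sketch returns a description string instead.
interface ClientFactory {
  String describe(Map<String, String> properties);
}

// Stand-in for DefaultBigQueryClientFactory: relies on ADC only.
final class DefaultClientFactory implements ClientFactory {
  @Override
  public String describe(Map<String, String> properties) {
    return "adc";
  }
}

// Stand-in for ImpersonatedBigQueryClientFactory: requires a target account.
final class ImpersonatedClientFactory implements ClientFactory {
  @Override
  public String describe(Map<String, String> properties) {
    String target = properties.get("gcp.impersonate.service-account");
    if (target == null || target.isEmpty()) {
      throw new IllegalArgumentException("gcp.impersonate.service-account is required");
    }
    return "impersonate:" + target;
  }
}

final class ClientFactories {
  static final String FACTORY_PROP = "gcp.bigquery.client.factory";

  private ClientFactories() {}

  // Falls back to the ADC factory when no factory class is configured.
  static ClientFactory load(Map<String, String> properties) {
    String impl = properties.get(FACTORY_PROP);
    if (impl == null || impl.isEmpty()) {
      return new DefaultClientFactory();
    }
    try {
      // Assumption: the configured class has a no-arg constructor.
      return Class.forName(impl)
          .asSubclass(ClientFactory.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalArgumentException("Cannot load client factory: " + impl, e);
    }
  }
}
```

Calling ClientFactories.load with an empty map yields the ADC stand-in, which is the property the backward-compatibility claim depends on: nothing changes unless the factory key is set.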

Configuration

Minimal:

gcp.bigquery.client.factory=org.apache.iceberg.gcp.bigquery.ImpersonatedBigQueryClientFactory
gcp.impersonate.service-account=data-sa@project.iam.gserviceaccount.com

Full:

gcp.bigquery.client.factory=org.apache.iceberg.gcp.bigquery.ImpersonatedBigQueryClientFactory
gcp.impersonate.service-account=data-sa@project.iam.gserviceaccount.com
gcp.impersonate.delegates=admin-sa@project.iam.gserviceaccount.com
gcp.impersonate.lifetime-seconds=3600
gcp.impersonate.scopes=bigquery,devstorage.read_only
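To make the shape of these settings concrete, the sketch below parses the gcp.impersonate.* keys into a typed holder. ImpersonationConfig is a hypothetical class written for illustration and is not part of GCPProperties. Treating delegates as a comma-separated list and defaulting the lifetime to 3600 seconds are both assumptions (the default simply reuses the value from the full example); the scopes format follows the example above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical holder for the gcp.impersonate.* settings shown above.
final class ImpersonationConfig {
  // Assumed default, taken from the value used in the full example.
  static final int DEFAULT_LIFETIME_SECONDS = 3600;

  final String serviceAccount;
  final List<String> delegates;
  final int lifetimeSeconds;
  final List<String> scopes;

  private ImpersonationConfig(
      String serviceAccount, List<String> delegates, int lifetimeSeconds, List<String> scopes) {
    this.serviceAccount = serviceAccount;
    this.delegates = delegates;
    this.lifetimeSeconds = lifetimeSeconds;
    this.scopes = scopes;
  }

  static ImpersonationConfig fromProperties(Map<String, String> props) {
    String serviceAccount = props.get("gcp.impersonate.service-account");
    if (serviceAccount == null || serviceAccount.trim().isEmpty()) {
      throw new IllegalArgumentException("gcp.impersonate.service-account is required");
    }
    String lifetime = props.get("gcp.impersonate.lifetime-seconds");
    int lifetimeSeconds =
        lifetime == null ? DEFAULT_LIFETIME_SECONDS : Integer.parseInt(lifetime.trim());
    if (lifetimeSeconds <= 0) {
      throw new IllegalArgumentException("lifetime must be positive: " + lifetimeSeconds);
    }
    return new ImpersonationConfig(
        serviceAccount.trim(),
        splitCsv(props.get("gcp.impersonate.delegates")),
        lifetimeSeconds,
        splitCsv(props.get("gcp.impersonate.scopes")));
  }

  // Splits a comma-separated value, trimming entries and dropping empty ones.
  private static List<String> splitCsv(String value) {
    if (value == null) {
      return List.of();
    }
    return Arrays.stream(value.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }
}
```

With the minimal configuration, this holder ends up with no delegates, no scopes, and the assumed default lifetime; with the full configuration it carries one delegate and two scopes.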

Testing

Added unit tests:

  • TestDefaultBigQueryClientFactory
  • TestImpersonatedBigQueryClientFactory
  • TestBigQueryCatalog
  • TestGCPProperties
  • TestPrefixedStorage

Backward Compatibility

Fully backward compatible: catalogs without factory configuration continue using ADC exactly as before.

Closes #14446

joyhaldar avatar Oct 30 '25 08:10 joyhaldar

@talatuyarer can you review this one please?

nastra avatar Oct 30 '25 09:10 nastra

Hello @talatuyarer, @ebyhr, @nastra, I would really appreciate it if you could take a look when you have some time.

joyhaldar avatar Oct 30 '25 10:10 joyhaldar

@joyhaldar Thanks for working on this! I think service account impersonation is also relevant outside of the BigQueryMetastoreCatalog context, basically for every workload running on GCP, so I would just make sure it's generic enough to be used regardless of the catalog type.

yogevyuval avatar Nov 03 '25 13:11 yogevyuval

Hi @yogevyuval,

Thanks for the feedback! I really appreciate it.

I wrote this to follow current patterns, for example AssumeRoleAwsClientFactory also only works with AWS catalogs if I am not wrong (please correct me if I am). I also think that users can handle impersonation at the application level for other catalog types if needed.

I personally think this would be best addressed in a follow-up PR to keep the scope focused, but I'm happy to try and expand this PR to support any catalog now if you and the other reviewers think that's a good idea.

Please let me know what you think.

Thanks, Joy

joyhaldar avatar Nov 03 '25 14:11 joyhaldar

What I meant is a situation where a lakehouse is hosted on GCP with a self-managed catalog, such as Polaris or Hive Metastore, while the files are still stored in GCS. That is where impersonation can be really useful even when not using BigQuery.

yogevyuval avatar Nov 03 '25 18:11 yogevyuval

Is the service account impersonation support for the catalog, fileio, or both?

I see there's already a GoogleAuthManager class for handling auth and Google credentials. It uses GoogleCredentials.fromStream, which already supports ImpersonatedCredentials.

Could we reuse the GoogleAuthManager to abstract away the auth details?

kevinjqliu avatar Nov 07 '25 20:11 kevinjqliu

Thank you for the comment, Kevin.

The impersonation supports both BigQuery and GCS FileIO.

Regarding GoogleAuthManager, I was under the impression that it's designed for REST Catalog authentication, while BigQueryMetastoreCatalog uses GoogleCredentials directly with GCP client libraries.

Please let me know if I'm misunderstanding your suggestion.

joyhaldar avatar Nov 08 '25 03:11 joyhaldar