
[Feature] Add support for GCP BigQuery

Open kunaals opened this issue 2 weeks ago • 0 comments

Summary

Add support for ingesting GCP BigQuery resources into Cartography. BigQuery is Google Cloud's fully managed, serverless data warehouse that enables scalable analysis over petabytes of data. This feature would allow Cartography to track BigQuery datasets, tables, views, routines, models, and their access controls.

Motivation

BigQuery is a foundational data platform used across many organizations for analytics, ML, and data sharing. Because it often holds sensitive data, it is a critical surface for security analysis. By ingesting BigQuery resources, Cartography can surface:

  • Dataset and table inventory across projects
  • IAM bindings at dataset and table levels (fine-grained access control)
  • External tables and their connections to GCS, Drive, or other sources
  • Views and their underlying table dependencies, including authorized views
  • Data sharing via Analytics Hub linked datasets
  • ML models and routines (stored procedures/functions)
  • Row-level and column-level security policies

This unlocks graph-based security analysis such as:

  • Identifying datasets/tables with overly permissive access
  • Tracking data lineage from source tables through views
  • Detecting publicly accessible datasets
  • Mapping which service accounts have access to sensitive tables
  • Finding external tables that expose data from GCS buckets
  • Auditing cross-project data sharing patterns

Proposed Solution

Extend the GCP intel module to call the BigQuery APIs and model the following resources:

New Nodes

Core Resources:

  • GCPBigQueryDataset - Top-level container for tables/views
  • GCPBigQueryTable - Data tables (native and external)
  • GCPBigQueryView - Logical and materialized views
  • GCPBigQueryRoutine - Functions, procedures, remote functions
  • GCPBigQueryModel - BigQuery ML models

Access Control:

  • GCPBigQueryRowAccessPolicy - Row-level security policies

Data Sharing (Analytics Hub):

  • GCPBigQueryDataExchange - Analytics Hub exchanges
  • GCPBigQueryListing - Published dataset listings
  • GCPBigQueryLinkedDataset - Subscribed linked datasets

Connections:

  • GCPBigQueryConnection - External data source connections (Cloud SQL, Spanner, etc.)

New Relationships

Hierarchy:

  • (:GCPProject)-[:RESOURCE]->(:GCPBigQueryDataset)
  • (:GCPBigQueryDataset)-[:CONTAINS]->(:GCPBigQueryTable)
  • (:GCPBigQueryDataset)-[:CONTAINS]->(:GCPBigQueryView)
  • (:GCPBigQueryDataset)-[:CONTAINS]->(:GCPBigQueryRoutine)
  • (:GCPBigQueryDataset)-[:CONTAINS]->(:GCPBigQueryModel)

Access & Security:

  • (:GCPBigQueryTable)-[:HAS_ROW_ACCESS_POLICY]->(:GCPBigQueryRowAccessPolicy)
  • (:GCPBigQueryDataset)-[:ALLOWS_ACCESS]->(:GCPServiceAccount|GCPUser|GCPGroup) (via IAM bindings)
  • (:GCPBigQueryTable)-[:ALLOWS_ACCESS]->(:GCPServiceAccount|GCPUser|GCPGroup) (table-level IAM)
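
As a rough illustration of how the dataset-level ALLOWS_ACCESS edges could be derived, the sketch below flattens the `access` list of a `datasets.get` response into one dict per edge and flags public grants. The function name, the output shape, and the email-suffix heuristic for telling service accounts from users are assumptions made for this example, not existing Cartography code.

```python
from typing import Any

# Members that make a dataset readable by anyone (or any Google account).
PUBLIC_MEMBERS = {"allUsers", "allAuthenticatedUsers"}


def _principal_label(key: str, value: str) -> str:
    """Pick a proposed node label for a principal (heuristic, assumption)."""
    if key == "groupByEmail":
        return "GCPGroup"
    # Assumption: service accounts show up under userByEmail and can be
    # recognized by their email domain.
    if value.endswith("gserviceaccount.com"):
        return "GCPServiceAccount"
    return "GCPUser"


def access_edges(dataset: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten a dataset's `access` entries into ALLOWS_ACCESS edge dicts."""
    edges: list[dict[str, Any]] = []
    for entry in dataset.get("access", []):
        role = entry.get("role")
        for key in ("userByEmail", "groupByEmail"):
            if key in entry:
                edges.append({
                    "dataset_id": dataset["id"],
                    "principal": entry[key],
                    "label": _principal_label(key, entry[key]),
                    "role": role,
                    "public": False,
                })
        # Public grants appear as a special group or a bare IAM member.
        member = entry.get("specialGroup") or entry.get("iamMember")
        if member in PUBLIC_MEMBERS:
            edges.append({
                "dataset_id": dataset["id"],
                "principal": member,
                "label": None,  # no principal node; flag on the dataset
                "role": role,
                "public": True,
            })
    return edges
```

Entries that carry a `view`, `routine`, or `dataset` key are skipped here, since they describe authorized resources rather than principals. Table-level IAM would need a separate call and is not covered by this sketch.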

Data Lineage & Dependencies:

  • (:GCPBigQueryView)-[:REFERENCES]->(:GCPBigQueryTable) (view dependencies)
  • (:GCPBigQueryView)-[:AUTHORIZED_FOR]->(:GCPBigQueryDataset) (authorized views)
  • (:GCPBigQueryTable)-[:EXTERNAL_SOURCE]->(:GCSBucket) (external tables backed by GCS)
  • (:GCPBigQueryTable)-[:USES_CONNECTION]->(:GCPBigQueryConnection) (BigLake/external connections)
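
For the EXTERNAL_SOURCE relationship, one possible approach is to read the bucket names out of the `externalDataConfiguration.sourceUris` field of a `tables.get` response. The helper below is a minimal sketch of that idea; the function name is hypothetical, and it deliberately ignores non-GCS sources such as Drive URLs.

```python
from urllib.parse import urlparse


def gcs_buckets_for_table(table: dict) -> list[str]:
    """Return the distinct GCS bucket names backing an external table."""
    config = table.get("externalDataConfiguration") or {}
    buckets = set()
    for uri in config.get("sourceUris", []):
        parsed = urlparse(uri)
        # Only gs:// URIs map to a GCSBucket node; the bucket is the host part.
        if parsed.scheme == "gs" and parsed.netloc:
            buckets.add(parsed.netloc)
    return sorted(buckets)
```

Each returned name could then be matched against existing GCSBucket nodes when the relationship is written. Native tables have no `externalDataConfiguration`, so they yield an empty list.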

Data Sharing:

  • (:GCPBigQueryDataExchange)-[:HAS_LISTING]->(:GCPBigQueryListing)
  • (:GCPBigQueryListing)-[:SHARES]->(:GCPBigQueryDataset)
  • (:GCPBigQueryLinkedDataset)-[:SUBSCRIBED_TO]->(:GCPBigQueryListing)

Key Properties

GCPBigQueryDataset:

  • id, dataset_id, project_id
  • location, default_table_expiration_ms
  • labels, description
  • creation_time, last_modified_time

GCPBigQueryTable:

  • id, table_id, dataset_id, project_id
  • type (TABLE, VIEW, MATERIALIZED_VIEW, EXTERNAL, SNAPSHOT)
  • location, num_bytes, num_rows
  • creation_time, last_modified_time, expiration_time
  • clustering_fields, time_partitioning
  • encryption_configuration (CMEK)

GCPBigQueryView:

  • id, view_id, dataset_id, project_id
  • query (defining SQL)
  • use_legacy_sql
  • materialized (boolean)
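
To make the property mapping concrete, here is a sketch of a transform step that turns a raw `datasets.get` response into the GCPBigQueryDataset properties listed above. The function name and the choice to flatten `labels` into `key:value` strings (Neo4j properties cannot hold nested maps) are assumptions for illustration only.

```python
from typing import Any, Optional


def _to_int(value: Optional[str]) -> Optional[int]:
    """The API returns int64 fields as strings; convert when present."""
    return int(value) if value is not None else None


def transform_dataset(raw: dict[str, Any]) -> dict[str, Any]:
    """Map a BigQuery dataset resource to flat node properties."""
    ref = raw.get("datasetReference", {})
    return {
        "id": raw.get("id"),
        "dataset_id": ref.get("datasetId"),
        "project_id": ref.get("projectId"),
        "location": raw.get("location"),
        "default_table_expiration_ms": _to_int(raw.get("defaultTableExpirationMs")),
        # Flattened so the value can be stored as a list property.
        "labels": sorted(f"{k}:{v}" for k, v in raw.get("labels", {}).items()),
        "description": raw.get("description"),
        "creation_time": _to_int(raw.get("creationTime")),
        "last_modified_time": _to_int(raw.get("lastModifiedTime")),
    }
```

Missing optional fields come through as `None` (or an empty list for labels), so the same function works for sparse responses.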

GCP APIs to Integrate

  • bigquery.googleapis.com - BigQuery API v2
    • datasets.list, datasets.get
    • tables.list, tables.get
    • routines.list, routines.get
    • models.list, models.get
    • rowAccessPolicies.list
  • analyticshub.googleapis.com - Analytics Hub API
    • dataExchanges.list
    • listings.list
  • bigqueryconnection.googleapis.com - BigQuery Connection API
    • connections.list
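
The list methods above are paginated with a `nextPageToken` in the response and a `pageToken` on the request. A minimal sketch of a shared paging helper follows; `iter_pages` and its `fetch_page` callback are hypothetical names, and the callback is expected to wrap whatever API client the implementation ends up using.

```python
from typing import Callable, Iterator, Optional


def iter_pages(
    fetch_page: Callable[[Optional[str]], dict],
    items_key: str,
) -> Iterator[dict]:
    """Yield every item from a paginated list call.

    fetch_page takes a page token (None for the first page) and returns
    the decoded JSON response for that page.
    """
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        # An empty result may omit the items key entirely.
        yield from page.get(items_key, [])
        token = page.get("nextPageToken")
        if not token:
            break
```

With the Google API Python client, `fetch_page` could be something like `lambda t: client.datasets().list(projectId=project_id, pageToken=t).execute()` with `items_key="datasets"`, where `client` and `project_id` are supplied by the caller.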

Alternatives Considered

  • Using Cloud Asset Inventory for BigQuery - CAI provides basic metadata but misses table-level details, view definitions, and row access policies
  • Focusing only on datasets - misses the table-level granularity needed for data security analysis
  • Skipping Analytics Hub - would miss important cross-org data sharing patterns

Relevant Links

kunaals, Dec 08 '25 23:12