datahub icon indicating copy to clipboard operation
datahub copied to clipboard

Lineage, BigQuery - lineage from tables/views created via Terraform (REST API) is not being ingested

Open vgaidass opened this issue 2 years ago • 3 comments

Describe the bug Metadata ingestion doesn't extract lineage from tables or views created via Terraform (that uses REST API) as DataHub can't find relevant logs in cloudaudit_googleapis_com_data_access table.

Screenshots Visualization of the issue:

image

To Reproduce Prerequisite: Cloud Logging needs to sink logs to BigQuery Steps to reproduce the behavior:

  1. Create Terraform resources for BQ table and view:
resource "google_bigquery_table" "test_tbl" {
  project    = "${YOUR_PROJECT_ID}"
  dataset_id = "${YOUR_DATASET_ID}"
  table_id   = "test_tbl"
  schema = jsonencode([
    {
      "mode" : "NULLABLE",
      "name" : "col_1",
      "type" : "STRING"
    },
    {
      "mode" : "NULLABLE",
      "name" : "col_2",
      "type" : "INTEGER"
    }
  ])

  deletion_protection = false
}

resource "google_bigquery_table" "test_view" {
  project    = "${YOUR_PROJECT_ID}"
  dataset_id = "${YOUR_DATASET_ID}"
  table_id   = "test_view"
  view {
    query          = "YOUR SQL QUERY"
    use_legacy_sql = false
  }

  deletion_protection = false
}
  1. Apply Terraform plan to your GCP project
  2. Try to run the following query against exported audit logs table cloudaudit_googleapis_com_data_access. You won't find any traces of created objects
SELECT
  timestamp,
  logName,
  insertId,
  protopayload_auditlog AS protoPayload,
  protopayload_auditlog.metadataJson AS metadata
FROM
  `YOUR_PROJECT.YOUR_DATASET.cloudaudit_googleapis_com_data_access`
WHERE
  timestamp >= "2022-01-01"
  AND timestamp < "2022-12-31"
  AND protopayload_auditlog.serviceName="bigquery.googleapis.com"
  AND JSON_EXTRACT_SCALAR(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobStatus.jobState") = "DONE"
  AND JSON_EXTRACT(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobStatus.errorResults") IS NULL
  AND JSON_EXTRACT(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobConfig.queryConfig") IS NOT NULL
  AND (JSON_EXTRACT(protopayload_auditlog.metadataJson,
      "$.jobChange.job.jobConfig.queryConfig.query") LIKE "%test_tbl%"
    OR JSON_EXTRACT(protopayload_auditlog.metadataJson,
      "$.jobChange.job.jobConfig.queryConfig.query") LIKE "%test_view%")
  AND JSON_EXTRACT(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobConfig.queryConfig.query") NOT LIKE "%cloudaudit_googleapis_com_data_access%"
  AND JSON_EXTRACT(protopayload_auditlog.metadataJson,
    "$.jobChange.job.jobConfig.queryConfig.query") NOT LIKE "%cloudaudit_googleapis_com_activity%"
  1. Try to run the following query against exported audit logs table cloudaudit_googleapis_com_activity. You will find entries for tables/views created from Terraform and (theoretically) any other tool that uses REST API
SELECT
  timestamp,
  logName,
  insertId,
  protopayload_auditlog AS protoPayload,
  protopayload_auditlog.methodName AS methodName,
  protopayload_auditlog.resourceName AS resourceName,
  protopayload_auditlog.metadataJson AS metadata
FROM
  `YOUR_PROJECT.YOUR_DATASET.cloudaudit_googleapis_com_activity`
WHERE
  timestamp >= "2022-01-01"
  AND timestamp < "2022-12-31"
  AND protopayload_auditlog.serviceName="bigquery.googleapis.com"
  AND protopayload_auditlog.methodName = "google.cloud.bigquery.v2.TableService.InsertTable"
  AND (protopayload_auditlog.resourceName LIKE "%test_tbl"
    OR protopayload_auditlog.resourceName LIKE "%test_view")
ORDER BY
  timestamp DESC

How protopayload_auditlog.metadataJson looks like for BQ table:

{
    "@type": "type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata",
    "tableCreation": {
        "reason": "TABLE_INSERT_REQUEST",
        "table": {
            "createTime": "2022-04-26T07:46:53.862Z",
            "schemaJson": "{\n  \"fields\": [{\n    \"name\": \"col_1\",\n    \"type\": \"STRING\",\n    \"mode\": \"NULLABLE\"\n  }, {\n    \"name\": \"col_2\",\n    \"type\": \"INTEGER\",\n    \"mode\": \"NULLABLE\"\n  }]\n}",
            "tableName": "projects/project/datasets/dataset/tables/test_tbl",
            "updateTime": "2022-04-26T07:46:53.902Z"
        }
    }
}

How protopayload_auditlog.metadataJson looks like for BQ view:

{
    "@type": "type.googleapis.com/google.cloud.audit.BigQueryAuditMetadata",
    "tableCreation": {
        "reason": "TABLE_INSERT_REQUEST",
        "table": {
            "createTime": "2022-04-26T07:46:53.893Z",
            "schemaJson": "{\n  \"fields\": [{\n    \"name\": \"col_1\",\n    \"type\": \"STRING\",\n    \"mode\": \"NULLABLE\"\n  },  {\n    \"name\": \"col_2\",\n    \"type\": \"INTEGER\",\n    \"mode\": \"NULLABLE\"\n  }]\n}",
            "tableName": "projects/project/datasets/dataset/tables/test_view",
            "updateTime": "2022-04-26T07:46:54.087Z",
            "view": {
                "query": "SELECT * FROM `project.dataset.tbl_1` UNION ALL SELECT * FROM `project.dataset.tbl_2`"
            }
        }
    }
}

Expected behavior DataHub needs to cover cases when tables/views were created via REST API by Terraform or other tools. That is especially important for extracting view lineage as its creation won't be visible in cloudaudit_googleapis_com_data_access as the view was not created by a BQ job but an API call instead.

That implies that query contents from protopayload_auditlog.metadataJson.tableCreation.table.view.query needs to be parsed by sqlparser and pushed to sqllineage.

This issue seems to be relevant not only for exported audit log tables (use_exported_bigquery_audit_metadata: true) but also for direct extracts from Cloud Logging as query templates (both v1 and v2) are targeting BQ job service only.

vgaidass avatar Apr 26 '22 08:04 vgaidass

@anshbansal @jjoyce0510 Raising awareness as this seems to be quite a big thing.

vgaidass avatar Apr 26 '22 08:04 vgaidass

Hi @vgaidass, thanks so much for the level of detail here - definitely something we need to address.

Hi @anshbansal - please take a look at this one!

maggiehays avatar Apr 26 '22 19:04 maggiehays

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Sep 15 '22 06:09 github-actions[bot]

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

github-actions[bot] avatar Oct 16 '22 02:10 github-actions[bot]

This issue was closed because it has been inactive for 30 days since being marked as stale.

github-actions[bot] avatar Nov 16 '22 02:11 github-actions[bot]