[Bug]: spannerMetadataTableName missing from Dataflow Jobs UI

Open oulin-coder opened this issue 8 months ago • 1 comments

Related Template(s)

SpannerChangeStreamsToBigQuery

Template Version

v2

What happened?

I'm not sure if this is the right place to file this bug, but here's our situation:

We use a custom version of the SpannerChangeStreamsToBigQuery template, which we modified to support null primary keys and a few additional data types such as FLOAT32. Recently we could no longer make in-place updates to running jobs because the new versions were incompatible (we had upgraded apache-beam to 2.63.0 after a warning in the Dataflow UI that our previous version, 2.54.0, is deprecated). So we stopped the existing job and started a replacement job.

Previously, our job showed the newly created spannerMetadataTableName in the Dataflow Jobs UI under Job Info (screenshot). However, the replacement job does not show this parameter (screenshot). We also tried running gcloud dataflow jobs describe <job id> --full, but the parameter is not in the response either.

We finally managed to find the metadata table name by digging through our logs and finding a Spanner audit log entry for the creation of a table prefixed with "Metadata_dataflow_metadata_" (see log output for the full entry), written around the time we started the new job. The entry contains no reference to the Dataflow job name or job ID. Given that we need this metadata table name for future in-place updates of the job, having to find it this way seems prohibitively difficult.

The fact that the metadata table used to show up in Dataflow Jobs UI and no longer does after upgrading apache-beam seems like a bug.

Relevant log output

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {},
    "serviceName": "spanner.googleapis.com",
    "methodName": "google.spanner.admin.database.v1.DatabaseAdmin.UpdateDatabaseDdl",
    "resourceName": "projects/chorus-scout/instances/scout/databases/dataflow-metadata",
    "response": {
      "commitTimestamps": [
        "2025-04-02T21:01:08.291456Z",
        "2025-04-02T21:01:08.291456Z",
        "2025-04-02T21:01:08.291456Z"
      ],
      "database": "projects/chorus-scout/instances/scout/databases/dataflow-metadata",
      "statements": [
        "CREATE TABLE IF NOT EXISTS Metadata_dataflow_metadata_83683738_c4cf_4d7f_a20e_a28501031b26 (\n  PartitionToken STRING(MAX) NOT NULL,\n  ParentTokens ARRAY<STRING(MAX)> NOT NULL,\n  StartTimestamp TIMESTAMP NOT NULL,\n  EndTimestamp TIMESTAMP NOT NULL,\n  HeartbeatMillis INT64 NOT NULL,\n  State STRING(MAX) NOT NULL,\n  Watermark TIMESTAMP NOT NULL,\n  CreatedAt TIMESTAMP NOT NULL OPTIONS (\n    allow_commit_timestamp = true\n  ),\n  ScheduledAt TIMESTAMP OPTIONS (\n    allow_commit_timestamp = true\n  ),\n  RunningAt TIMESTAMP OPTIONS (\n    allow_commit_timestamp = true\n  ),\n  FinishedAt TIMESTAMP OPTIONS (\n    allow_commit_timestamp = true\n  ),\n) PRIMARY KEY(PartitionToken), ROW DELETION POLICY (OLDER_THAN(FinishedAt, INTERVAL 1 DAY))",
        "CREATE INDEX IF NOT EXISTS WatermarkIdx_dataflow_metadata_83683738_c4cf_4d7f_a20e_a2850103 ON Metadata_dataflow_metadata_83683738_c4cf_4d7f_a20e_a28501031b26(Watermark) STORING (State)",
        "CREATE INDEX IF NOT EXISTS CreatedAtIdx_dataflow_metadata_83683738_c4cf_4d7f_a20e_a2850103 ON Metadata_dataflow_metadata_83683738_c4cf_4d7f_a20e_a28501031b26(CreatedAt, StartTimestamp)"
      ],
      "@type": "type.googleapis.com/google.spanner.admin.database.v1.UpdateDatabaseDdlMetadata"
    }
  },
  "insertId": "1u2stawa0",
  "resource": {
    "type": "spanner_instance",
    "labels": {
      "instance_id": "scout",
      "instance_config": "",
      "location": "us-central1",
      "project_id": "chorus-scout"
    }
  },
  "timestamp": "2025-04-02T21:01:08.428021563Z",
  "severity": "NOTICE",
  "logName": "projects/chorus-scout/logs/cloudaudit.googleapis.com%2Factivity",
  "operation": {
    "id": "projects/chorus-scout/instances/scout/databases/dataflow-metadata/operations/r7d05288e_079a_41e3_ac46_dc91adffec0c",
    "producer": "spanner.googleapis.com",
    "last": true
  },
  "receiveTimestamp": "2025-04-02T21:01:10.355192518Z"
}
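The table name can be pulled out of an audit log entry like the one above programmatically. The following is a minimal sketch, not part of the template: it assumes the entry has already been exported as JSON, the function name `metadata_tables` is hypothetical, and the DDL statement in the sample is shortened to keep the example small.

```python
import json
import re

# Trimmed copy of the audit log entry above. Only the fields used below are
# kept, and the CREATE TABLE column list is shortened for readability.
AUDIT_LOG_ENTRY = json.loads(r"""
{
  "protoPayload": {
    "methodName": "google.spanner.admin.database.v1.DatabaseAdmin.UpdateDatabaseDdl",
    "response": {
      "statements": [
        "CREATE TABLE IF NOT EXISTS Metadata_dataflow_metadata_83683738_c4cf_4d7f_a20e_a28501031b26 (\n  PartitionToken STRING(MAX) NOT NULL\n) PRIMARY KEY(PartitionToken)",
        "CREATE INDEX IF NOT EXISTS WatermarkIdx_dataflow_metadata_83683738_c4cf_4d7f_a20e_a2850103 ON Metadata_dataflow_metadata_83683738_c4cf_4d7f_a20e_a28501031b26(Watermark) STORING (State)"
      ]
    }
  }
}
""")

# Matches only CREATE TABLE statements whose table name starts with "Metadata_".
CREATE_TABLE = re.compile(r"CREATE TABLE IF NOT EXISTS\s+(Metadata_\w+)")


def metadata_tables(entry: dict) -> list:
    """Return the metadata table names created by one UpdateDatabaseDdl entry."""
    statements = (
        entry.get("protoPayload", {}).get("response", {}).get("statements", [])
    )
    names = []
    for statement in statements:
        match = CREATE_TABLE.match(statement)
        if match:
            names.append(match.group(1))
    return names


print(metadata_tables(AUDIT_LOG_ENTRY))
```

This only automates the lookup; it does not tie the table to a particular Dataflow job, since the audit log entry carries no job name or job ID.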

oulin-coder, Apr 03 '25 16:04

@oulin-coder, up to apache-beam version 2.57.0, metadataTable was set in the pipeline options here. In version 2.58.0 this setting was removed with PR. The stated reason was that setting pipeline options from SpannerIO under the hood can clash with user-defined options, and can produce conflicting values when the connector is used more than once in the same pipeline.

As a workaround, you can search the Dataflow job logs in Cloud Logging with the filter "Partition metadata table that will be used is". The matching entry contains the name of the automatically created metadata table.
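A rough sketch of this workaround in Python follows. It rests on assumptions not confirmed in this thread: that the job's logs are stored under the `dataflow_step` resource type with a `job_id` label, and that the table name directly follows the quoted phrase in the log message. The helper names, the job ID, and the sample message are hypothetical.

```python
import re
from typing import Optional

# Phrase quoted in the comment above.
MARKER = "Partition metadata table that will be used is"


def build_filter(job_id: str) -> str:
    """Build a Cloud Logging filter limited to a single Dataflow job.

    Assumption: the job logs use the "dataflow_step" resource type. The bare
    quoted phrase is a global text search, so the field that holds the
    message does not need to be known.
    """
    return (
        'resource.type="dataflow_step" '
        f'AND resource.labels.job_id="{job_id}" '
        f'AND "{MARKER}"'
    )


def table_from_message(message: str) -> Optional[str]:
    """Extract the table name, assuming it follows the marker phrase."""
    match = re.search(re.escape(MARKER) + r"\W*(\w+)", message)
    return match.group(1) if match else None


# Hypothetical job ID and log message, for illustration only.
print(build_filter("my-job-id"))
print(table_from_message(
    MARKER + ": Metadata_dataflow_metadata_83683738_c4cf_4d7f_a20e_a28501031b26"
))
```

The filter string can be pasted into the Logs Explorer query box. The exact message layout should be checked against a real log entry before relying on the extraction.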

Please let me know if we can close this issue now.

TanuSharma2511, May 03 '25 04:05

I must be missing something. Surely the solution can't be to dig through the logs and manually copy-paste the table name somewhere else for later use?

Is the solution in this case to set a metadata table name in your own config and pass it in when the job starts, avoiding the automatic creation in the first place? I also don't understand this comment here in the docs.

tomnewton, Jul 10 '25 14:07