bigquery-emulator icon indicating copy to clipboard operation
bigquery-emulator copied to clipboard

Wrong result from TO_JSON if REPEATED attribute is nested to another REPEATED attribute

Open david-blulyan opened this issue 7 months ago • 1 comments

What happened?

I faced some strange issue, if we have the following schema:

[
  {
    "name": "attributes",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "id",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "properties",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
          {
            "name": "propertyName",
            "type": "STRING",
            "mode": "NULLABLE"
          },
          {
            "name": "value",
            "type": "STRING",
            "mode": "NULLABLE"
          }
        ]
      }
    ]
  }
]

This schema results in a wrong TO_JSON output (which I'll describe in the next section)

What did you expect to happen?

Expected to have this JSON:

{
  "attributes": [
    {
      "id": "1",
      "properties": [
        [
          {
            "propertyName": "Test property",
            "value": "Test value"
          }
        ]
      ]
    }
  ]
}

How can we reproduce it (as minimally and precisely as possible)?

If we create data using the following script:

import os

from google.api_core.client_options import ClientOptions
from google.api_core.retry import Retry
from google.auth.credentials import AnonymousCredentials
from google.cloud import bigquery

# Point to the BigQuery Emulator
os.environ["BIGQUERY_EMULATOR_HOST"] = "http://localhost:32831"

# Initialize BigQuery Client
client_options = ClientOptions(api_endpoint=os.environ.get("BIGQUERY_EMULATOR_HOST"))
client = bigquery.Client(
  "test-project",
  client_options=client_options,
  credentials=AnonymousCredentials()
)

# Dataset and Table Details
dataset_id = 'test-dataset'
table_id = 'test_table'

def main():
    # Step 1: Create the Dataset
    dataset_ref = client.dataset(dataset_id)
    dataset = bigquery.Dataset(dataset_ref)
    client.create_dataset(dataset, exists_ok=True)

    # Step 2: Define the Table Schema
    schema = [
        bigquery.SchemaField("attributes", "RECORD", mode="REPEATED", fields=[
            bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
            bigquery.SchemaField("properties", "RECORD", mode="REPEATED", fields=[
                bigquery.SchemaField("propertyName", "STRING", mode="NULLABLE"),
                bigquery.SchemaField("value", "STRING", mode="NULLABLE"),
            ]),
        ]),
    ]

    table_ref = dataset_ref.table(table_id)
    table = bigquery.Table(table_ref, schema=schema)
    client.create_table(table, exists_ok=True)

    # Step 3: Insert Correctly Formatted Rows into the Table
    rows_to_insert = [
        {
            "attributes": [
                {
                    "id": "1",
                    "properties": [
                        {
                            "propertyName": "Test property",
                            "value": "Test value"
                        }
                    ]
                }
            ]
        }
    ]

    try:
        errors = client.insert_rows_json(table=table, json_rows=rows_to_insert, retry=Retry(predicate=lambda e: False))  # Insert JSON rows
        if errors:
            print(f"Encountered errors while inserting rows: {errors}")
        else:
            print("Rows successfully inserted.")
    except Exception as exc:
        print(f"Failed to insert rows: {exc}")

if __name__ == "__main__":
    main()

And then we execute the following SQL query against this test container:

SELECT TO_JSON(STRUCT(attributes)) FROM `test-dataset.test_table`

We get this JSON as a result:

{
  "attributes": [
    {
      "id": "1",
      "properties": [
        [
          {
            "propertyName": "Test property"
          },
          {
            "value": "Test value"
          }
        ]
      ]
    }
  ]
}

Anything else we need to know?

This might be related to REPEATED inside REPEATED, so if I change properties to be mode=NULLABLE, then I have the correct JSON

{
  "attributes": [
    {
      "id": "1",
      "properties": [
        {
          "propertyName": "Test property",
          "value": "Test value"
        }
      ]
    }
  ]
}

david-blulyan avatar May 20 '25 11:05 david-blulyan

This was fixed and released in the Recidiviz fork of the emulator v0.6.6-recidiviz.1 https://github.com/Recidiviz/bigquery-emulator/releases/tag/v0.6.6-recidiviz.1

ohaibbq avatar Nov 13 '25 03:11 ohaibbq