bigquery-emulator
bigquery-emulator copied to clipboard
Wrong result from TO_JSON if REPEATED attribute is nested to another REPEATED attribute
What happened?
I faced some strange issue, if we have the following schema:
[
{
"name": "attributes",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "id",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "properties",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "propertyName",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "value",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
]
}
]
This schema results in a wrong TO_JSON output (which I'll describe in the next section)
What did you expect to happen?
Expected to have this JSON:
{
"attributes": [
{
"id": "1",
"properties": [
[
{
"propertyName": "Test property",
"value": "Test value"
}
]
]
}
]
}
How can we reproduce it (as minimally and precisely as possible)?
If we create data using the following script:
import os
from google.api_core.client_options import ClientOptions
from google.api_core.retry import Retry
from google.auth.credentials import AnonymousCredentials
from google.cloud import bigquery
# Point to the BigQuery Emulator
os.environ["BIGQUERY_EMULATOR_HOST"] = "http://localhost:32831"
# Initialize BigQuery Client
client_options = ClientOptions(api_endpoint=os.environ.get("BIGQUERY_EMULATOR_HOST"))
client = bigquery.Client(
"test-project",
client_options=client_options,
credentials=AnonymousCredentials()
)
# Dataset and Table Details
dataset_id = 'test-dataset'
table_id = 'test_table'
def main():
# Step 1: Create the Dataset
dataset_ref = client.dataset(dataset_id)
dataset = bigquery.Dataset(dataset_ref)
client.create_dataset(dataset, exists_ok=True)
# Step 2: Define the Table Schema
schema = [
bigquery.SchemaField("attributes", "RECORD", mode="REPEATED", fields=[
bigquery.SchemaField("id", "STRING", mode="NULLABLE"),
bigquery.SchemaField("properties", "RECORD", mode="REPEATED", fields=[
bigquery.SchemaField("propertyName", "STRING", mode="NULLABLE"),
bigquery.SchemaField("value", "STRING", mode="NULLABLE"),
]),
]),
]
table_ref = dataset_ref.table(table_id)
table = bigquery.Table(table_ref, schema=schema)
client.create_table(table, exists_ok=True)
# Step 3: Insert Correctly Formatted Rows into the Table
rows_to_insert = [
{
"attributes": [
{
"id": "1",
"properties": [
{
"propertyName": "Test property",
"value": "Test value"
}
]
}
]
}
]
try:
errors = client.insert_rows_json(table=table, json_rows=rows_to_insert, retry=Retry(predicate=lambda e: False)) # Insert JSON rows
if errors:
print(f"Encountered errors while inserting rows: {errors}")
else:
print("Rows successfully inserted.")
except Exception as exc:
print(f"Failed to insert rows: {exc}")
if __name__ == "__main__":
main()
And then we execute the following SQL query against this test container:
SELECT TO_JSON(STRUCT(attributes)) FROM `test-dataset.test_table`
We get this JSON as a result:
{
"attributes": [
{
"id": "1",
"properties": [
[
{
"propertyName": "Test property"
},
{
"value": "Test value"
}
]
]
}
]
}
Anything else we need to know?
This might be related to REPEATED inside REPEATED, so if I change properties to be mode=NULLABLE, then I have the correct JSON
{
"attributes": [
{
"id": "1",
"properties": [
{
"propertyName": "Test property",
"value": "Test value"
}
]
}
]
}
This was fixed and released in the Recidiviz fork of the emulator v0.6.6-recidiviz.1
https://github.com/Recidiviz/bigquery-emulator/releases/tag/v0.6.6-recidiviz.1