
feat(firestore-bigquery-export): added JSON data type to the default bigquery schema

Open dackers86 opened this issue 2 years ago • 5 comments

The BigQuery extension currently uses a stringified JSON object to store data when syncing from Firestore. This PR prepares for when a JSON type becomes available.

A new JSON data type is currently in preview and can be tracked here: https://cloud.google.com/bigquery/docs/reference/standard-sql/json-data

  • [x] Update the default schema to use a JSON datatype
  • [ ] Ensure new tables are generated with the new datatype
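
For illustration, the schema change amounts to the data column going from STRING to JSON. A rough DDL sketch of the changelog table (column names are the ones used elsewhere in this thread; the other column types and the table name are assumptions, not the extension's actual definition):

CREATE TABLE `my_dataset.my_collection_raw_changelog` (
  document_name STRING,
  document_id STRING,
  timestamp TIMESTAMP,
  event_id STRING,
  operation STRING,
  path_params STRING,
  data JSON  -- previously STRING holding the stringified Firestore document
);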

fixes: https://github.com/firebase/extensions/issues/1775

dackers86 avatar Mar 16 '22 10:03 dackers86

@dackers86 when is it planned to merge this PR? Sounds like a helpful addition 😍

IchordeDionysos avatar Oct 07 '22 21:10 IchordeDionysos

Thanks IchordeDionysos. Now that the JSON update is out of its original preview status, we can resume development on this. The next steps will be to address the issues you have pointed out regarding the views and path_params.

dackers86 avatar Oct 10 '22 13:10 dackers86

I was able to make this work by using SELECT …, ARRAY_AGG(data)[offset(0)] as data, instead of attempting to GROUP BY data, since grouping by the native JSON type is still unsupported.
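
For anyone following along, a sketch of that approach for the latest view (dataset/table names are illustrative, and I've added an explicit ORDER BY inside ARRAY_AGG so the newest row wins; delete events, where data is NULL, may need extra handling such as IGNORE NULLS):

SELECT
  document_name,
  MAX(timestamp) AS timestamp,
  ARRAY_AGG(data ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS data
FROM `my_dataset.my_collection_raw_changelog`
GROUP BY document_name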

However, this removes most of the advantages of the JSON type! When querying the resulting view, the entire data column is loaded (and billed) instead of a subset. For instance, SELECT data.age FROM latest will read the same number of bytes as SELECT data FROM latest.

This is in contrast to SELECT data.age FROM changelog, which only bills for the age column (since native JSON values are split out into subcolumns at ingestion time).

I don't know enough about BigQuery to suggest an alternative strategy, but I really hope there is a way to have our cake and eat it too here.

nfarina avatar Nov 09 '22 14:11 nfarina

Moving this back to under consideration for planning. This PR was an initial start, but we need to consider and plan how it fits into the overall BQ schema and development.

dackers86 avatar Jan 03 '23 11:01 dackers86

We manually changed the data type of the changelog table to JSON and indeed had to deal with the same GROUP BY issue.
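
For reference, as far as I know BigQuery can't alter a STRING column to JSON in place, so the manual change means rebuilding the table, roughly like this (table name is illustrative; partitioning/clustering options would have to be restated, and SAFE.PARSE_JSON could be used if some rows might not contain valid JSON):

-- Rebuild the table, converting the stringified JSON column to the native JSON type
CREATE OR REPLACE TABLE `my_dataset.my_collection_raw_changelog` AS
SELECT * REPLACE (PARSE_JSON(data) AS data)
FROM `my_dataset.my_collection_raw_changelog`;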

We fixed it this way:

-- Rank each document's changelog rows by timestamp (newest first) so the
-- latest version can be picked without grouping by the JSON data column.
WITH RankedData AS (
  SELECT
    document_name,
    document_id,
    timestamp,
    event_id,
    operation,
    data,
    path_params,
    ROW_NUMBER() OVER(PARTITION BY document_name ORDER BY timestamp DESC) AS row_num
  FROM `collection`
)
SELECT
  document_name,
  document_id,
  timestamp,
  event_id,
  operation,
  data,
  path_params
FROM RankedData
WHERE row_num = 1  -- keep only the most recent change per document
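
If you want to use this as a drop-in replacement for the extension's generated latest view, a sketch (dataset/view names are illustrative) would be:

CREATE OR REPLACE VIEW `my_dataset.my_collection_raw_latest` AS
WITH RankedData AS (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY document_name ORDER BY timestamp DESC) AS row_num
  FROM `my_dataset.my_collection_raw_changelog`
)
SELECT * EXCEPT (row_num)
FROM RankedData
WHERE row_num = 1;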

Don't know enough about the inner workings / performance impact of BQ, but for us it works fine!

basvandorst avatar Oct 30 '23 10:10 basvandorst