extensions
extensions copied to clipboard
feat(firestore-bigquery-export): added JSON data type to the default bigquery schema
The BigQuery extension currently uses a stringified
JSON object to store data when synching from Firestore. This PR prepares for when a JSON type becomes available.
A new JSON datatype is currently in preview and can be tracked here https://cloud.google.com/bigquery/docs/reference/standard-sql/json-data
- [x] Update the default schema to use a JSON datatype
- [ ] Ensure new tables are generated with the new datatype
fixes: https://github.com/firebase/extensions/issues/1775
@dackers86 when is it planned to merge this PR? Sounds like a helpful addition 😍
Thanks IchordeDionysos. Now that the JSON updated is out of it's original preview
status we can resume development on this. The next steps will be to address the issues you have pointed out in terms of the views
and path_params
.
I was able to make this work by using SELECT …, ARRAY_AGG(data)[offset(0)] as data,
instead of attempting to GROUP BY data
since grouping by native JSON type is still unsupported.
However, this removes most of the advantages of the JSON type! When querying the resulting View, the entire data
column is loaded (and billed) instead of a subset. For instance SELECT data.age FROM latest
will read the same number of bytes as SELECT data FROM latest
.
This is as opposed to SELECT data.age FROM changelog
which only bills for the age
column (since native JSON types are spread out into native subcolumns at ingestion time).
I don't know enough about BigQuery to suggest an alternative strategy, but I really hope there is a way to have our cake and eat it too here.
Moving this back to under consideration
for planning. This PR was an initial start but we need to consider and plan how this fits into the overall BQ schema and development.
We manually changed the datatype of the changelog-table to JSON and had indeed to deal with the same group by issue.
We fixed it in this way:
WITH RankedData AS (
SELECT
document_name,
document_id,
timestamp,
event_id,
operation,
data,
path_params,
ROW_NUMBER() OVER(PARTITION BY document_name ORDER BY timestamp DESC) AS row_num
FROM `collection`
)
SELECT
document_name,
document_id,
timestamp,
event_id,
operation,
data,
path_params
FROM RankedData
WHERE row_num = 1
Don't know enough about the inner workings /performance impact of BQ but for us it works fine!