[firestore-bigquery-export] Support clustering with Firestore field
Since this extension already has a partitioning feature based on a Firestore field, can we have a clustering feature based on a Firestore field too?
Hi @alfred-risb
Yes, clustering is enabled as a feature. Adding the required field under clustering while re-configuring should automatically add clustering to your BigQuery table.
Can you let me know if this is the information you are looking for? Thanks.
Hi @dackers86
This parameter will allow you to set up Clustering for the BigQuery Table created by the extension. (for example: data,document_id,timestamp - no whitespaces). You can select up to 4 comma-separated fields (order matters). Available schema extensions table fields for clustering: document_id, timestamp, event_id, operation, data.
According to this description, a clustering field must be a top-level field, i.e. one of document_id, timestamp, event_id, operation, or data. I need to cluster on fields from Firestore, which end up nested inside the JSON string in the data column.
I see what you mean!
Partitioning will automatically create a new top level field if the field does not exist in the schema. However, we currently do not do this for clustering.
Would creating a separate schema view solve this issue? See https://github.com/firebase/extensions/blob/master/firestore-bigquery-export/guides/GENERATE_SCHEMA_VIEWS.md. Or is the minimum requirement that the field exist on the table itself?
If clustering is required as a top level field similar to how partitioning works, I can mark this as a feature request/investigation.
Hello, I agree that this feature would be extremely useful. Clustering would help reduce query costs, but there is currently little point in configuring clustering on any of the top-level fields (timestamp, event_id, operation) other than document_id, because most queries filter on the nested fields of the data column. Perhaps a config option could be added to declare custom Firestore fields to be parsed from the data and written to dedicated columns. This would allow any field to be used for partitioning or clustering. Otherwise, creating the column(s) ad hoc, as currently happens for partitioning, should also work.
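To make the suggestion above concrete, here is a minimal sketch of what "parse custom fields from the data and write them to dedicated columns" could look like. This is not the extension's code: the `ChangelogRow` shape, the `promoteFields` helper, and the convention of turning a dotted path such as `tenant.id` into a `tenant_id` column are all assumptions made for illustration.

```typescript
// Hypothetical row shape: the five top-level columns named in the docs,
// with `data` holding the Firestore document as a JSON string.
type ChangelogRow = {
  document_id: string;
  timestamp: string;
  event_id: string;
  operation: string;
  data: string;
};

// Hypothetical helper: copies selected (possibly nested) fields out of the
// `data` JSON string into top-level columns on the row.
function promoteFields(
  row: ChangelogRow,
  fieldPaths: string[]
): Record<string, unknown> {
  const doc: unknown = JSON.parse(row.data);
  const out: Record<string, unknown> = { ...row };

  for (const path of fieldPaths) {
    // Walk the dotted path; fall back to null if any segment is missing.
    let value: unknown = doc;
    for (const key of path.split(".")) {
      if (value !== null && typeof value === "object" && key in (value as object)) {
        value = (value as Record<string, unknown>)[key];
      } else {
        value = null;
        break;
      }
    }
    // Assumed naming convention: "tenant.id" becomes column "tenant_id".
    // Note: a path that maps to an existing column name would overwrite it.
    out[path.replace(/\./g, "_")] = value;
  }
  return out;
}
```

With columns produced this way, a field such as `tenant_id` would be a top-level column and could in principle be listed as a clustering field, the same way partitioning already creates its own column.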
+1 Very interested in this. It would be super helpful.
The documentation says:
Available schema extensions table fields for clustering: document_id, timestamp, event_id, operation, data.
But fields inside the data JSON cannot be used. When I configure a clustering field that is nested in my data JSON, I get the following error in the extension's Cloud Function logs:
Unable to add clustering, field(s) myfield do not exist on the expected table
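For anyone hitting the same message, the behaviour is consistent with a check along the following lines. This is only a guess at the logic, written as a standalone sketch; `checkClustering` and its constants are hypothetical names, not the extension's actual implementation. It assumes the parameter is a comma-separated string of at most 4 fields and that each field must already exist as a column on the table.

```typescript
// Top-level columns listed in the extension documentation quoted above.
const TOP_LEVEL_FIELDS: string[] = [
  "document_id",
  "timestamp",
  "event_id",
  "operation",
  "data",
];
// The docs state that up to 4 clustering fields can be selected.
const MAX_CLUSTERING_FIELDS = 4;

type ClusteringCheck = {
  fields: string[];
  missing: string[];
  error?: string;
};

// Hypothetical validation: split the parameter and compare each field
// against the columns that exist on the table.
function checkClustering(
  param: string,
  tableColumns: string[] = TOP_LEVEL_FIELDS
): ClusteringCheck {
  const fields = param.split(",").filter((f) => f.length > 0);

  if (fields.length > MAX_CLUSTERING_FIELDS) {
    return {
      fields,
      missing: [],
      error: `At most ${MAX_CLUSTERING_FIELDS} clustering fields are allowed`,
    };
  }

  const missing = fields.filter((f) => tableColumns.indexOf(f) === -1);
  if (missing.length > 0) {
    return {
      fields,
      missing,
      error: `Unable to add clustering, field(s) ${missing.join(",")} do not exist on the expected table`,
    };
  }
  return { fields, missing };
}
```

Under these assumptions, `checkClustering("document_id,myfield")` reports `myfield` as missing because it only exists inside the `data` JSON string, while `checkClustering("document_id,timestamp")` passes.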