[firestore-bigquery-export] Support clustering with Firestore field

Open alfred-risb opened this issue 3 years ago • 7 comments

Since this extension already has a partitioning feature based on a Firestore field, can we have a clustering feature based on a Firestore field too?

alfred-risb avatar Jul 07 '22 06:07 alfred-risb

Hi @alfred-risb

Yes, clustering is enabled as a feature. Adding the required field under clustering while re-configuring should automatically add clustering to your BigQuery table.

Can you let me know if this is the information you are looking for? Thanks.

dackers86 avatar Jul 11 '22 08:07 dackers86

Hi @dackers86

The parameter description says: "This parameter will allow you to set up Clustering for the BigQuery Table created by the extension. (for example: data,document_id,timestamp - no whitespaces). You can select up to 4 comma-separated fields (order matters). Available schema extensions table fields for clustering: document_id, timestamp, event_id, operation, data."

So a clustering field must be one of the top-level fields: document_id, timestamp, event_id, operation, or data. I need to cluster on fields from Firestore, which end up nested inside the JSON string in the data column.
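To make the documented constraints concrete, here is a minimal sketch of how such a parameter string could be checked. It is not the extension's code: `parse_clustering_param` and the two constants are hypothetical names, and the rules encoded are only the ones stated in the documentation quoted above (no whitespace, at most 4 fields, order preserved, top-level fields only).

```python
from typing import List

# Top-level columns listed in the documentation text quoted above.
ALLOWED_CLUSTERING_FIELDS = ("document_id", "timestamp", "event_id", "operation", "data")
MAX_CLUSTERING_FIELDS = 4


def parse_clustering_param(value: str) -> List[str]:
    """Hypothetical helper: split and validate a clustering parameter string."""
    if any(ch.isspace() for ch in value):
        raise ValueError("clustering parameter must not contain whitespace")

    # An empty string means "no clustering"; otherwise split on commas.
    fields = value.split(",") if value else []

    if len(fields) > MAX_CLUSTERING_FIELDS:
        raise ValueError(f"at most {MAX_CLUSTERING_FIELDS} clustering fields are allowed")

    unknown = [name for name in fields if name not in ALLOWED_CLUSTERING_FIELDS]
    if unknown:
        raise ValueError(f"not a top-level field: {', '.join(unknown)}")

    # The list keeps the order given by the user, since order matters for clustering.
    return fields
```

Under these assumptions, `parse_clustering_param("data,document_id,timestamp")` is accepted, while a field that only exists inside the data JSON, such as `myfield`, is rejected.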

alfred-risb avatar Jul 14 '22 01:07 alfred-risb

I see what you mean!

Partitioning will automatically create a new top level field if the field does not exist in the schema. However, we currently do not do this for clustering.

Would creating a separate schema view solve this issue? https://github.com/firebase/extensions/blob/master/firestore-bigquery-export/guides/GENERATE_SCHEMA_VIEWS.md Or is the minimum requirement that the field exist on the table itself?

If clustering is required as a top level field similar to how partitioning works, I can mark this as a feature request/investigation.

dackers86 avatar Jul 14 '22 08:07 dackers86

Hello, I agree that this feature would be extremely useful. Clustering would help reduce query costs, but there is currently no point in configuring clustering on any of the top-level fields (timestamp, event_id, operation) other than document_id, because most queries filter on the nested fields of the data column. Perhaps a config option could be added to declare custom Firestore fields to be parsed from the data column and written to dedicated columns. This would allow any field to be used for partitioning or clustering. Otherwise, creating the column(s) ad hoc, as currently happens for partitioning, should also work.
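As a rough illustration of the "dedicated column" idea, the sketch below copies selected keys out of the JSON-encoded data column into top-level keys of a row. It is only an assumption about how such an option might behave: `promote_fields` is a hypothetical name, and the row is modelled as a plain dict whose `data` value is a JSON string.

```python
import json
from typing import Any, Dict, Iterable


def promote_fields(row: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Hypothetical helper: copy keys from the JSON `data` string into top-level columns."""
    payload = json.loads(row["data"])
    promoted = dict(row)  # shallow copy so the input row is left unchanged

    for name in fields:
        if name in promoted:
            # Avoid silently overwriting an existing column such as document_id.
            raise ValueError(f"column {name!r} already exists on the row")
        # A key missing from the document becomes None (NULL in the table).
        promoted[name] = payload.get(name)

    return promoted
```

With a row such as `{"document_id": "abc", "data": '{"myfield": "x"}'}`, calling `promote_fields(row, ["myfield"])` would add a top-level `myfield` key while leaving the original `data` string in place, so existing queries on the data column would keep working.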

enricobachiorrini avatar Dec 19 '22 00:12 enricobachiorrini

+1 Very interested in this. It would be super helpful.

The documentation says:

Available schema extensions table fields for clustering: document_id, timestamp, event_id, operation, data.

But fields inside the data JSON cannot be used. I get the following error in the extension's Cloud Function logs when configuring a clustering field nested in my data JSON:

Unable to add clustering, field(s) myfield do not exist on the expected table
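For readers trying to understand the message, here is a small sketch of the kind of check that would produce it. This is a guess at the behaviour, not the extension's actual implementation: `clustering_error` is a hypothetical function, and joining several missing fields with a comma is an assumption.

```python
from typing import Iterable, Optional, Set


def clustering_error(requested: Iterable[str], table_columns: Set[str]) -> Optional[str]:
    """Hypothetical check: report requested clustering fields that are not table columns."""
    missing = [name for name in requested if name not in table_columns]
    if not missing:
        return None  # every requested field is a real top-level column
    # Message wording copied from the log line above; the separator is assumed.
    return (
        f"Unable to add clustering, field(s) {', '.join(missing)} "
        "do not exist on the expected table"
    )
```

Because `myfield` lives inside the data JSON string rather than as a column of its own, a check like this would not find it among the table's top-level columns.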

pldelattre avatar Jan 04 '23 18:01 pldelattre