[firestore-bigquery-export] Disable changelog table
[READ] Step 1: Are you in the right place?
Issues filed here should be about a feature request for a specific extension in this repository. To file a feature request that affects multiple extensions or the Firebase Extensions platform, please reach out to Firebase support directly.
[REQUIRED] Step 2: Extension name
This feature request is for extension: firestore-bigquery-export
What feature would you like to see?
I'd like there to be an option to remove the changelog table in BigQuery as I'm only interested in current state of Firestore being visible in BigQuery, and my changelog would fill up fast and use a lot of space in BigQuery.
How would you use it?
Would use it to disable changelog dataset, as I'm only interested in the current content.
Yes please.
+1
We will have to investigate what's possible here, as I believe the changelog is how the latest views etc are generated. Appreciate that this is popular request, we're looking into it :)
I would be very great, here we just want the latest version of a document and I've had to setup a job that deletes old snapshots periodicly.
To solve this problem, I use a scheduled query that cleans the data once a month. If you have a lot of data, you can reduce the interval and range (reducing the interval without reducing the range will lead to an increase in the cost of completing the task, since it passes through a large amount of data).
The query performed by the task: (*replace projectID and prefix_table_1, prefix_table_2, etc. with your own project ID and tables):
CALL `projectID`.`firestore_export`.`clean_changelog_all`(['prefix_table_1_raw_changelog','prefix_table_2_raw_changelog']);
The procedure that starts the cleanup of each table: (*replace projectID with your own project ID):
CREATE OR REPLACE PROCEDURE `projectID.firestore_export.clean_changelog_all`(tableNames ARRAY<STRING>)
BEGIN
FOR t IN (SELECT * FROM UNNEST(tableNames) AS name)
DO
CALL `projectID.firestore_export.clean_changelog`(t.name);
END FOR;
END;
The procedure for clearing the table: (*replace projectID with your own project ID and (optional) INTERVAL -1 MONTH):
CREATE OR REPLACE PROCEDURE `projectID.firestore_export.clean_changelog`(tableName STRING)
BEGIN
DECLARE query STRING;
SET query = CONCAT("DELETE FROM `projectID.firestore_export.", tableName, "` ","\n",
"WHERE (document_name, timestamp) IN","\n",
"(","\n",
" WITH latest AS (","\n",
" SELECT MAX(timestamp) as timestamp, document_name","\n",
" FROM `projectID.firestore_export.",tableName,"`","\n",
" GROUP BY document_name","\n",
" )","\n",
" SELECT (t.document_name, t.timestamp)","\n",
" FROM `projectID.firestore_export.",tableName,"` AS t","\n",
" JOIN latest ON (t.document_name = latest.document_name )","\n",
" WHERE t.timestamp != latest.timestamp","\n",
" AND DATETIME(t.timestamp) < DATE_ADD(CURRENT_DATETIME(), INTERVAL -1 MONTH)","\n",
");");
EXECUTE IMMEDIATE query;
END;
I will also note that deleting the previous version every time you write a new one takes quite a long time, so I do not recommend going this way. And also that update operations are quite slow in BigQuery and are not supported in tables using streaming insertion, so this option is also not suitable.
This is the manual solution I used.
DELETE FROM your_dataset.your_table
WHERE STRUCT(document_name, timestamp) NOT IN (
SELECT AS STRUCT document_id, MAX(timestamp) AS timestamp
FROM your_dataset.your_table
GROUP BY document_name
)
https://stackoverflow.com/a/78232504/10222449