
BigQueryRetrievalJob does not remove tables used for data export

Open mjurkus opened this issue 1 year ago • 5 comments

Expected Behavior

Temporary tables in BigQueryRetrievalJob should be removed after the job completes or fails.

Current Behavior

When running materialization with a batch engine against BigQuery, a `historical_<datestamp>_<hash>` table is created to export data from the BigQuery temporary table. The data is then extracted to a GCS bucket, but the table is always retained.

Steps to reproduce

Run the materialization job with BigQuery as the offline_store and a batch_engine such as Bytewax.

Specifications

  • Version: 0.34.1

Possible Solution

Add cleanup in BigQueryRetrievalJob.to_remote_storage.
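
A minimal sketch of that cleanup, assuming the job holds a `google.cloud.bigquery.Client` and that `to_bigquery()` returns the fully qualified name of the export table; `self.client` and `_extract_to_gcs` are illustrative names, not Feast's actual internals:

```python
from google.cloud import bigquery


def to_remote_storage(self) -> list[str]:
    # to_bigquery() creates the historical_<datestamp>_<hash> export table
    # and returns its fully qualified name.
    table_name = self.to_bigquery()
    try:
        # Run the EXTRACT job that copies the table contents to GCS
        # and returns the list of exported file URIs.
        return self._extract_to_gcs(table_name)
    finally:
        # Drop the export table whether the extract succeeded or failed;
        # not_found_ok avoids raising if the table was never created.
        client: bigquery.Client = self.client
        client.delete_table(table_name, not_found_ok=True)
```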

mjurkus avatar Nov 09 '23 07:11 mjurkus

As a first step, there's a practice you can apply on your side: set up a default table expiration for your BigQuery dataset. Also, would you mind creating a PR for this?
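
For reference, a default expiration can be set on the dataset with the BigQuery Python client; the dataset ID below is hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my_project.feast_dataset")  # hypothetical dataset ID

# Any table created in this dataset afterwards expires after 7 days,
# unless the table sets its own expiration.
dataset.default_table_expiration_ms = 7 * 24 * 60 * 60 * 1000
client.update_dataset(dataset, ["default_table_expiration_ms"])
```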

sudohainguyen avatar Nov 09 '23 14:11 sudohainguyen

PR for which part? Modifying the `historical_<datestamp>_<hash>` table properties to make it expire, or implementing try/finally in BigQueryRetrievalJob.to_remote_storage once the export job is completed?

Modifying the table properties causes some complications: the same BigQueryRetrievalJob.to_bigquery function, where the table is created, is also used to create a saved dataset via FeatureStore.create_saved_dataset, so an expiration set there would apply to saved datasets as well.
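
For illustration, per-table expiration would look like this with the BigQuery client (the table ID is hypothetical); the complication above is that this same code path also backs saved datasets, which would then expire too:

```python
from datetime import datetime, timedelta, timezone

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table(
    "my_project.feast_dataset.historical_20231109_abc123"  # hypothetical table ID
)

# BigQuery drops the table automatically once this timestamp passes,
# including tables that back saved datasets if set on this code path.
table.expires = datetime.now(timezone.utc) + timedelta(days=1)
client.update_table(table, ["expires"])
```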

mjurkus avatar Nov 09 '23 20:11 mjurkus

The "try/finally" option is probably better for the current situation, given the risk that "to_remote_storage" can crash after the table is created.

shuchu avatar Nov 10 '23 04:11 shuchu

try/finally sounds great to me as well

sudohainguyen avatar Nov 12 '23 04:11 sudohainguyen

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 17 '24 11:03 stale[bot]