[question] Cloning the catalog DB for read-only use at a specific epoch
I have a question regarding the use of DuckDB or SQLite as a catalog database.
In my setup (the idea):
- Current approach:
  - Run a daily ingestion job that creates a collection of Parquet files.
  - Have a custom UI dashboard that uses DuckDB in read-only mode, querying the Parquet files directly.
- New approach:
  - Run a daily ingestion job that updates the catalog and retains the full history.
  - Have a custom UI dashboard that reads from a shallow clone of the catalog DB in read-only mode.
What I’d like to do is the following: After each ingestion job completes, I want to create a copy (or snapshot) of the catalog database at that specific point in time (epoch). This copy would then be used only by the UI Dashboard in read-only mode, without exposing the live catalog database.
Is this possible/recommended when using DuckDB or SQLite as the catalog backend? Are there any best practices or pitfalls I should be aware of when copying the database file for read-only consumption?
Thanks in advance for your help and for the great work on this project!
I think this would work well with either DuckDB or SQLite. It should be fine to create a copy of the metadata database, as long as the copy is not made while changes are being written to the file (i.e. take it after the ingestion job has finished). Alternatively, you could use a SQLite database as the catalog and use time travel to pin a specific snapshot for the UI.
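A minimal sketch of both options. For a SQLite catalog, `VACUUM INTO` writes a transactionally consistent copy to a new file; for the time-travel route, DuckLake's `AT (VERSION => ...)` clause pins a query to one snapshot. The paths, the attached catalog name `my_ducklake`, the table name, and the version number are all illustrative, not taken from your setup:

```sql
-- Option A: copy the catalog after ingestion completes.
-- SQLite writes a consistent snapshot of the whole database
-- to a new file, even if readers are still active.
VACUUM INTO '/snapshots/catalog_2024-06-01.db';

-- Option B: time travel on a shared catalog — the UI queries a
-- fixed snapshot id instead of the latest state (DuckLake syntax;
-- 'my_ducklake', 'files', and the version are placeholders).
SELECT * FROM my_ducklake.files AT (VERSION => 42);
```

With option A, the dashboard opens the copy read-only (e.g. `file:...?mode=ro` for SQLite, or `ATTACH ... (READ_ONLY)` in DuckDB) so the live catalog is never exposed.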
Unfortunately, at least in my case, using the same file won't be possible, because it would require taking locks on remote files in Azure storage (which is not possible, since write operations are not allowed).
I will try to use a clone after the ingestion job completes.
Maybe it would be cool to be able to export the catalog database at a specific "snapshot", possibly with the history removed as well.
You could use `ducklake_expire_snapshots` to remove history manually.
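For illustration, a hedged sketch of that call, assuming the DuckLake extension is loaded and the catalog is attached under the placeholder name `my_ducklake` (the retention window is arbitrary):

```sql
-- Expire all snapshots older than 30 days from the catalog.
CALL ducklake_expire_snapshots('my_ducklake', older_than => now() - INTERVAL '30 days');

-- Expired snapshots only drop metadata; a separate cleanup call
-- removes the data files no remaining snapshot references.
CALL ducklake_cleanup_old_files('my_ducklake', cleanup_all => true);
```

Running this before taking the copy would keep the exported snapshot small.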