feat: Implement caching
## Ticket(s) Closed
- Closes #
## What
`CREATE FOREIGN TABLE` now accepts a `cache` option. If set to `true`, a DuckDB table, rather than a view, is created. The table is stored in DuckDB format inside the Postgres data directory.
```sql
CREATE FOREIGN DATA WRAPPER parquet_wrapper
HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;

-- Provide S3 credentials
CREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;

-- Create foreign table with auto schema creation
CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet', cache 'true');

-- Success! Now you can query the remote Parquet file like a regular Postgres table
SELECT COUNT(*) FROM trips;
```
## Why

Caching enables faster queries and sets us up for upserts later on.
## How
Every Postgres database now creates a DuckDB database. Connections are opened against this database.
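The per-database layout could be sketched roughly as follows. This is an illustrative assumption, not the actual implementation: the directory name `pg_analytics` and the function name are hypothetical, and the real code maps databases to DuckDB files in whatever way the extension chooses.

```python
from pathlib import Path

def duckdb_path(data_dir: str, database_oid: int) -> Path:
    """Hypothetical mapping from a Postgres database OID to the DuckDB
    database file stored inside the Postgres data directory. Every
    Postgres database gets its own DuckDB file, and all connections
    from that database are opened against the same file."""
    return Path(data_dir) / "pg_analytics" / f"{database_oid}.duckdb"

print(duckdb_path("/var/lib/postgresql/data", 16384))
```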
## Tests
Now that the DuckDB database is on disk, how does this work with multiple Postgres connection processes accessing it simultaneously? My impression is that DuckDB needs to be in read-only mode to support multiple connections:
https://github.com/duckdb/duckdb/issues/1343 https://github.com/duckdb/duckdb/issues/40
If you feel like it's appropriate to add a test for this, you can make multiple instances of PgConnection in the test suite and try running queries with both.
Great catch. I had missed this caveat, but now I see it: https://duckdb.org/docs/connect/concurrency. Do you think a background worker would solve this?
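The background-worker idea could look roughly like this: a single worker owns the only read-write connection to the on-disk DuckDB database, and other Postgres backends submit work to it instead of opening their own writable connections. This is a minimal stdlib sketch of the serialization pattern only; `WriterWorker` is a hypothetical name, and the real extension would use a Postgres background worker and actual DuckDB connections rather than Python threads.

```python
import queue
import threading

class WriterWorker:
    """Serialize all write access through one worker thread. In the real
    design, this worker would hold the sole read-write DuckDB connection,
    sidestepping DuckDB's one-writer-process limitation."""

    def __init__(self):
        self.tasks: queue.Queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # Drain tasks one at a time: only this thread ever "touches"
        # the (hypothetical) writable DuckDB connection.
        while True:
            func, done = self.tasks.get()
            try:
                done["result"] = func()
            finally:
                done["event"].set()

    def submit(self, func):
        """Run func on the single writer thread and wait for its result."""
        done = {"event": threading.Event(), "result": None}
        self.tasks.put((func, done))
        done["event"].wait()
        return done["result"]

worker = WriterWorker()
print(worker.submit(lambda: 2 + 2))  # prints 4
```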