pg_analytics icon indicating copy to clipboard operation
pg_analytics copied to clipboard

feat: Implement caching

Open rebasedming opened this issue 1 year ago • 2 comments

Ticket(s) Closed

  • Closes #

What

CREATE FOREIGN TABLE now has a cache option. If set to true, a DuckDB table instead of view is created. This stores the table in DuckDB format inside the Postgres data directory.

CREATE FOREIGN DATA WRAPPER parquet_wrapper
HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;

-- Provide S3 credentials
CREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;

-- Create foreign table with auto schema creation
CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet', cache 'true');

-- Success! Now you can query the remote Parquet file like a regular Postgres table
SELECT COUNT(*) FROM trips;

Why

Enable caching for faster queries, sets us up for upserts later on.

How

Every Postgres database now creates a DuckDB database. Connections are opened against this database.

Tests

rebasedming avatar Jul 31 '24 19:07 rebasedming

Now that the DuckDB database is on disk, how does this work with multiple Postgres connection processes accessing it simultaneously? My impression is that DuckDB needs to be in read-only mode to support multiple connections:

https://github.com/duckdb/duckdb/issues/1343 https://github.com/duckdb/duckdb/issues/40

If you feel like it's appropriate to add a test for this, you can make multiple instances of PgConnection in the test suite and try running queries with both.

neilyio avatar Jul 31 '24 19:07 neilyio

Now that the DuckDB database is on disk, how does this work with multiple Postgres connection processes accessing it simultaneously? My impression is that DuckDB needs to be in read-only mode to support multiple connections:

duckdb/duckdb#1343 duckdb/duckdb#40

If you feel like it's appropriate to add a test for this, you can make multiple instances of PgConnection in the test suite and try running queries with both.

Great catch. I had missed this caveat but I now see it https://duckdb.org/docs/connect/concurrency.

Do you think a background worker would solve?

rebasedming avatar Jul 31 '24 20:07 rebasedming