
Local object cache for cached_httpfs extension

Open mkaruza opened this issue 1 year ago • 8 comments

  • To cache a remote object, the user needs to explicitly call the duckdb.cache(path, type) function. path is a remote HTTPFS/S3/GCS/R2 object path and type is either parquet or csv, indicating the remote object type. Caching is not triggered by normal SELECT queries.

  • When a file is currently being cached and another query targeting the same remote object runs in parallel, the second query will fail because it cannot take the read lock.

  • PG_DATADIR/duckdb_cache will be used as the cache directory (one possible path mapping is sketched below).
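As a minimal sketch of the path mapping, assuming the remote URL is hashed into a stable cache file name (the CacheFileName helper and the use of std::hash are hypothetical illustrations, not the actual implementation):

	// Illustrative only: map a remote object path to a file under
	// PG_DATADIR/duckdb_cache by hashing the URL, so the same object
	// always resolves to the same cache entry.
	#include <functional>
	#include <sstream>
	#include <string>

	std::string CacheFileName(const std::string &pg_datadir, const std::string &remote_path) {
		std::ostringstream name;
		name << std::hex << std::hash<std::string>{}(remote_path);
		return pg_datadir + "/duckdb_cache/" + name.str();
	}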

mkaruza avatar Aug 02 '24 17:08 mkaruza

FYI, if the first connection is caching a remote file:

postgres=# select duckdb.cache('https://datasets.clickhouse.com/hits_compatible/hits.parquet');

Until this file is downloaded, all other queries on the same remote file will fail with:

postgres=# select count(*) from read_parquet('https://datasets.clickhouse.com/hits_compatible/hits.parquet') AS (v INT);
WARNING:  (DuckDB) IO Error: Could not set lock on file "/tmp/871ec48c8679b179e848071df4cfc5ef": Conflicting lock is held in /work/hydra/postgres/src/backend/postgres (PID 22174) by user mkaruza. See also https://duckdb.org/docs/connect/concurrency
ERROR:  Function `read_parquet(TEXT)` only works with Duckdb execution.
CONTEXT:  PL/pgSQL function read_parquet(text) line 3 at RAISE

mkaruza avatar Aug 02 '24 17:08 mkaruza

This seems useful in general, and not specific to pg_duck. How about doing this in the main duckdb repo, or as a separate extension?

hlinnaka avatar Aug 04 '24 08:08 hlinnaka

Until this file is downloaded, all other queries on the same remote file will fail

I understand the premise here, but ideally the query would detect the lock and then continue, accessing the file "directly" as if the caching weren't present.

Another thing to consider: what if the original cache call is cancelled or otherwise interrupted? Aside from the lock, how does the cache ensure the file it finds is the complete file?
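A minimal sketch of both points, assuming a POSIX flock-style lock and an atomic-rename publish step (the helper names are hypothetical and this is not how cached_httpfs actually implements its locking):

	// Illustrative, POSIX-only sketch: fall back to reading the remote object
	// directly when the cache file's lock is held by a concurrent writer, and
	// make completed downloads visible only via an atomic rename so an
	// interrupted duckdb.cache() call never leaves a truncated file under the
	// final cache name.
	#include <cstdio>
	#include <fcntl.h>
	#include <string>
	#include <sys/file.h>
	#include <unistd.h>

	// Returns true if a shared (read) lock can be taken, i.e. no writer is active.
	static bool TryLockShared(const std::string &path) {
		int fd = open(path.c_str(), O_RDONLY);
		if (fd < 0) {
			return false;                        // cache entry does not exist (yet)
		}
		if (flock(fd, LOCK_SH | LOCK_NB) != 0) {
			close(fd);
			return false;                        // another backend holds the write lock
		}
		close(fd);                               // releases the lock again; illustration only
		return true;
	}

	// Choose between the cached copy and direct remote access.
	std::string PickSource(const std::string &url, const std::string &cache_path) {
		return TryLockShared(cache_path) ? cache_path : url;
	}

	// Publish a finished download: write to a temporary path first, then rename.
	void PublishDownload(const std::string &tmp_path, const std::string &cache_path) {
		std::rename(tmp_path.c_str(), cache_path.c_str()); // atomic on POSIX filesystems
	}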

wuputah avatar Aug 04 '24 16:08 wuputah

The test fails because we changed how replacement scan names are assigned; ReplacementScanInput now has:

	const string &catalog_name;
	const string &schema_name;
	const string &table_name;

To get back the old behavior, you can use auto table_name = ReplacementScan::GetFullPath(input);
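For context, a hedged sketch of what that looks like in a replacement scan callback; MyReplacementScan and the .parquet check are illustrative rather than pg_duckdb code, and exact header paths may differ between DuckDB versions:

	// Illustrative replacement scan callback using the new ReplacementScanInput.
	#include "duckdb.hpp"
	#include "duckdb/parser/expression/constant_expression.hpp"
	#include "duckdb/parser/expression/function_expression.hpp"
	#include "duckdb/parser/tableref/table_function_ref.hpp"

	using namespace duckdb;

	static unique_ptr<TableRef> MyReplacementScan(ClientContext &context, ReplacementScanInput &input,
	                                              optional_ptr<ReplacementScanData> data) {
		// Recombine catalog_name/schema_name/table_name into the old single path string.
		auto table_name = ReplacementScan::GetFullPath(input);
		if (!StringUtil::EndsWith(StringUtil::Lower(table_name), ".parquet")) {
			return nullptr; // not ours; let other replacement scans have a go
		}
		// Rewrite the bare path into a read_parquet(<path>) table function call.
		vector<unique_ptr<ParsedExpression>> children;
		children.push_back(make_uniq<ConstantExpression>(Value(table_name)));
		auto table_function = make_uniq<TableFunctionRef>();
		table_function->function = make_uniq<FunctionExpression>("read_parquet", std::move(children));
		return std::move(table_function);
	}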

Tishj avatar Aug 05 '24 07:08 Tishj

I'd like to see this PR get split into 3:

  • Upgrade to a newer duckdb version and apply the necessary changes to fix breaking tests
  • Copy the HTTPFS extension to third_party/cached_httpfs without modifications, and add the necessary changes to pg_duckdb for it
  • Add CachedFile and friends to the CachedHttpfsExtension, and add the additional changes to pg_duckdb relevant to this caching mechanism

Tishj avatar Aug 05 '24 08:08 Tishj

This seems useful in general, and not specific to pg_duck. How about doing this in the main duckdb repo, or as a separate extension?

Yeah, I think it makes much more sense to include this as a feature in the main httpfs extension, as opposed to maintaining a fork of that extension in this repo.

JelteF avatar Aug 05 '24 09:08 JelteF

Typically duckdb does not need this type of caching because when you use duckdb, you have access to the local filesystem. We can consider contributing it back upstream at some point, but the decision was made to work on this here since it's more relevant to this use case.

Apply patches to that copied source to add the caching modifications

I get the point of being able to have the original code so we can diff it. Keeping this up to date with the upstream code is definitely a concern, but working with patches doesn't sound like fun either.

wuputah avatar Aug 05 '24 15:08 wuputah

Typically duckdb does not need this type of caching because when you use duckdb, you have access to the local filesystem.

I don't understand this argument (but I might be missing something). AFAICT the data requested by httpfs is by definition not on the local filesystem (unless you use httpfs to connect to some webserver on localhost, of course). So providing a way to cache remote files onto the local filesystem seems pretty useful even for plain duckdb users of httpfs.

JelteF avatar Aug 05 '24 15:08 JelteF