pg_duckdb icon indicating copy to clipboard operation
pg_duckdb copied to clipboard

Vendor in Postgres ruleutils and handle public.read_csv

Open JelteF opened this issue 1 year ago • 1 comments

This PR contains multiple commits. These are not intended to be squashed into a single commit when merging, to keep it clear what code came from Postgres and what code we added on top. See the commit messages of these commits for details. But the main changes are:

  1. Vendor in ruleutils.c from Postgres, so we can modify its pg_get_querydef function to generate queries that are valid in DuckDB.
  2. Use these new vendored files to support the following query which fails on main:
set search_path = '';
SELECT * from public.read_csv('https://raw.githubusercontent.com/duckdb/duckdb/main/data/csv/real/web_page.csv') as (column00 int)`

Obviously item 2 is not the main reason to vendor in 10k+ lines of code (for each Postgres version). The main reason is so we are have the ability in future PRs to modify the query slightly before sending it to DuckDB. One such things is to be able to fully qualify the query before sending that query to DuckDB. This is important to solve all kinds of search_path issues like described in #43 because of differences between Postgres and DuckDB its search_path handling. The public.read_csv search_path confusion is just an example of how we can use this to solve such issues.

Note that this is not the only way to solve search_path issues. By introducing a special pgduckdb catalog in DuckDB, like #97 does, we could solve many of the search_path issues on the DuckDB side too. But I think quite a few can be solved solved easier on the Postgres side. Primarily because we want to DuckDB to adhere to the way Postgres did the binding, and the easiest way to do that is by having Postgres tell DuckDB exactly how it resolved the query.

This PR also does not obsolete #97, it is actually meant to work well in combination with it. Because using this PR we can easily generate fully qualified table names including the pgduckdb catalog. So SELECT count(*) from mytable would become SELECT count(*) AS count FROM pgduckdb.public.mytable.

JelteF avatar Aug 20 '24 14:08 JelteF

As described in another design document eventually I want to add support for multiple databases (after we support a single MotherDuck database). That would require us to translate schema names from postgres to database+schema in DuckDB, which again could be easily handled at the point where we're creating the DuckDB query.

JelteF avatar Aug 20 '24 15:08 JelteF