Support for set returning functions that don't exist in DuckDB
What happens?
When "Set Returning Function" as a FORM clause. How can we distinguish between DuckDB execution and PG execution? I can't think of a solution yet.
test=# explain SELECT * FROM generate_series(1, 3); -- ok
QUERY PLAN
------------------------------------------------------------
Custom Scan (DuckDBScan) (cost=0.00..0.00 rows=0 width=0)
DuckDB Execution Plan:
┌───────────────────────────┐
│ GENERATE_SERIES │
│ ──────────────────── │
│ Function: │
│ GENERATE_SERIES │
│ │
│ ~2 Rows │
└───────────────────────────┘
(13 rows)
test=# SELECT * FROM generate_series(1, 3);
generate_series
-----------------
1
2
3
(3 rows)
test=# SELECT * FROM duckdb.cache_info();
2024-12-06 23:58:25.193 CST [689272] WARNING: 01000: (PGDuckDB/CreatePlan) Prepared query returned an error: 'Catalog Error: Table Function with name cache_info does not exist!
Did you mean "main.checkpoint"?
LINE 1: ...e_file_size, cache_file_timestamp FROM duckdb.cache_info() cache_info(remote_p...
^
2024-12-06 23:58:25.193 CST [689272] LOCATION: CreatePlan, pgduckdb_planner.cpp:64
WARNING: (PGDuckDB/CreatePlan) Prepared query returned an error: 'Catalog Error: Table Function with name cache_info does not exist!
Did you mean "main.checkpoint"?
LINE 1: ...e_file_size, cache_file_timestamp FROM duckdb.cache_info() cache_info(remote_p...
^
remote_path | cache_key | cache_file_size | cache_file_timestamp
-------------+-----------+-----------------+----------------------
(0 rows)
test=# \df duckdb.cache_info
List of functions
Schema | Name | Result data type | Argument data types | Type
--------+------------+-------------------------+---------------------+------
duckdb | cache_info | SETOF duckdb.cache_info | | func
(1 row)
test=# \df generate_series
List of functions
Schema | Name | Result data type | Argument data types | Type
------------+-----------------+-----------------------------------+--------------------------------------------------------------------+------
pg_catalog | generate_series | SETOF bigint | bigint, bigint | func
pg_catalog | generate_series | SETOF bigint | bigint, bigint, bigint | func
pg_catalog | generate_series | SETOF integer | integer, integer | func
pg_catalog | generate_series | SETOF integer | integer, integer, integer | func
pg_catalog | generate_series | SETOF numeric | numeric, numeric | func
pg_catalog | generate_series | SETOF numeric | numeric, numeric, numeric | func
pg_catalog | generate_series | SETOF timestamp without time zone | timestamp without time zone, timestamp without time zone, interval | func
pg_catalog | generate_series | SETOF timestamp with time zone | timestamp with time zone, timestamp with time zone, interval | func
pg_catalog | generate_series | SETOF timestamp with time zone | timestamp with time zone, timestamp with time zone, interval, text | func
(9 rows)
To Reproduce
SELECT * FROM duckdb.cache_info(); -- error
SELECT * FROM generate_series(1, 3); -- ok
OS:
centos8
pg_duckdb Version (if built from source use commit hash):
main
Postgres Version (if built from source use commit hash):
16.6
Hardware:
No response
Full Name:
Man Zeng
Affiliation:
Halo
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have not tested with any build
Did you include all relevant data sets for reproducing the issue?
No - Other reason (please specify in the issue body)
Did you include all code required to reproduce the issue?
- [ ] Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?
- [ ] Yes, I have
HI, thanks for report. This is not a bug.
Postgres custom function duckdb.cache_info() doesn't exists in duckdb context so execution will fallback to postgres.
Currently we don't have support for custom postgres function that could be used inside duckdb execution (but this is one of features that we would like to have in future).
For example, consider this example that produce same issue
postgres=# CREATE FUNCTION dummy_func() RETURNS VOID AS $$
postgres$# BEGIN
postgres$# RETURN;
postgres$# END;
postgres$# $$ LANGUAGE plpgsql;
CREATE FUNCTION
postgres=# SELECT * FROM dummy_func();
dummy_func
------------
(1 row)
postgres=# SET duckdb.force_execution TO true;
SET
postgres=# SELECT * FROM dummy_func();
WARNING: (PGDuckDB/CreatePlan) Prepared query returned an error: 'Catalog Error: Table Function with name dummy_func does not exist!
Did you mean "summary"?
LINE 1: SELECT dummy_func FROM dummy_func() dummy_func(dummy_func)
^
dummy_func
------------
(1 row)
postgres=#
To address your question - how execution can be distinguish between pg or duckdb, for now, EXPLAIN can give you indication where it will be executed
postgres=# EXPLAIN SELECT * FROM dummy_func();
WARNING: (PGDuckDB/CreatePlan) Prepared query returned an error: 'Catalog Error: Table Function with name dummy_func does not exist!
Did you mean "summary"?
LINE 1: EXPLAIN SELECT dummy_func FROM dummy_func() dummy_func(dummy_func)
^
QUERY PLAN
---------------------------------------------------------------
Function Scan on dummy_func (cost=0.25..0.26 rows=1 width=4)
(1 row)
postgres=#
嗨,感谢您的举报。这不是一个错误。 Postgres 自定义函数在 duckdb 上下文中不存在,因此执行将回退到 postgres。
duckdb.cache_info()目前,我们不支持可以在 duckdb 执行中使用的自定义 postgres 函数(但这是我们将来希望拥有的功能之一)。
例如,考虑这个生成相同的
issuepostgres=# CREATE FUNCTION dummy_func() RETURNS VOID AS $$ postgres$# BEGIN postgres$# RETURN; postgres$# END; postgres$# $$ LANGUAGE plpgsql; CREATE FUNCTION postgres=# SELECT * FROM dummy_func(); dummy_func ------------ (1 row) postgres=# SET duckdb.force_execution TO true; SET postgres=# SELECT * FROM dummy_func(); WARNING: (PGDuckDB/CreatePlan) Prepared query returned an error: 'Catalog Error: Table Function with name dummy_func does not exist! Did you mean "summary"? LINE 1: SELECT dummy_func FROM dummy_func() dummy_func(dummy_func) ^ dummy_func ------------ (1 row) postgres=#为了回答您的问题 - 目前如何区分 pg 或 duckdb 之间的执行可以为您提供指示它将在何处执行
EXPLAINpostgres=# EXPLAIN SELECT * FROM dummy_func(); WARNING: (PGDuckDB/CreatePlan) Prepared query returned an error: 'Catalog Error: Table Function with name dummy_func does not exist! Did you mean "summary"? LINE 1: EXPLAIN SELECT dummy_func FROM dummy_func() dummy_func(dummy_func) ^ QUERY PLAN --------------------------------------------------------------- Function Scan on dummy_func (cost=0.25..0.26 rows=1 width=4) (1 row) postgres=#
Yes, I know what you mean. I was trying to get rid of the warning. Since functions like generate_series have both pg and duckdb, duckdb can execute them. But a function like duckdb.cache_info is private to pg, so it cannot be executed in duckdb to report a warning and then hand it over to pg. This makes it difficult to tell when functions like these should be handled by duckdb. I want to ask you to see if there is a good way。
Or when the function is in the from clause, we don't send it to duckdb for execution, Is that OK with you?
I think this is something that we'll eventually want to support. But I think it should be built on top of the logic that we're implementing in #477. In #477 we push reading from postgres tables back down to postgres from duckdb. We should in theory be able to do the same for functions afaict. If the function exists in both postgres and DuckDB there's a choice we need to make though, if we push it down to PG or we use the DuckDB implementation.
I'll re-open this issue to track this long term feature. But I don't expect we will start on this soon. First we need #477 merged before someone can even start on this.