postgres_scanner
Write-back to GP is blocked by a lock when using the Postgres scanner
What happens?
I am writing functions in PL/Python in Greenplum (GP) with three main steps:
- Using DuckDB to read GP tables.
- Calculating a result table ("rst", tmp_rst in the code below) in DuckDB, extracting its DDL, and creating the table in GP; I need to add distribution keys, compression, etc., which the default CREATE does not include, so I run the rewritten DDL with plpy.execute(ddl):
tb="gto_skc_store_step_kpi_summary_duck"
# fetch the CREATE TABLE statement DuckDB generated for tmp_rst
sql="""select sql from duckdb_tables() where table_name like 'tmp_rst' and schema_name = 'main';"""
ddl=dd.execute(sql).fetchone()[0]
# rewrite the DDL for GP: rename the table, map DOUBLE to numeric,
# and append storage options plus a distribution key
ddl=f"""drop table if exists tenant_{tenant}_adm.{tb};
"""+ddl.replace('CREATE TABLE tmp_rst', f'CREATE TABLE tenant_{tenant}_adm.{tb}') \
.replace('DOUBLE', 'numeric').replace(';',"""WITH (
appendonly=true,
orientation=column,
compresslevel=6,
compresstype=zlib
)
DISTRIBUTED BY (skc_sk);""")
plpy.notice(ddl)
plpy.execute(ddl)  # runs inside the calling function's open transaction
- Writing the "rst" result table to GP (using COPY or INSERT):
insert into gp.tenant_{tenant}_adm.{tb}
select * from tmp_rst
-- or
COPY gp.tenant_{tenant}_adm.gto_skc_store_step_kpi_summary_duck FROM '/home/gpadmin/{tb}.parquet';
The third step never completes, and pg_stat_activity shows it stuck in a locked state.
However, if I split this into two functions, where the first performs the first two steps and writes the result to a Parquet file, and the second reads the Parquet file and writes it to GP, the whole job finishes in just tens of seconds.
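The split version looks roughly like this (same paths and table names as above; spelling out FORMAT PARQUET is my sketch):

-- function 1 (DuckDB, after steps 1 and 2): dump the result to a file
COPY tmp_rst TO '/home/gpadmin/gto_skc_store_step_kpi_summary_duck.parquet' (FORMAT PARQUET);

-- function 2 (a separate call, after the first function's transaction has committed):
COPY gp.tenant_hggp7_adm.gto_skc_store_step_kpi_summary_duck
FROM '/home/gpadmin/gto_skc_store_step_kpi_summary_duck.parquet';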
Why?
pg_stat_activity: (screenshot attached)
pg_locks in the attached spreadsheet: pg_locks.xlsx
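My guess (not confirmed): plpy.execute(ddl) runs inside the calling function's transaction, which cannot commit until the function returns, while the Postgres scanner writes through its own connection. Since the DDL starts with DROP TABLE IF EXISTS, a table left over from a previous run stays exclusively locked by that open transaction, and the scanner's INSERT waits on it. In two plain psql sessions the pattern would look like this (simplified table; the mechanism is my assumption):

-- session 1: what the PL/Python function's transaction effectively does
BEGIN;
DROP TABLE IF EXISTS t;   -- assumes t exists from a previous run; AccessExclusiveLock held until COMMIT
CREATE TABLE t (id int);

-- session 2: the Postgres scanner's own connection
INSERT INTO t VALUES (1);
-- blocks on session 1's lock; session 1 cannot COMMIT until the function returns,
-- and the function is waiting on this INSERT, so it hangs forever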
To Reproduce
import duckdb as dd
tenant="hggp7"; tb="gto_skc_store_step_kpi_summary_duck"  # values from the session above
dd.execute("ATTACH 'dbname=mdmaster_hggp7_dev user=gpadmin host=127.0.0.1 port=2345' AS gp (TYPE postgres);")
sql=f"""COPY gp.tenant_{tenant}_adm.{tb} FROM '/home/gpadmin/{tb}.parquet';"""
dd.execute(sql)
OS:
CentOS 7
DuckDB Version:
0.10.1
DuckDB Client:
Python 3.9
Full Name:
wang cz
Affiliation:
Linezone
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- [X] Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- [X] Yes, I have
Hi @wonb168, thanks for opening this issue. Could you please make the script self-contained with imports?
I wrote a minimal demo:
-- create extension plpython3u;
CREATE OR REPLACE FUNCTION test_ddl() RETURNS text AS $$
import duckdb
def exesql(sql):
    plpy.notice(sql)
    dd.execute(sql)
# tried both a file-backed and an in-memory DuckDB; only the last connect is used
# dd=duckdb.connect('gp.duckdb')
dd=duckdb.connect()
dbname="mdmaster_hggp7_dev"
tenant=dbname.split('_')[1]
dburl=f"dbname={dbname} user=gpadmin host=127.0.0.1 port=2345"
sql=f"load postgres;ATTACH '{dburl}' AS gp (TYPE postgres);"
exesql(sql)
sql="drop table if exists tmp_rst;create table tmp_rst(id int,name text);insert into tmp_rst values (1,'a');"
exesql(sql)
sql="""select sql from duckdb_tables() where table_name like 'tmp_rst' and schema_name = 'main';"""
sql="drop table if exists public.tmp_rst;"+(dd.execute(sql).fetchone()[0]).replace('tmp_rst','public.tmp_rst')
plpy.notice(sql)
plpy.execute(sql)
sql=f"detach gp;ATTACH '{dburl}' AS gp (TYPE postgres);"
#sql="CALL pg_clear_cache();"
#sql="call postgres_execute(sql)" #Table Function with name postgres_execute does not exist!
exesql(sql)
sql="insert into gp.public.tmp_rst select * from tmp_rst"
exesql(sql)
return 'done'
$$ LANGUAGE plpython3u;
select test_ddl();
It raises an error: no such table. My guess is that DuckDB captured the GP tables at ATTACH time, so it does not know about the new table and cannot execute the INSERT. But how can I get the newest tables from GP? I tried CALL pg_clear_cache() and re-attaching; neither works!
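If my guess above about the open transaction is right, no cache refresh can help here: the table created by plpy.execute is not committed yet, so DuckDB's separate connection cannot see it no matter how often it re-attaches. A plain two-session illustration of the visibility rule (my assumption about the cause, simplified):

-- session 1: the function's still-open transaction
BEGIN;
CREATE TABLE public.tmp_rst (id int, name text);

-- session 2: any other connection, including a fresh ATTACH from DuckDB
SELECT * FROM public.tmp_rst;
-- ERROR: relation "public.tmp_rst" does not exist
-- the pg_class row only becomes visible to other sessions after session 1 commits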