postgres_scanner
Write-back to GP is blocked by a lock when using the Postgres scanner
What happens?
I am writing functions in PL/Python in Greenplum (GP) with three main steps:
- Using DuckDB to read GP tables.
- Calculating a result table ("rst", tmp_rst in the code below) in DuckDB, extracting its DDL, and creating the table in GP; I need to add distribution keys, compression, etc., which the default CREATE does not include, so I run the rewritten DDL with plpy.execute(ddl):
tb="gto_skc_store_step_kpi_summary_duck"
# fetch the CREATE TABLE statement DuckDB generated for tmp_rst
sql="""select sql from duckdb_tables() where table_name like 'tmp_rst' and schema_name = 'main';"""
ddl=dd.execute(sql).fetchone()[0]
# rewrite the DDL for GP: rename the table, map DOUBLE to numeric,
# and append storage options plus a distribution key
ddl=f"""drop table if exists tenant_{tenant}_adm.{tb};
"""+ddl.replace('CREATE TABLE tmp_rst', f'CREATE TABLE tenant_{tenant}_adm.{tb}') \
.replace('DOUBLE', 'numeric').replace(';',"""WITH (
appendonly=true,
orientation=column,
compresslevel=6,
compresstype=zlib
)
DISTRIBUTED BY (skc_sk);""")
plpy.notice(ddl)
plpy.execute(ddl)  # runs inside the calling function's open transaction
- Writing the "rst" result table to GP (using COPY or INSERT):
insert into gp.tenant_{tenant}_adm.{tb}
select * from tmp_rst
-- or
COPY gp.tenant_{tenant}_adm.gto_skc_store_step_kpi_summary_duck FROM '/home/gpadmin/{tb}.parquet';
The third step never completes, and pg_stat_activity shows it stuck in a locked state.
However, if I split this into two functions, where the first performs the first two steps and writes the result to a Parquet file, and the second reads the Parquet file and writes it to GP, the whole job finishes in just tens of seconds.
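The split version looks roughly like this (same paths and table names as above; spelling out FORMAT PARQUET is my sketch):

-- function 1 (DuckDB, after steps 1 and 2): dump the result to a file
COPY tmp_rst TO '/home/gpadmin/gto_skc_store_step_kpi_summary_duck.parquet' (FORMAT PARQUET);

-- function 2 (a separate call, after the first function's transaction has committed):
COPY gp.tenant_hggp7_adm.gto_skc_store_step_kpi_summary_duck
FROM '/home/gpadmin/gto_skc_store_step_kpi_summary_duck.parquet';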
Why?
pg_stat_activity: (screenshot attached)
pg_locks in the attached spreadsheet: pg_locks.xlsx
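My guess (not confirmed): plpy.execute(ddl) runs inside the calling function's transaction, which cannot commit until the function returns, while the Postgres scanner writes through its own connection. Since the DDL starts with DROP TABLE IF EXISTS, a table left over from a previous run stays exclusively locked by that open transaction, and the scanner's INSERT waits on it. In two plain psql sessions the pattern would look like this (simplified table; the mechanism is my assumption):

-- session 1: what the PL/Python function's transaction effectively does
BEGIN;
DROP TABLE IF EXISTS t;   -- assumes t exists from a previous run; AccessExclusiveLock held until COMMIT
CREATE TABLE t (id int);

-- session 2: the Postgres scanner's own connection
INSERT INTO t VALUES (1);
-- blocks on session 1's lock; session 1 cannot COMMIT until the function returns,
-- and the function is waiting on this INSERT, so it hangs forever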
To Reproduce
import duckdb as dd
tenant="hggp7"; tb="gto_skc_store_step_kpi_summary_duck"  # values from the session above
dd.execute("ATTACH 'dbname=mdmaster_hggp7_dev user=gpadmin host=127.0.0.1 port=2345' AS gp (TYPE postgres);")
sql=f"""COPY gp.tenant_{tenant}_adm.{tb} FROM '/home/gpadmin/{tb}.parquet';"""
dd.execute(sql)
OS:
CentOS 7
DuckDB Version:
0.10.1
DuckDB Client:
Python 3.9
Full Name:
wang cz
Affiliation:
Linezone
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- [X] Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- [X] Yes, I have
Hi @wonb168, thanks for opening this issue. Could you please make the script self-contained with imports?
I wrote a minimal demo:
-- create extension plpython3u;
CREATE OR REPLACE FUNCTION test_ddl() RETURNS text AS $$
import duckdb
def exesql(sql):
    plpy.notice(sql)
    dd.execute(sql)
# tried both a file-backed and an in-memory DuckDB; only the last connect is used
# dd=duckdb.connect('gp.duckdb')
dd=duckdb.connect()
dbname="mdmaster_hggp7_dev"
tenant=dbname.split('_')[1]
dburl=f"dbname={dbname} user=gpadmin host=127.0.0.1 port=2345"
sql=f"load postgres;ATTACH '{dburl}' AS gp (TYPE postgres);"
exesql(sql)
sql="drop table if exists tmp_rst;create table tmp_rst(id int,name text);insert into tmp_rst values (1,'a');"
exesql(sql)
sql="""select sql from duckdb_tables() where table_name like 'tmp_rst' and schema_name = 'main';"""
sql="drop table if exists public.tmp_rst;"+(dd.execute(sql).fetchone()[0]).replace('tmp_rst','public.tmp_rst')
plpy.notice(sql)
plpy.execute(sql)
sql=f"detach gp;ATTACH '{dburl}' AS gp (TYPE postgres);"
#sql="CALL pg_clear_cache();"
#sql="call postgres_execute(sql)" #Table Function with name postgres_execute does not exist!
exesql(sql)
sql="insert into gp.public.tmp_rst select * from tmp_rst"
exesql(sql)
return 'done'
$$ LANGUAGE plpython3u;
select test_ddl();
It raises an error: no such table. My guess is that DuckDB captured the GP tables at ATTACH time, so it does not know about the new table and cannot execute the INSERT. But how can I get the newest tables from GP? I tried CALL pg_clear_cache() and re-attaching; neither works!
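If my guess above about the open transaction is right, no cache refresh can help here: the table created by plpy.execute is not committed yet, so DuckDB's separate connection cannot see it no matter how often it re-attaches. A plain two-session illustration of the visibility rule (my assumption about the cause, simplified):

-- session 1: the function's still-open transaction
BEGIN;
CREATE TABLE public.tmp_rst (id int, name text);

-- session 2: any other connection, including a fresh ATTACH from DuckDB
SELECT * FROM public.tmp_rst;
-- ERROR: relation "public.tmp_rst" does not exist
-- the pg_class row only becomes visible to other sessions after session 1 commits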