
Postgres Writer dropping data - pushing incomplete data from parquet

Open arpit94 opened this issue 7 months ago • 0 comments

What happens?

I have a simple Parquet file with two columns: npi (bigint in Postgres, INT64 in Parquet) and primary_taxo_codes (varchar[] in Postgres, BYTE_ARRAY in Parquet).

When I write the data to Postgres using the Postgres connector, data is lost and not all of it reaches Postgres. In the example below, the primary_taxo_codes value for npi = 1003000126 is empty after the write. I can query the Parquet file successfully in DuckDB itself, and a CSV export also works.

To Reproduce

ATTACH 'dbname=<dbname> port=<port> user=<user> host=<host> password=<pass>' AS db (TYPE POSTGRES);
SELECT * FROM 'https://github.com/arpit94/duckdb/raw/main/data/parquet-testing/npi.parquet' where npi = 1003000126;
┌────────────┬────────────────────┐
│    npi     │ primary_taxo_codes │
│   int64    │     varchar[]      │
├────────────┼────────────────────┤
│ 1003000126 │ [207R00000X]       │
└────────────┴────────────────────┘
CREATE OR REPLACE TABLE db.public.my_table as FROM 'https://github.com/arpit94/duckdb/raw/main/data/parquet-testing/npi.parquet';
SELECT * FROM db.public.my_table where npi = 1003000126;
┌────────────┬────────────────────┐
│    npi     │ primary_taxo_codes │
│   int64    │     varchar[]      │
├────────────┼────────────────────┤
│ 1003000126 │                    │
└────────────┴────────────────────┘

The same data is preserved when exported to CSV:

COPY (SELECT * FROM 'https://github.com/arpit94/duckdb/raw/main/data/parquet-testing/npi.parquet') TO 'output.csv' (HEADER, DELIMITER ',');
SELECT * FROM 'output.csv' WHERE npi = 1003000126;
┌────────────┬────────────────────┐
│    npi     │ primary_taxo_codes │
│   int64    │      varchar       │
├────────────┼────────────────────┤
│ 1003000126 │ [207R00000X]       │
└────────────┴────────────────────┘

OS:

Ubuntu

DuckDB Version:

1.0.0

DuckDB Client:

CLI tool

Full Name:

Arpit Aggarwal

Affiliation:

Candor Health

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • [X] Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • [X] Yes, I have

arpit94, Jul 10 '24 12:07