pg_duckdb
pg_duckdb copied to clipboard
Case insensitivity mismatch between duckdb and postgres can cause issues
DuckDB considers identifiers with different casing the same even if they are quoted.
The most minimal example of this is the following schema:
CREATE TABLE a(b int);
CREATE TABLE "A"(c int);
INSERT INTO a VALUES(1);
When using postgres execution this correctly outputs:
> SELECT * FROM a, "A" as aa;
b │ c
───┼───
(0 rows)
But when enabling duckdb execution both a and "A" are interpreted as a. Resulting in the following incorrect result:
> SELECT * FROM a, "A" as aa;
b │ b
───┼───
1 │ 1
(1 row)
This happens even when I set preserver_identifier_case to true on the DuckDB connection.
Good find. Some related discussion here: https://github.com/duckdb/pg_duckdb/issues/56#issuecomment-2270785294, and related to #43 -- though these are not specific to case insensitivity.
This may be related, but in either of the two below, the field name cases aren't maintained in the new postgres table. Using column names in double quotes instead of a wildcard also does not resolve the issue
CREATE TABLE postgres_staging.public.{table_name} AS SELECT * FROM postgres_prod.public.{table_name}
duckdb.sql(f'CREATE TABLE {table_name} AS SELECT * FROM postgres_prod.public.{table_name}')
duckdb.sql(f'CREATE TABLE postgres_staging.public.{table_name} AS SELECT * FROM {table_name}')
This shouldn't be in the long term milestone. but an important bug to fix.
@pantonis we'll try to address this asap.
Based on some research I did this should be quite easy to address, postgres_scanner is able to do this. I'm pretty sure it's a few lines of code only. So we can use the same approach as is used there. Let's wait until #477 is merged though, to avoid any merge conflicts.
We had the same issue when we were using Hydra and it was fixed quite fast I can say.
This seems to be solved in the current version.
We had the same issue when we were using Hydra and it was fixed quite fast I can say.
If it's indeed resolved we should add some tests to make sure we don't regress again.
Not sure if it's related, but it seems you can't change the casing of the names with as if the name is upper when reading from parquet.
create a parquet, like
duckdb -c "COPY (select generate_series as \"UPPER\", generate_series as \"lower\" FROM generate_series(100)) TO '/tmp/tmp.parquet' (FORMAT 'parquet')"
then from the postgresql
select * from read_parquet('/tmp/tmp.parquet') as t(upper int8, lower int8) limit 2;
UPPER | lower
-------+-------
0 | 0
1 | 1
(2 rows)
select upper as "upper", lower as "LOWER" from read_parquet('/tmp/tmp.parquet') as t(upper int8, lower int8) limit 2;
UPPER | LOWER
-------+-------
0 | 0
1 | 1
(2 rows)
select upper as "upper_", lower as "LOWER_" from read_parquet('/tmp/tmp.parquet') as t(upper int8, lower int8) limit 2;
upper_ | LOWER_
--------+--------
0 | 0
1 | 1
(2 rows)
can't rename UPPER to lowercase. If you change the name it works fine.
interestingly it only happens UPPERCASE to lowercase, not the opposite as shown in the lower column
let me know if I should open a new issue
@CaselIT can you try that new again with the current main branch? #531 might very well change the behavior you're describing.
just tried on docker.io/pgduckdb/pgduckdb:17-main and I get the same result. I originally were using 0.2
Alright, can you create a separate issue for this?
@JelteF ok, I've crated #564