PostgreSQL driver mangles duplicate columns
Expected behavior and actual behavior.
At least with the PostgreSQL/PostGIS driver, ogr2ogr and ogrinfo don't seem to play well with duplicate source column names.
Steps to reproduce the problem.
First, create a PostGIS database and two simple tables:
# create table t1(id int not null primary key, geom geometry not null);
CREATE TABLE
# insert into t1 values (1, 'point (10 10)'), (2, 'point (20 20)'), (3, 'point (30 30)');
INSERT 0 3
# create table t2(id int not null references t1(id), attr int not null);
CREATE TABLE
# insert into t2 values (1, 100), (2, 200), (3, 300);
INSERT 0 3
Then try to query them using ogrinfo:
$ docker run --rm -v /etc/passwd:/etc/passwd -v /etc/group:/etc/group -v /var/run/postgresql:/var/run/postgresql -u $(id -u):$(id -g) osgeo/gdal:ubuntu-full-3.3.1 ogrinfo -sql 'select t1.*, t2.* from t1 inner join t2 using (id)' 'PG:dbname=mydb'
INFO: Open of `PG:dbname=mydb'
using driver `PostgreSQL' successful.
Layer name: sql_statement
Geometry: Unknown (any)
Feature Count: 3
Extent: (10.000000, 10.000000) - (30.000000, 30.000000)
Layer SRS WKT:
(unknown)
Geometry Column NOT NULL = geom
id: Integer (0.0) NOT NULL
id: Integer (0.0)
attr: Integer (0.0) NOT NULL
OGRFeature(sql_statement):0
id (Integer) = 1
attr (Integer) = 100
POINT (10 10)
OGRFeature(sql_statement):1
id (Integer) = 2
attr (Integer) = 200
POINT (20 20)
OGRFeature(sql_statement):2
id (Integer) = 3
attr (Integer) = 300
POINT (30 30)
Notice how the second id column shows up as nullable, even though it's not. But that doesn't matter, let's proceed and format the query more nicely (there's an extra newline at the start):
$ docker run --rm -v /etc/passwd:/etc/passwd -v /etc/group:/etc/group -v /var/run/postgresql:/var/run/postgresql -u $(id -u):$(id -g) osgeo/gdal:ubuntu-full-3.3.1 ogrinfo -sql '
> select t1.*, t2.* from t1 inner join t2 using (id)' 'PG:dbname=mydb'
INFO: Open of `PG:dbname=mydb'
using driver `PostgreSQL' successful.
Layer name: sql_statement
Geometry: Unknown (any)
Feature Count: 3
Extent: (10.000000, 10.000000) - (30.000000, 30.000000)
Layer SRS WKT:
(unknown)
id: Integer (0.0)
attr: Integer (0.0)
OGRFeature(sql_statement):1
attr (Integer) = 100
POINT (10 10)
OGRFeature(sql_statement):2
attr (Integer) = 200
POINT (20 20)
OGRFeature(sql_statement):3
attr (Integer) = 300
POINT (30 30)
Now we only have one id column, but:
- the
geomcolumn no longer shows up in the column list at the top idis still nullable- all returned
ids areNULL, so they're not listed in the features - I didn't set a
SRIDon the geometries, but it would be gone if I had
Operating system
N/A (Docker)
GDAL version and provenance
Happens at least on 3.3.1 (official Docker image) and 2.3.2.
One id is used as a feature ID (FID) that is handled with special rules. You can adjust your SQL for example like
-sql "select t1.*, t1.id as id_1, t2.*, t2.id as id_2 from t1 inner join t2 using (id)"
and get result
Layer name: sql_statement
Geometry: Unknown (any)
Feature Count: 3
Extent: (10.000000, 10.000000) - (30.000000, 30.000000)
Layer SRS WKT:
(unknown)
Geometry Column NOT NULL = geom
id: Integer (0.0)
id_1: Integer (0.0) NOT NULL
id: Integer (0.0)
attr: Integer (0.0) NOT NULL
id_2: Integer (0.0) NOT NULL
OGRFeature(sql_statement):0
id (Integer) = 1
id_1 (Integer) = 1
attr (Integer) = 100
id_2 (Integer) = 1
POINT (10 10)
I suppose that by finetuning the SQL you can get a result that makes you happy.
The only solution is to avoid the duplicate column names. Even with your variant of the query, the extra newline makes a difference in the behavior, not only in the number of non-NULL columns, but also in the returned SRS:
$ docker run --rm -v /etc/passwd:/etc/passwd -v /etc/group:/etc/group -v /var/run/postgresql:/var/run/postgresql -u $(id -u):$(id -g) osgeo/gdal:ubuntu-full-3.3.1 ogrinfo -sql 'select t1.*, t1.id as id_1, t2.*, t2.id as id_2 from t1 inner join t2 using (id)' 'PG:dbname=mydb'
INFO: Open of `PG:dbname=mydb'
using driver `PostgreSQL' successful.
Layer name: sql_statement
Geometry: Unknown (any)
Feature Count: 3
Extent: (10.000000, 10.000000) - (30.000000, 30.000000)
Layer SRS WKT:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
Data axis to CRS axis mapping: 2,1
Geometry Column NOT NULL = geom
id: Integer (0.0)
id_1: Integer (0.0) NOT NULL
id: Integer (0.0)
attr: Integer (0.0) NOT NULL
id_2: Integer (0.0) NOT NULL
OGRFeature(sql_statement):0
id (Integer) = 1
id_1 (Integer) = 1
attr (Integer) = 100
id_2 (Integer) = 1
POINT (10 10)
OGRFeature(sql_statement):1
id (Integer) = 2
id_1 (Integer) = 2
attr (Integer) = 200
id_2 (Integer) = 2
POINT (20 20)
OGRFeature(sql_statement):2
id (Integer) = 3
id_1 (Integer) = 3
attr (Integer) = 300
id_2 (Integer) = 3
POINT (30 30)
$ docker run --rm -v /etc/passwd:/etc/passwd -v /etc/group:/etc/group -v /var/run/postgresql:/var/run/postgresql -u $(id -u):$(id -g) osgeo/gdal:ubuntu-full-3.3.1 ogrinfo -sql '
select t1.*, t1.id as id_1, t2.*, t2.id as id_2 from t1 inner join t2 using (id)' 'PG:dbname=mydb'
INFO: Open of `PG:dbname=mydb'
using driver `PostgreSQL' successful.
Layer name: sql_statement
Geometry: Unknown (any)
Feature Count: 3
Extent: (10.000000, 10.000000) - (30.000000, 30.000000)
Layer SRS WKT:
(unknown)
id: Integer (0.0)
id_1: Integer (0.0)
attr: Integer (0.0)
id_2: Integer (0.0)
OGRFeature(sql_statement):1
id_1 (Integer) = 1
attr (Integer) = 100
id_2 (Integer) = 1
POINT (10 10)
OGRFeature(sql_statement):2
id_1 (Integer) = 2
attr (Integer) = 200
id_2 (Integer) = 2
POINT (20 20)
OGRFeature(sql_statement):3
id_1 (Integer) = 3
attr (Integer) = 300
id_2 (Integer) = 3
POINT (30 30)
Sorry, I can't test extra newline on Windows. Is it somehow essential for you to use that?
It's not essential and you can avoid all this trouble if you don't have duplicate column names, but it's a bug: Postgres ignores whitespace and getting different behavior when changing that is quite surprising and can hide the root cause (the column names).
I think that GDAL data model does not support duplicate column names. By running ogr2ogr instead of ogrinfo I can see a message
Warning 1: Field 'id' already exists. Renaming it as 'id2'
The behavior with extra line and that it affects how SRS is reported is indeed odd. But maybe whatever happens does not happen at PostgreSQL level. Have you tried to use SQL with extra lines by using the syntax -sql @filename syntax? Perhaps the problem is in using unescaped linefeed in the command line?
What do you suggest as a fix? Should ogrinfo throw a warning about duplicate field names and rename another field automatically like ogr2ogr does? As a comparison, PSQL reports duplicate fields with SELECT but in another context they do not work
# create table tt_test as select t1.*, t2.* from t1 inner join t2 using (id);
ERROR: column "id" specified more than once
Here ogr2ogr is more user friendly than PostgreSQL, while ogrinfo is not.
Please notice that I am a GDAL user, not a developer.
I think ogr2ogr warns, but only some of the times, depending on whitespace, query structure maybe or something else. So even ogr2ogr is affected by this issue, which is how I originally discovered it.
Looks like the difference is caused by the PostgreSQL driver treats SELECTs differently. The extra newline breaks that auto-detection and it takes the non-cursor path.
psql -f sends the query as-is, and of course, it has no trouble with the extra whitespace.
You renamed the title, but is the issue at all in duplicate column names or in starting the -sql parameter value with newline or other white space? Do you believe that it can ever work on Windows to use newlines on a command line? I know that ^ works partially like \ on Linux but that is not anything to trust in. Optfiles with linefeeds do work.
I guess it depends on how you look at it:
- the extra whitespace makes the driver take a different code path
- this code path doesn't handle duplicate columns properly
- the main one does, more or less (there's still the dropped not-
NULLissue, at least) - there are ways to hit the alternate code path without relying on whitespace; I haven't tried it, but
delete from ... returning ...comes to mind - I wasn't able to reproduce the whitespace issue using either
@file.sqlor extra spaces at the start; something must be trimming the strings - it might be possible to reproduce it using
ExecuteSQL, but I haven't tried
Please notice also that the FID column has special rules and NOT NULL is not reported as you maybe believe. Try plain ogrinfo for the table t1:
Layer name: t1
Geometry: Unknown (any)
Feature Count: 3
Extent: (10.000000, 10.000000) - (30.000000, 30.000000)
Layer SRS WKT:
(unknown)
FID Column = id
Geometry Column NOT NULL = geom
OGRFeature(t1):1
POINT (10 10)
Special rules are because FID must always be NOT NULL and unique and usually it is not a good idea to expose FIDs for users as editable. If you continue to experiment with duplicated column names you should perhaps do it with columns which are not used as FIDs.
hey , I want to contribute in this issue . Since , I am a beginner and I know about sql I want to participate . Could you please guide me so that I can move forward ??
Thanks for the interest in contributing to GDAL. This issue, while it may feel simple, is probably more complicated because the attributes with duplicated names are FID columns. I do not have good enough knowledge about the GDAL inner life with FIDs to say how to handle them in this case. @lnicola has maybe not yet done more experiments with non-FID columns.
The other issue with extra white space (format the query more nicely (there's an extra newline at the start)) might be easier but I do not know if newlines are generally supported in command line commands. If newlines would be accepted they should work in the same way on Linux and Windows.
I apologize that I cannot give more accurate technical advice, I am not a developer really.
Thanks for guiding me I'll try to get more knowledge about it . But, could you please suggest me some good first issue because I am eager to contribute in this org .
I think the difference in behaviour with a newline at the start of the query has an easy solution.
@lnicola is right that the leading newline breaks detection of 'SELECT' queries and causes the code to take a different (non-transactional) path.
There is code to skip leading whitespace in OGRPGDataSource::ExecuteSQL(), but it is janky. It only skips over a contiguous sequence of space (0x20) characters. All other forms of whitespace are not skipped. I think this can be fixed like this:
diff --git a/ogr/ogrsf_frmts/pg/ogrpgdatasource.cpp b/ogr/ogrsf_frmts/pg/ogrpgdatasource.cpp
index 93ef65bd83..4da0528565 100644
--- a/ogr/ogrsf_frmts/pg/ogrpgdatasource.cpp
+++ b/ogr/ogrsf_frmts/pg/ogrpgdatasource.cpp
@@ -2926,7 +2926,7 @@ OGRLayer * OGRPGDataSource::ExecuteSQL( const char *pszSQLCommand,
{
/* Skip leading spaces */
- while(*pszSQLCommand == ' ')
+ while(std::isspace(static_cast<unsigned char>(*pszSQLCommand)))
pszSQLCommand ++;
FlushCache(false);
As for the duplicate column names, I'd just resolve by specifying the columns I want instead of using * (which is a good habit to get into in any case). It's wasteful to select both sides of an inner join condition, you already know it will have the same values, so:
SELECT t1.id, t1.geom, t2.attr FROM t1 INNER JOIN t2 USING (id);
I think this can be fixed like this:
a PR with this is welcome
I think this can be fixed like this
Agreed.
As for the duplicate column names, I'd just resolve by specifying the columns I want instead of using * (which is a good habit to get into in any case). It's wasteful to select both sides of an inner join condition
I know, and it has been said on the thread before. But we can't prevent people from writing queries like these and in some cases the query itself is user-provided. Saying "your SQL is ~~inefficient~~ bad" isn't the solution here.
On a bit of a tangent, but the stated reason the PG driver has a different code path for SELECT is to avoid putting statements like CREATE DATABASE or VACUUM inside of a transaction block. In Postgres, there are a whole lot more statements that do work very happily in a transaction, than don't. If we are mainly concerned about CREATE DATABASE and VACCUM, perhaps we should test for those, and treat everything else as transactional?