gdal icon indicating copy to clipboard operation
gdal copied to clipboard

PostgreSQL driver mangles duplicate columns

Open lnicola opened this issue 4 years ago • 16 comments

Expected behavior and actual behavior.

At least with the PostgreSQL/PostGIS driver, ogr2ogr and ogrinfo don't seem to play well with duplicate source column names.

Steps to reproduce the problem.

First, create a PostGIS database and two simple tables:

# create table t1(id int not null primary key, geom geometry not null);
CREATE TABLE
# insert into t1 values (1, 'point (10 10)'), (2, 'point (20 20)'), (3, 'point (30 30)');
INSERT 0 3
# create table t2(id int not null references t1(id), attr int not null);
CREATE TABLE
# insert into t2 values (1, 100), (2, 200), (3, 300);
INSERT 0 3

Then try to query them using ogrinfo:

$ docker run --rm -v /etc/passwd:/etc/passwd -v /etc/group:/etc/group -v /var/run/postgresql:/var/run/postgresql -u $(id -u):$(id -g) osgeo/gdal:ubuntu-full-3.3.1 ogrinfo -sql 'select t1.*, t2.* from t1 inner join t2 using (id)' 'PG:dbname=mydb'
INFO: Open of `PG:dbname=mydb'
      using driver `PostgreSQL' successful.

Layer name: sql_statement
Geometry: Unknown (any)
Feature Count: 3
Extent: (10.000000, 10.000000) - (30.000000, 30.000000)
Layer SRS WKT:
(unknown)
Geometry Column NOT NULL = geom
id: Integer (0.0) NOT NULL
id: Integer (0.0)
attr: Integer (0.0) NOT NULL
OGRFeature(sql_statement):0
  id (Integer) = 1
  attr (Integer) = 100
  POINT (10 10)

OGRFeature(sql_statement):1
  id (Integer) = 2
  attr (Integer) = 200
  POINT (20 20)

OGRFeature(sql_statement):2
  id (Integer) = 3
  attr (Integer) = 300
  POINT (30 30)

Notice how the second id column shows up as nullable, even though it's not. But that doesn't matter, let's proceed and format the query more nicely (there's an extra newline at the start):

$ docker run --rm -v /etc/passwd:/etc/passwd -v /etc/group:/etc/group -v /var/run/postgresql:/var/run/postgresql -u $(id -u):$(id -g) osgeo/gdal:ubuntu-full-3.3.1 ogrinfo -sql '
> select t1.*, t2.* from t1 inner join t2 using (id)' 'PG:dbname=mydb'

INFO: Open of `PG:dbname=mydb'
      using driver `PostgreSQL' successful.

Layer name: sql_statement
Geometry: Unknown (any)
Feature Count: 3
Extent: (10.000000, 10.000000) - (30.000000, 30.000000)
Layer SRS WKT:
(unknown)
id: Integer (0.0)
attr: Integer (0.0)
OGRFeature(sql_statement):1
  attr (Integer) = 100
  POINT (10 10)

OGRFeature(sql_statement):2
  attr (Integer) = 200
  POINT (20 20)

OGRFeature(sql_statement):3
  attr (Integer) = 300
  POINT (30 30)

Now we only have one id column, but:

  • the geom column no longer shows up in the column list at the top
  • id is still nullable
  • all returned ids are NULL, so they're not listed in the features
  • I didn't set a SRID on the geometries, but it would be gone if I had

Operating system

N/A (Docker)

GDAL version and provenance

Happens at least on 3.3.1 (official Docker image) and 2.3.2.

lnicola avatar Aug 23 '21 13:08 lnicola

One id is used as a feature ID (FID) that is handled with special rules. You can adjust your SQL for example like

-sql "select t1.*, t1.id as id_1, t2.*, t2.id as id_2 from t1 inner join t2 using (id)"

and get result

Layer name: sql_statement
Geometry: Unknown (any)
Feature Count: 3
Extent: (10.000000, 10.000000) - (30.000000, 30.000000)
Layer SRS WKT:
(unknown)
Geometry Column NOT NULL = geom
id: Integer (0.0)
id_1: Integer (0.0) NOT NULL
id: Integer (0.0)
attr: Integer (0.0) NOT NULL
id_2: Integer (0.0) NOT NULL
OGRFeature(sql_statement):0
  id (Integer) = 1
  id_1 (Integer) = 1
  attr (Integer) = 100
  id_2 (Integer) = 1
  POINT (10 10)

I suppose that by finetuning the SQL you can get a result that makes you happy.

jratike80 avatar Aug 23 '21 14:08 jratike80

The only solution is to avoid the duplicate column names. Even with your variant of the query, the extra newline makes a difference in the behavior, not only in the number of non-NULL columns, but also in the returned SRS:

$ docker run --rm -v /etc/passwd:/etc/passwd -v /etc/group:/etc/group -v /var/run/postgresql:/var/run/postgresql -u $(id -u):$(id -g) osgeo/gdal:ubuntu-full-3.3.1 ogrinfo -sql 'select t1.*, t1.id as id_1, t2.*, t2.id as id_2 from t1 inner join t2 using (id)' 'PG:dbname=mydb'
INFO: Open of `PG:dbname=mydb'
      using driver `PostgreSQL' successful.

Layer name: sql_statement
Geometry: Unknown (any)
Feature Count: 3
Extent: (10.000000, 10.000000) - (30.000000, 30.000000)
Layer SRS WKT:
GEOGCRS["WGS 84",
    ENSEMBLE["World Geodetic System 1984 ensemble",
        MEMBER["World Geodetic System 1984 (Transit)"],
        MEMBER["World Geodetic System 1984 (G730)"],
        MEMBER["World Geodetic System 1984 (G873)"],
        MEMBER["World Geodetic System 1984 (G1150)"],
        MEMBER["World Geodetic System 1984 (G1674)"],
        MEMBER["World Geodetic System 1984 (G1762)"],
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]],
        ENSEMBLEACCURACY[2.0]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    USAGE[
        SCOPE["Horizontal component of 3D system."],
        AREA["World."],
        BBOX[-90,-180,90,180]],
    ID["EPSG",4326]]
Data axis to CRS axis mapping: 2,1
Geometry Column NOT NULL = geom
id: Integer (0.0)
id_1: Integer (0.0) NOT NULL
id: Integer (0.0)
attr: Integer (0.0) NOT NULL
id_2: Integer (0.0) NOT NULL
OGRFeature(sql_statement):0
  id (Integer) = 1
  id_1 (Integer) = 1
  attr (Integer) = 100
  id_2 (Integer) = 1
  POINT (10 10)

OGRFeature(sql_statement):1
  id (Integer) = 2
  id_1 (Integer) = 2
  attr (Integer) = 200
  id_2 (Integer) = 2
  POINT (20 20)

OGRFeature(sql_statement):2
  id (Integer) = 3
  id_1 (Integer) = 3
  attr (Integer) = 300
  id_2 (Integer) = 3
  POINT (30 30)

$ docker run --rm -v /etc/passwd:/etc/passwd -v /etc/group:/etc/group -v /var/run/postgresql:/var/run/postgresql -u $(id -u):$(id -g) osgeo/gdal:ubuntu-full-3.3.1 ogrinfo -sql '
select t1.*, t1.id as id_1, t2.*, t2.id as id_2 from t1 inner join t2 using (id)' 'PG:dbname=mydb'
INFO: Open of `PG:dbname=mydb'
      using driver `PostgreSQL' successful.

Layer name: sql_statement
Geometry: Unknown (any)
Feature Count: 3
Extent: (10.000000, 10.000000) - (30.000000, 30.000000)
Layer SRS WKT:
(unknown)
id: Integer (0.0)
id_1: Integer (0.0)
attr: Integer (0.0)
id_2: Integer (0.0)
OGRFeature(sql_statement):1
  id_1 (Integer) = 1
  attr (Integer) = 100
  id_2 (Integer) = 1
  POINT (10 10)

OGRFeature(sql_statement):2
  id_1 (Integer) = 2
  attr (Integer) = 200
  id_2 (Integer) = 2
  POINT (20 20)

OGRFeature(sql_statement):3
  id_1 (Integer) = 3
  attr (Integer) = 300
  id_2 (Integer) = 3
  POINT (30 30)

lnicola avatar Aug 23 '21 15:08 lnicola

Sorry, I can't test extra newline on Windows. Is it somehow essential for you to use that?

jratike80 avatar Aug 23 '21 15:08 jratike80

It's not essential and you can avoid all this trouble if you don't have duplicate column names, but it's a bug: Postgres ignores whitespace and getting different behavior when changing that is quite surprising and can hide the root cause (the column names).

lnicola avatar Aug 23 '21 15:08 lnicola

I think that GDAL data model does not support duplicate column names. By running ogr2ogr instead of ogrinfo I can see a message

Warning 1: Field 'id' already exists. Renaming it as 'id2'

The behavior with extra line and that it affects how SRS is reported is indeed odd. But maybe whatever happens does not happen at PostgreSQL level. Have you tried to use SQL with extra lines by using the syntax -sql @filename syntax? Perhaps the problem is in using unescaped linefeed in the command line?

What do you suggest as a fix? Should ogrinfo throw a warning about duplicate field names and rename another field automatically like ogr2ogr does? As a comparison, PSQL reports duplicate fields with SELECT but in another context they do not work

# create table tt_test as select t1.*, t2.* from t1 inner join t2 using (id);
ERROR:  column "id" specified more than once

Here ogr2ogr is more user friendly than PostgreSQL, while ogrinfo is not.

Please notice that I am a GDAL user, not a developer.

jratike80 avatar Aug 23 '21 17:08 jratike80

I think ogr2ogr warns, but only some of the times, depending on whitespace, query structure maybe or something else. So even ogr2ogr is affected by this issue, which is how I originally discovered it.

Looks like the difference is caused by the PostgreSQL driver treats SELECTs differently. The extra newline breaks that auto-detection and it takes the non-cursor path.

psql -f sends the query as-is, and of course, it has no trouble with the extra whitespace.

lnicola avatar Aug 23 '21 17:08 lnicola

You renamed the title, but is the issue at all in duplicate column names or in starting the -sql parameter value with newline or other white space? Do you believe that it can ever work on Windows to use newlines on a command line? I know that ^ works partially like \ on Linux but that is not anything to trust in. Optfiles with linefeeds do work.

jratike80 avatar Aug 23 '21 17:08 jratike80

I guess it depends on how you look at it:

  • the extra whitespace makes the driver take a different code path
  • this code path doesn't handle duplicate columns properly
  • the main one does, more or less (there's still the dropped not-NULL issue, at least)
  • there are ways to hit the alternate code path without relying on whitespace; I haven't tried it, but delete from ... returning ... comes to mind
  • I wasn't able to reproduce the whitespace issue using either @file.sql or extra spaces at the start; something must be trimming the strings
  • it might be possible to reproduce it using ExecuteSQL, but I haven't tried

lnicola avatar Aug 23 '21 18:08 lnicola

Please notice also that the FID column has special rules and NOT NULL is not reported as you maybe believe. Try plain ogrinfo for the table t1:

Layer name: t1
Geometry: Unknown (any)
Feature Count: 3
Extent: (10.000000, 10.000000) - (30.000000, 30.000000)
Layer SRS WKT:
(unknown)
FID Column = id
Geometry Column NOT NULL = geom
OGRFeature(t1):1
  POINT (10 10)

Special rules are because FID must always be NOT NULL and unique and usually it is not a good idea to expose FIDs for users as editable. If you continue to experiment with duplicated column names you should perhaps do it with columns which are not used as FIDs.

jratike80 avatar Aug 23 '21 18:08 jratike80

hey , I want to contribute in this issue . Since , I am a beginner and I know about sql I want to participate . Could you please guide me so that I can move forward ??

jatin142002 avatar Sep 21 '21 05:09 jatin142002

Thanks for the interest in contributing to GDAL. This issue, while it may feel simple, is probably more complicated because the attributes with duplicated names are FID columns. I do not have good enough knowledge about the GDAL inner life with FIDs to say how to handle them in this case. @lnicola has maybe not yet done more experiments with non-FID columns.

The other issue with extra white space (format the query more nicely (there's an extra newline at the start)) might be easier but I do not know if newlines are generally supported in command line commands. If newlines would be accepted they should work in the same way on Linux and Windows.

I apologize that I cannot give more accurate technical advice, I am not a developer really.

jratike80 avatar Sep 21 '21 06:09 jratike80

Thanks for guiding me I'll try to get more knowledge about it . But, could you please suggest me some good first issue because I am eager to contribute in this org .

jatin142002 avatar Sep 21 '21 10:09 jatin142002

I think the difference in behaviour with a newline at the start of the query has an easy solution.

@lnicola is right that the leading newline breaks detection of 'SELECT' queries and causes the code to take a different (non-transactional) path.

There is code to skip leading whitespace in OGRPGDataSource::ExecuteSQL(), but it is janky. It only skips over a contiguous sequence of space (0x20) characters. All other forms of whitespace are not skipped. I think this can be fixed like this:

diff --git a/ogr/ogrsf_frmts/pg/ogrpgdatasource.cpp b/ogr/ogrsf_frmts/pg/ogrpgdatasource.cpp
index 93ef65bd83..4da0528565 100644
--- a/ogr/ogrsf_frmts/pg/ogrpgdatasource.cpp
+++ b/ogr/ogrsf_frmts/pg/ogrpgdatasource.cpp
@@ -2926,7 +2926,7 @@ OGRLayer * OGRPGDataSource::ExecuteSQL( const char *pszSQLCommand,

 {
     /* Skip leading spaces */
-    while(*pszSQLCommand == ' ')
+    while(std::isspace(static_cast<unsigned char>(*pszSQLCommand)))
         pszSQLCommand ++;

     FlushCache(false);

As for the duplicate column names, I'd just resolve by specifying the columns I want instead of using * (which is a good habit to get into in any case). It's wasteful to select both sides of an inner join condition, you already know it will have the same values, so:

SELECT t1.id, t1.geom, t2.attr FROM t1 INNER JOIN t2 USING (id);

direvus avatar Nov 05 '21 03:11 direvus

I think this can be fixed like this:

a PR with this is welcome

rouault avatar Nov 05 '21 10:11 rouault

I think this can be fixed like this

Agreed.

As for the duplicate column names, I'd just resolve by specifying the columns I want instead of using * (which is a good habit to get into in any case). It's wasteful to select both sides of an inner join condition

I know, and it has been said on the thread before. But we can't prevent people from writing queries like these and in some cases the query itself is user-provided. Saying "your SQL is ~~inefficient~~ bad" isn't the solution here.

lnicola avatar Nov 07 '21 06:11 lnicola

On a bit of a tangent, but the stated reason the PG driver has a different code path for SELECT is to avoid putting statements like CREATE DATABASE or VACUUM inside of a transaction block. In Postgres, there are a whole lot more statements that do work very happily in a transaction, than don't. If we are mainly concerned about CREATE DATABASE and VACCUM, perhaps we should test for those, and treat everything else as transactional?

direvus avatar Nov 08 '21 08:11 direvus