gdal icon indicating copy to clipboard operation
gdal copied to clipboard

Repeated columns in SELECT silently dropped with -dialect SQLite

Open jidanni opened this issue 8 months ago • 10 comments

Which one's output is correct? sqlite, two columns:

$ { echo .headers on; echo 'SELECT code,code FROM extent LIMIT 1;';}|\
$ sqlite3 -csv /usr/share/proj/proj.db
code,code
1024,1024

Or ogr2ogr -dialect SQLite, one column:

$ ogr2ogr /vsistdout/ -f CSV raw.csv \
  -sql 'SELECT name,name FROM raw LIMIT 1' -dialect SQLite
name
"8145"

The second name gets thrown away just as if it didn't exist in the SELECT statement.

Well, at least with no -dialect SQLite we manage to get a warning:

$ ogr2ogr /vsistdout/ -f CSV \
  raw.csv -sql 'SELECT Name,Name FROM raw LIMIT 1'
Warning 1: Field 'Name' already exists. Renaming it as 'Name2'
Name,Name2
"8145","8145"

from which we come up with a workaround for -dialect SQLite:

$ ogr2ogr /vsistdout/ -f CSV raw.csv -sql \
  'SELECT Name,Name AS Name2 FROM raw LIMIT 1' -dialect SQLite
name,Name2
"8145","8145"

$ ogr2ogr --version
GDAL 3.10.3, released 2025/04/01

jidanni avatar Apr 10 '25 12:04 jidanni

Using correct SQL is not a workaround, it is a solution. The question is how hard we should work for informing users about what they do wrong. And warnign users about all variations is not as simple as it may feel.

CSV is probably the only so called format that allows having dublicate column names like in

code,code
1024,1024

I would say that having such CSV data is an user error. Especially if the data in the columns are not the same, but like

code,code
1024,2048

It may feel that SQLite or PostGIS can handle the dublicated column names created with an SQL query because you can do SELECT code,code FROM extent LIMIT 1; However, when you do something similar than what ogr2ogr does, for creating a new table CREATE TABLE foo AS SELECT code,code FROM extent LIMIT 1; then SQLite renames the second columns automatically and you will have columns code and code:1.

However, with PostGIS all you will get is an error: ERROR: column "code" specified more than once

jratike80 avatar Apr 10 '25 13:04 jratike80

I couldn't figure out which one is right even with the help of AI.

Anyway, I often use duplicate columns. E.g., putting an elevation into a label.

jidanni avatar Apr 10 '25 13:04 jidanni

I actually use that issue as a feature when doing things like

SELECT ST_something() AS geometry, * FROM ...

so that the columns with the ST_ function is used as the geometry function and overwrites the original geometry. The probably cleaner way would be to use "* EXCLUDE (geometry)", but SQLite doesn't support it

rouault avatar Apr 10 '25 13:04 rouault

What you mean by "which one is right"? Using alias is right. Using dublicate field names is something unexpected and programs behave in different ways in that situation. I will update my previous comment soon.

jratike80 avatar Apr 10 '25 13:04 jratike80

I tried Googling to find the official way SQL should work in such a situation but could not find the answer.

jidanni avatar Apr 10 '25 13:04 jidanni

I tried Googling to find the official way SQL should work in such a situation but could not find the answer.

When it comes to the "select" part, the behavior of select code,code from raw is covered around page 50 in the ANSI SQL-86 standard https://nvlpubs.nist.gov/nistpubs/Legacy/FIPS/fipspub127.pdf. The expected names in the result are code,code. I did not think too much about what they should be when the "value expression" is a function. By experiment it seems that behavior varies according to the SQL engine. select 1,1,1 from raw; gives column names 1,1,1 with SQLite, but ?column?,?column?-2,column?-3 with PostgreSQL.

When it comes to writing side, the standard says: The <column name> shall be different from the <column name> of any other <column definition> in the containing <table definition>. The standard does not say anything about what to do if user tries to violate the rule.

Link to SQL-86 found from https://modern-sql.com/standard. Nome of them defined what GDAL should do when it has data with dublicate column names, it is a different thing. Myself I would edit my data to be non-ambiguous.

jratike80 avatar Apr 10 '25 15:04 jratike80

I actually use that issue as a feature when doing things like

SELECT ST_something() AS geometry, * FROM ...

so that the columns with the ST_ function is used as the geometry function and overwrites the original geometry. The probably cleaner way would be to use "* EXCLUDE (geometry)", but SQLite doesn't support it

oh nice, I never understood before why that was ok

mdsumner avatar Apr 10 '25 21:04 mdsumner

Just like if one requests $ echo colour | fixspell™® -dialect American-English Here the user requested -dialect American-English so the fixspell™® program had better deliver color, not colour else it is violating its -dialect promise.

If it is going to violate its -dialect promise then at least it needs to make a peep on STDERR. It doesn't even need to justify its actions, but at least it needs to alert the user.

Likewise, with ogr2ogr -dialect SQLite, the user is asking for -dialect SQLite, dubious SQL or not, wrong or right. Anything less deserves at least some message on STDERR. Same in fact for https://github.com/OSGeo/gdal/issues/11890 . Perhaps the message could be "Invalid SQL, even if perhaps valid SQLite. Quitting."

jidanni avatar Apr 13 '25 05:04 jidanni

Do you admit that "create table as select code,code from raw" cannot work as-is in any SQL database?

jratike80 avatar Apr 13 '25 09:04 jratike80

Do you admit that "create table as select code,code from raw" cannot work as-is in any SQL database?

I am sorry but that is about three times more complex than my paltry knowledge of SQL.

jidanni avatar Apr 23 '25 10:04 jidanni