SQL: "ambiguous column reference" in attempted ORDER BY and/or WHERE with JOIN
This works in Postgres, but not in SuperDB.
$ super -c "SELECT * FROM 'integers.sup' LEFT OUTER JOIN 'integers2.sup' ON integers.i=integers2.k ORDER BY i;"
"i": ambiguous column reference at line 1, column 97:
SELECT * FROM 'integers.sup' LEFT OUTER JOIN 'integers2.sup' ON integers.i=integers2.k ORDER BY i;
~
Details
Repro is with super commit e75dfda. This was found via a query from a sqllogictest.
This works in Postgres.
$ psql postgres
psql (17.5 (Homebrew))
Type "help" for help.
postgres=# CREATE TABLE integers(i INTEGER, j INTEGER);
CREATE TABLE
postgres=# INSERT INTO integers VALUES (1, 2), (2, 3), (3, 4);
INSERT 0 3
postgres=# CREATE TABLE integers2(k INTEGER, l INTEGER);
CREATE TABLE
postgres=# INSERT INTO integers2 VALUES (1, 10), (2, 20);
INSERT 0 2
postgres=# SELECT * FROM integers LEFT OUTER JOIN integers2 ON integers.i=integers2.k ORDER BY i;
 i | j | k | l
---+---+---+----
 1 | 2 | 1 | 10
 2 | 3 | 2 | 20
 3 | 4 |   |
(3 rows)
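For another point of comparison, the same tables and query can be reproduced with SQLite via Python's stdlib sqlite3 module (unrelated to super's implementation); it likewise resolves the bare i in the ORDER BY because only one join input provides that column:

```python
import sqlite3

# Build the same two tables used in the Postgres repro.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE integers(i INTEGER, j INTEGER);
    INSERT INTO integers VALUES (1, 2), (2, 3), (3, 4);
    CREATE TABLE integers2(k INTEGER, l INTEGER);
    INSERT INTO integers2 VALUES (1, 10), (2, 20);
""")

# The bare ORDER BY i is unambiguous: only the left input has a column i.
rows = con.execute("""
    SELECT * FROM integers
    LEFT OUTER JOIN integers2 ON integers.i = integers2.k
    ORDER BY i
""").fetchall()
print(rows)  # [(1, 2, 1, 10), (2, 3, 2, 20), (3, 4, None, None)]
```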
However, it's not yet supported in SuperDB.
$ super -version
Version: e75dfdaf1
$ cat integers.sup
{i:1::int32,j:2::int32}
{i:2::int32,j:3::int32}
{i:3::int32,j:4::int32}
$ cat integers2.sup
{k:1::int32,l:10::int32}
{k:2::int32,l:20::int32}
$ super -c "SELECT * FROM 'integers.sup' LEFT OUTER JOIN 'integers2.sup' ON integers.i=integers2.k ORDER BY i;"
"i": ambiguous column reference at line 1, column 97:
SELECT * FROM 'integers.sup' LEFT OUTER JOIN 'integers2.sup' ON integers.i=integers2.k ORDER BY i;
~
Of course, if we drop the ORDER BY, the query works fine and we can see that column i is present in the output.
$ super -c "SELECT * FROM 'integers.sup' LEFT OUTER JOIN 'integers2.sup' ON integers.i=integers2.k;"
{i:1::int32,j:2::int32,k:1::int32,l:10::int32}
{i:2::int32,j:3::int32,k:2::int32,l:20::int32}
{i:3::int32,j:4::int32}
Also, another sqllogictest defined in the same file shows a similar complaint in the WHERE clause as well.
$ super -c "SELECT * FROM 'integers.sup' LEFT OUTER JOIN 'integers2.sup' ON integers.i=integers2.k WHERE k IS NOT NULL ORDER BY i;"
"k": ambiguous column reference at line 1, column 95:
SELECT * FROM 'integers.sup' LEFT OUTER JOIN 'integers2.sup' ON integers.i=integers2.k WHERE k IS NOT NULL ORDER BY i;
~
"i": ambiguous column reference at line 1, column 118:
SELECT * FROM 'integers.sup' LEFT OUTER JOIN 'integers2.sup' ON integers.i=integers2.k WHERE k IS NOT NULL ORDER BY i;
~
After coming across more queries that trigger the "ambiguous column reference" message, I've got some easier repro recipes. I can't say with certainty that they all share the same root cause, but I'll submit them here in case they're a better place to start on improvements.
Given input data:
$ cat a.json
{"a":1}
{"a":2}
$ cat b.json
{"b":3}
{"b":4}
Triggering the error at super commit 230007c:
$ super -version
Version: 230007cb7
$ super -c "SELECT a,b FROM 'a.json', 'b.json';"
"a": ambiguous column reference at line 1, column 8:
SELECT a,b FROM 'a.json', 'b.json';
~
"b": ambiguous column reference at line 1, column 10:
SELECT a,b FROM 'a.json', 'b.json';
~
Compared with a legacy SQL system handling it ok:
$ duckdb --version
v1.3.2 (Ossivalis) 0b83e5d2f6
$ duckdb -c "SELECT a,b FROM 'a.json', 'b.json';"
┌───────┬───────┐
│ a │ b │
│ int64 │ int64 │
├───────┼───────┤
│ 1 │ 3 │
│ 2 │ 3 │
│ 1 │ 4 │
│ 2 │ 4 │
└───────┴───────┘
I can make super happy if I fully qualify the field names.
$ super -c "SELECT a.a,b.b FROM 'a.json', 'b.json';"
{a:1,b:3}
{a:2,b:3}
{a:1,b:4}
{a:2,b:4}
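For what it's worth, the standard-SQL name resolution rule here can be seen in SQLite (via Python's stdlib sqlite3, purely as an illustration): a bare column reference binds as long as exactly one input provides that column, so the bare and fully qualified spellings return the same cartesian product:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a(a INTEGER); INSERT INTO a VALUES (1), (2);
    CREATE TABLE b(b INTEGER); INSERT INTO b VALUES (3), (4);
""")

# Bare references: each column name appears in exactly one input.
bare = con.execute("SELECT a, b FROM a, b ORDER BY b, a").fetchall()
# Fully qualified references, as in the super workaround above.
qualified = con.execute("SELECT a.a, b.b FROM a, b ORDER BY b, a").fetchall()

assert bare == qualified == [(1, 3), (2, 3), (1, 4), (2, 4)]
```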
I also stumbled onto a way to avoid the ambiguous column reference error entirely and instead get a bad query result: add a third input to the mix.
$ cat c.json
{"c":5}
{"c":6}
$ super -c "SELECT a,b,c FROM 'a.json', 'b.json', 'c.json';"
{a:1,b:error("missing"),c:error("missing")}
{a:2,b:error("missing"),c:error("missing")}
{a:1,b:error("missing"),c:error("missing")}
{a:2,b:error("missing"),c:error("missing")}
{a:1,b:error("missing"),c:error("missing")}
{a:2,b:error("missing"),c:error("missing")}
{a:1,b:error("missing"),c:error("missing")}
{a:2,b:error("missing"),c:error("missing")}
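For reference, the expected result for this three-way comma join is a plain 2x2x2 cartesian product: eight rows with every column populated, which can be sanity-checked with SQLite (stdlib sqlite3, purely for comparison):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a(a INTEGER); INSERT INTO a VALUES (1), (2);
    CREATE TABLE b(b INTEGER); INSERT INTO b VALUES (3), (4);
    CREATE TABLE c(c INTEGER); INSERT INTO c VALUES (5), (6);
""")

rows = con.execute("SELECT a, b, c FROM a, b, c").fetchall()
assert len(rows) == 8                        # 2 x 2 x 2 combinations
assert all(None not in row for row in rows)  # no missing values
```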
Here's a summary of a recent group discussion on this topic.
Revisiting the original repro query with current super commit 857796a, @mccanne explained that the error occurs because, when operating on dynamic data in relational mode, it's not currently possible to resolve the bare reference to i in ORDER BY i to one of the join's two inputs. So, as with the other examples in https://github.com/brimdata/super/issues/5983#issuecomment-3134296222, one current workaround is to make the field reference in the ORDER BY explicit, e.g.,
$ super -version
Version: 857796aa7
$ super -c "SELECT * FROM 'integers.sup' LEFT OUTER JOIN 'integers2.sup' ON integers.i=integers2.k ORDER BY integers.i;"
{i:1::int32,j:2::int32,k:1::int32,l:10::int32}
{i:2::int32,j:3::int32,k:2::int32,l:20::int32}
{i:3::int32,j:4::int32}
The original query could also work if static schema information were available, since that would allow the bare reference to i to be resolved. The changes in #6097 made this possible with Parquet data, so we can see the benefit if we convert our input files from SUP to Parquet.
$ super -f parquet -o integers.parquet integers.sup
$ super -f parquet -o integers2.parquet integers2.sup
$ super -c "SELECT * FROM 'integers.parquet' LEFT OUTER JOIN 'integers2.parquet' ON integers.i=integers2.k ORDER BY i;"
{i:1::int32,j:2::int32,k:1::int32,l:10::int32}
{i:2::int32,j:3::int32,k:2::int32,l:20::int32}
{i:3::int32,j:4::int32}
A similar approach should be possible once there's support for SQL with schemas when using persistent super db storage (e.g., CREATE TABLE ...). Schema support is also expected to be added for CSUP files.
When still working with dynamic data such as the original SUP, a couple of ideas were proposed for improvement:
- An option could be provided to have super read fully through the data once to create a fused schema.
- The error message could be improved to guide the user toward making the reference explicit in the query or invoking the proposed option to create the fused schema.
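As an illustration only (the names and logic below are hypothetical, not super's implementation), the fused-schema idea amounts to one full pass over each input that unions the field names seen, after which a bare reference can be bound to whichever input's fused schema contains it:

```python
def fuse_schema(records):
    """One pass over dynamic records, unioning field names into one schema."""
    fused = {}
    for rec in records:
        for field, value in rec.items():
            fused.setdefault(field, type(value).__name__)
    return fused

left = [{"i": 1, "j": 2}, {"i": 2, "j": 3}, {"i": 3, "j": 4}]
right = [{"k": 1, "l": 10}, {"k": 2, "l": 20}]

# With both fused schemas known, a bare "i" in ORDER BY is unambiguous:
# it appears only in the left input's schema.
assert fuse_schema(left) == {"i": "int", "j": "int"}
assert fuse_schema(right) == {"k": "int", "l": "int"}
```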
The above observations also apply to the other repro example shown in https://github.com/brimdata/super/issues/5983#issuecomment-3134296222, i.e., the repro query can be made to work with bare references to a and b if the inputs are converted to Parquet files.
$ super -f parquet -o a.parquet a.json
$ super -f parquet -o b.parquet b.json
$ super -c "SELECT a,b FROM 'a.parquet', 'b.parquet';"
{a:1,b:3}
{a:2,b:3}
{a:1,b:4}
{a:2,b:4}
Here's a new twist I just stumbled upon: it looks like leveraging the static schemas from Parquet files currently depends on the presence of a .parquet extension. Picking up where I left off in the most recent comment above:
$ super -version
Version: 857796aa7
$ ln -s a.parquet a.parq
$ ln -s b.parquet b.parq
$ super -c "SELECT a,b FROM 'a.parq', 'b.parq';"
"a": ambiguous column reference at line 1, column 8:
SELECT a,b FROM 'a.parq', 'b.parq';
~
"b": ambiguous column reference at line 1, column 10:
SELECT a,b FROM 'a.parq', 'b.parq';
~
I checked this with the Dev team and they pointed out that this can be made to work by adding explicit format references, e.g.,
$ super -c "SELECT a,b FROM 'a.parq' (FORMAT parquet), 'b.parq' (FORMAT parquet);"
{a:1,b:3}
{a:1,b:4}
{a:2,b:3}
{a:2,b:4}
This rhymes with other systems that rely on file extensions as part of auto-detecting formats, e.g.,
$ duckdb --version
v1.3.2 (Ossivalis) 0b83e5d2f6
$ duckdb -c "SELECT a,b FROM 'a.parquet', 'b.parquet';"
┌───────┬───────┐
│ a │ b │
│ int64 │ int64 │
├───────┼───────┤
│ 1 │ 3 │
│ 2 │ 3 │
│ 1 │ 4 │
│ 2 │ 4 │
└───────┴───────┘
$ duckdb -c "SELECT a,b FROM 'a.parq', 'b.parq';"
Binder Error:
No extension found that is capable of reading the file "a.parq"
* If this file is a supported file format you can explicitly use the reader functions, such as read_csv, read_json or read_parquet
$ duckdb -c "SELECT a,b FROM read_parquet('a.parq'), read_parquet('b.parq');"
┌───────┬───────┐
│ a │ b │
│ int64 │ int64 │
├───────┼───────┤
│ 1 │ 3 │
│ 2 │ 3 │
│ 1 │ 4 │
│ 2 │ 4 │
└───────┴───────┘
However, what threw me as a user is that the file extension didn't seem to matter for super with this simpler query:
$ super -c "SELECT a FROM 'a.parq';"
{a:1}
{a:2}
But the extension still mattered for DuckDB:
$ duckdb -c "SELECT a FROM 'a.parq';"
Binder Error:
No extension found that is capable of reading the file "a.parq"
* If this file is a supported file format you can explicitly use the reader functions, such as read_csv, read_json or read_parquet
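The behavior in both tools is consistent with extension-based format detection roughly like the following sketch (the helper and table are hypothetical, not super's or DuckDB's actual code): an explicit hint wins, otherwise the extension decides, and an unknown extension leaves the format (and hence any static schema) undetected:

```python
from pathlib import Path

# Hypothetical extension-to-format table for illustration.
EXTENSION_FORMATS = {".parquet": "parquet", ".json": "json", ".sup": "sup"}

def detect_format(path, explicit=None):
    """Prefer an explicit (FORMAT ...) style hint; else go by extension."""
    if explicit:
        return explicit
    return EXTENSION_FORMATS.get(Path(path).suffix)

assert detect_format("a.parquet") == "parquet"
assert detect_format("a.parq") is None                      # unknown extension
assert detect_format("a.parq", explicit="parquet") == "parquet"
```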
The changes in linked PR #6341 improve on some of the examples shown above: schema information is now obtained from Parquet files even if the extension is not .parquet and there's no (FORMAT parquet) hint, e.g., this now works where it previously didn't:
$ super -version
Version: 6f83f8d4f
$ super -c "SELECT a,b FROM 'a.parq', 'b.parq';"
{a:1,b:3}
{a:2,b:3}
{a:1,b:4}
{a:2,b:4}