materialize
materialize copied to clipboard
Likely mis-optimization in `SELECT` subquery
What version of Materialize are you using?
main
How did you install Materialize?
Materialize Cloud
What is the issue?
The commands
create table test1 (a int, b int);
create table test2 (a int, b int);
create view test3 as select a, b, count(*) as c from test2 group by a, b;
insert into test1 values (1, 3);
insert into test2 values (1, 2), (1, 3);
select a, b, ( select c from test3 where a = test3.a and b = test3.b) from test1;
results in
CREATE TABLE
CREATE TABLE
CREATE VIEW
INSERT 0 1
INSERT 0 2
ERROR: Invalid data in source, saw retractions (1) for row that does not exist: [Int32(1), Int32(3), Null]
It appears to be important that test3
has a primary key (using test2
directly in its place does not provoke the error).
Further things that do and do not reproduce the error:
materialize=> select a, ( select c from test3 where a = test3.a ) from test1;
ERROR: Evaluation error: more than one record produced in subquery
materialize=> select b, ( select c from test3 where b = test3.b ) from test1;
ERROR: Evaluation error: more than one record produced in subquery
materialize=> create view test4 as select a, count(*) as c from test2 group by a;
CREATE VIEW
materialize=> create view test5 as select b, count(*) as c from test2 group by b;
CREATE VIEW
materialize=> select a, ( select c from test4 where a = test4.a ) from test1;
1 | 2
materialize=> select b, ( select c from test5 where b = test5.b ) from test1;
ERROR: Invalid data in source, saw retractions (1) for row that does not exist: [Int32(3), Null]
materialize=>
The evaluation errors seem fine, and are probably the correct answers.
The plan for the final query is
materialize=> explain select b, ( select c from test5 where b = test5.b ) from test1;
Source materialize.public.test1 (u73): +
| Project (#1) +
+
Source materialize.public.test2 (u74): +
| Filter (#1 = #1) +
| Project (#1) +
+
Query: +
%0 = Let l0 = +
| Get materialize.public.test2 (u74) +
| Filter (#1 = #1) +
| Project (#1) +
| Reduce group=(#0) +
| | agg count(true) +
| Project (#1) +
+
%1 = +
| Get materialize.public.test1 (u73) +
| Project (#1) +
| ArrangeBy () +
+
%2 = +
| Get %0 (l0) +
| Project () +
| Negate +
+
%3 = +
| Constant () +
+
%4 = +
| Union %2 %3 +
| Map null +
+
%5 = +
| Union %0 %4 +
+
%6 = +
| Join %1 %5 +
| | implementation = Differential %5 %1.()+
materialize=>
Relevant log output
No response
In postgres the same result returns:
alexander> create table test1 (a int, b int);
create table test2 (a int, b int);
create view test3 as select a, b, count(*) as c from test2 group by a, b;
insert into test1 values (1, 3);
insert into test2 values (1, 2), (1, 3);
select a, b, ( select c from test3 where a = test3.a and b = test3.b) from test1;
CREATE TABLE
CREATE TABLE
CREATE VIEW
INSERT 0 1
INSERT 0 2
more than one row returned by a subquery used as an expression
Time: 0.021s
Is the issue here that we are not throwing the same error?
If we ever return
ERROR: Invalid data in source, saw retractions (1) for row that does not exist: [Int32(3), Null]
we have introduced a collection with negative multiplicities. This seems like some flavor of consistency violation.
Yeah I mean is this an issue for a query which is ill-defined either way. Looking at the definition
select a, b, ( select c from test3 where a = test3.a and b = test3.b ) from test1;
I think there is a slight chance that what you really want to say is:
select a, b, ( select c from test3 where a = test1.a and b = test1.b ) from test1;
I just want to double-check on that before I dig deeper.
The issue is only about the inconsistency. The query is probably "the wrong query", but we shouldn't be introducing negative multiplicity collections without erroring. So, the bug is probably "postgres gives a clear error; we only give an error if you look at the results, and might not if you materialize them, and then use them in another query".
Got it.
select a, b, ( select c from test3) from test1;
produces the expected error, whereas
select a, b, ( select c from test3 where a = a and b = b ) from test1
and the from simpler
select a, b, ( select c from test3 where a = a and b = b ) from test1 -- equivalent to the above
select a, b, ( select c from test3 where a = a ) from test1 -- only the first predicate
only the first one has this issue, so I think the first thing is to look at differences in the plans produced by those three.
Looks like an optimization bug, the decorrelated plans for the two queries are similar, but the optimized plans are different. Assigning to myself and I will find somebody to look at this ASAP.
We can simplify and look just at:
select ( select c from test3 where a = a and b = b );
vs
select ( select c from test3 where a = a );
Assigning this to @ggevay. Please take a look at that after you have opened a PR for the IN
list planning.