TPC-H: Generates incorrect operator tree for nested query
Bug Report
Summary
TPC-H Query 2 never stops executing. transformer.convertToOpExpression() generates an incorrect op_tree that contains the InnerJoin operator repeated indefinitely (InnerJoin -> InnerJoin -> InnerJoin -> ...), so the OptimizeGroup tasks never terminate.
Environment
I don't think it's environment-specific. I'm running on macOS 10.14+ with Clang 8.0+.
Steps to Reproduce
- Compile with the following args: -DCMAKE_BUILD_TYPE=Release ..
- Run the terrier executable
- Connect to terrier by executing: psql -h localhost -p 15721 -d terrier
- Create the tables from https://github.com/oltpbenchmark/oltpbench/blob/master/src/com/oltpbenchmark/benchmarks/tpch/ddls/tpch-postgres-ddl.sql
- Run the following query:
-- TPC-H Query 2
select
s_acctbal,
s_name,
n_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
from
part,
supplier,
partsupp,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and p_size = 15
and p_type like '%BRASS'
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'EUROPE'
and ps_supplycost = (
select
min(ps_supplycost)
from
partsupp,
supplier,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'EUROPE'
)
order by
s_acctbal desc,
n_name,
s_name,
p_partkey
limit 100;
This query never stops running. We expect it to return an empty result.
2020-11-13: this query seems to run (though yes, I think it hits an optimizer timeout).
Patch IndexJoinTranslator to only initialize the table_pm_ if table columns are actually used:
diff --git a/src/execution/compiler/operator/index_join_translator.cpp b/src/execution/compiler/operator/index_join_translator.cpp
index 72277602b..ab8914745 100644
--- a/src/execution/compiler/operator/index_join_translator.cpp
+++ b/src/execution/compiler/operator/index_join_translator.cpp
@@ -21,7 +21,6 @@ IndexJoinTranslator::IndexJoinTranslator(const planner::IndexJoinPlanNode &plan,
: OperatorTranslator(plan, compilation_context, pipeline, brain::ExecutionOperatingUnitType::DUMMY),
input_oids_(plan.CollectInputOids()),
table_schema_(GetCodeGen()->GetCatalogAccessor()->GetSchema(plan.GetTableOid())),
- table_pm_(GetCodeGen()->GetCatalogAccessor()->GetTable(plan.GetTableOid())->ProjectionMapForOids(input_oids_)),
index_schema_(GetCodeGen()->GetCatalogAccessor()->GetIndexSchema(plan.GetIndexOid())),
index_pm_(GetCodeGen()->GetCatalogAccessor()->GetIndex(plan.GetIndexOid())->GetKeyOidToOffsetMap()),
index_iter_(GetCodeGen()->MakeFreshIdentifier("index_iter")),
@@ -30,6 +29,10 @@ IndexJoinTranslator::IndexJoinTranslator(const planner::IndexJoinPlanNode &plan,
hi_index_pr_(GetCodeGen()->MakeFreshIdentifier("hi_index_pr")),
table_pr_(GetCodeGen()->MakeFreshIdentifier("table_pr")),
slot_(GetCodeGen()->MakeFreshIdentifier("slot")) {
+ if (!input_oids_.empty()) {
+ table_pm_ = GetCodeGen()->GetCatalogAccessor()->GetTable(plan.GetTableOid())->ProjectionMapForOids(input_oids_);
+ }
+
pipeline->RegisterSource(this, Pipeline::Parallelism::Serial);
if (plan.GetJoinPredicate() != nullptr) {
compilation_context->Prepare(*plan.GetJoinPredicate());
However, it chokes on the AGGREGATE_MIN.
terminate called after throwing an instance of 'noisepage::NotImplementedException'
what(): Code generation for expression type 'AGGREGATE_MIN' not supported.
Here is an easier way to reproduce:
noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo where a = (select min(a) from foo);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
I expect that fixing the AGGREGATE_MIN would allow the query to at least run.
The following query works fine:
noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select MIN(a) from foo;
?column?
----------
1
(1 row)
So the error message
terminate called after throwing an instance of 'noisepage::NotImplementedException'
what(): Code generation for expression type 'AGGREGATE_MIN' not supported.
is probably not accurate
If you alter the TPC-H query to take the aggregate out of the nested query, the query still never finishes executing. This suggests to me that there are actually two separate issues here.
- Something about nested queries with joins is broken.
- Aggregates in a nested query are broken.
Altered query:
select
s_acctbal,
s_name,
n_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
from
part,
supplier,
partsupp,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and p_size = 15
and p_type like '%BRASS'
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'EUROPE'
and ps_supplycost IN (
select
ps_supplycost
from
partsupp,
supplier,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'EUROPE'
)
order by
s_acctbal desc,
n_name,
s_name,
p_partkey
limit 100;
I'm actually not sure if it's two separate issues. Looking at the physical plan generated by the optimizer for nested queries, there are a lot of things that don't look right, such as the entire query being converted ~~to an Inner Nested Loop Join with no predicate,~~ with one of the join children being a sequential scan with no output schema. The aggregate error occurs because the aggregate node ends up in a place where it doesn't make sense to have an aggregate.
Also other nested queries are breaking, such as the following
noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo where a in (select a from foo);
a
---
1
2
1
2
(4 rows)
If we run the same query on postgres we get different results
postgres=# create table foo (a int);
CREATE TABLE
postgres=# insert into foo values(1);
INSERT 0 1
postgres=# insert into foo values(2);
INSERT 0 1
postgres=# select * from foo where a IN (select a from foo);
a
---
1
2
(2 rows)
Some more strange nested query behavior; I'm not sure if it's related or not.
The following query actually works (which I wouldn't necessarily expect from my previous comment):
noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# create table foo2 (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo2 values (1);
INSERT 0 1
noisepage=# insert into foo2 values (2);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo where a in (select a from foo2);
a
---
1
2
(2 rows)
Also the following query segfaults
noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# create table foo2 (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo2 values (1);
INSERT 0 1
noisepage=# insert into foo2 values (2);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo where a in (select min(a) from foo2);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Yet another broken nested query. NoisePage:
noisepage=# create table foo (a int, b int);
CREATE TABLE
noisepage=# insert into foo values (1, 2);
INSERT 0 1
noisepage=# insert into foo values (2, 1);
INSERT 0 1
noisepage=# select * from foo where a in (select b from foo);
a | b
---+---
(0 rows)
Postgres:
postgres=# create table foo (a int, b int);
CREATE TABLE
postgres=# insert into foo values (1, 2);
INSERT 0 1
postgres=# insert into foo values (2, 1);
INSERT 0 1
postgres=# select * from foo where a in (select b from foo);
a | b
---+---
1 | 2
2 | 1
(2 rows)
It definitely seems that when a nested query uses the same table on the inside and the outside, something fails. What I think is happening is the following:
- The nested query gets converted to a join on the outer query and inner query.
- For example the query
select * from foo where a in (select a from foo);
will generate an op tree with a Filter as the root, with a single Mark Join child. The join will have no predicate and two children, both children being a Get from foo. (Check a previous comment to see how this query's output fails).
- The two tables in the join are exactly the same AND they don't have any table aliases beyond the table name itself.
- Much of the optimizer code makes the assumption for joins that if the two tables are the same, then at least they will have different aliases.
- This affects things like how predicates get pushed down and others.
- For example, when we run the op tree mentioned above through the optimizer, the predicate gets pushed down to the left side of the join by this code (recall that both sides of the join have the same table alias): https://github.com/cmu-db/noisepage/blob/de5ed288fed474d1f13242c5cce89982b30163de/src/optimizer/rules/rewrite_rules.cpp#L162-L185 That is not correct, and it at least partially explains why we get a cross product as the result instead of the correct answer (it doesn't matter how we filter the left table, there's no way we end up with the correct answer unless the predicate is moved to the join predicate). See the sketch after this list.
- Because this assumption doesn't hold, all sorts of things break in ways that I'm not yet sure of.
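To make that push-down point concrete, here is a minimal sketch of the kind of alias-subset check that decides where a predicate gets pushed. The function name, signature, and types are my own assumptions for illustration, not the actual code in rewrite_rules.cpp.

```cpp
// Minimal sketch (illustrative assumptions, not the real NoisePage code).
#include <set>
#include <string>

bool ShouldPushToLeft(const std::set<std::string> &predicate_aliases,
                      const std::set<std::string> &left_child_aliases) {
  // Push the predicate down to the left child only if every table alias the
  // predicate references is produced by the left child.
  for (const auto &alias : predicate_aliases) {
    if (left_child_aliases.count(alias) == 0) return false;
  }
  return true;
}

// With both sides of the mark join scanning 'foo' under the same alias,
// predicate_aliases = {"foo"} and left_child_aliases = {"foo"}, so the check
// passes and the IN predicate lands on the outer scan alone, even though it
// was meant to relate the outer scan to the inner one.
```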
So one solution might be to check if the two tables are the same and assign them random aliases. Seems a bit hacky but I'll try it out. Another potential solution could be to find everywhere that makes the assumption that joins will have different aliases and remove this assumption. This might arguably be a cleaner approach but will probably end up being much more complex and harder to implement.
To confirm this theory, we can just redo all the nested queries that failed earlier, but this time give them aliases:
noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo as f1 where a in (select a from foo as f2);
a
---
1
2
(2 rows)
That one now gives the correct answer
noisepage=# create table foo (a int, b int);
CREATE TABLE
noisepage=# insert into foo values (1, 2);
INSERT 0 1
noisepage=# insert into foo values (2, 1);
INSERT 0 1
noisepage=# select * from foo as f1 where a in (select b from foo as f2);
a | b
---+---
2 | 1
1 | 2
(2 rows)
This one also now gives the correct answer
noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo as f1 where a in (select min(a) from foo as f2);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
This one segfaults, which is consistent with the behavior of the nested query on two different tables with an aggregation in the nested query. This is different from the error we got without the aliases, which @lmwnshn pointed out was about AGGREGATE_MIN not being implemented. This suggests to me that there is probably a second issue with aggregates in nested queries.
And finally the TPC-H query that was altered to remove the aggregation (create statements omitted but you can find them in the original issue).
select
s_acctbal,
s_name,
n_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
from
part as p1,
supplier as s1,
partsupp as pa1,
nation as n1,
region as r1
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and p_size = 15
and p_type like '%BRASS'
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'EUROPE'
and ps_supplycost IN (
select
ps_supplycost
from
partsupp as pa2,
supplier as s2,
nation as n2,
region as r2
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'EUROPE'
)
order by
s_acctbal desc,
n_name,
s_name,
p_partkey
limit 100;
s_acctbal | s_name | n_name | p_partkey | p_mfgr | s_address | s_phone | s_comment
-----------+--------+--------+-----------+--------+-----------+---------+-----------
(0 rows)
That is also the correct answer
Just thinking out loud here for a bit. With the query SELECT * from foo WHERE a in (SELECT a from foo);
We end up with an Op Tree resembling the following (excuse the poor ASCII art)
LogicalFilter
|-- predicates
|   |-- CompareIn
|   |   |-- ColumnValue 'a' from 'foo'
|   |   `-- ColumnValue 'a' from 'foo'
|   `-- table alias set: { 'foo' }
`-- LogicalMarkJoin (join_predicate = null)
    |-- LogicalGet 'foo'
    `-- LogicalGet 'foo'
I don't think it's really possible to just remove the assumption that every table will have a different alias. Here it's impossible to connect the different sides of the Compare In to the different sides of the Join because they all have the same aliases. It's trivial in this example and might not matter because it's the same column, but if they were different columns then it would matter. Unless we just assume that left goes with left and right goes with right.
Something interesting is that it's perfectly valid syntax (though probably not a good idea) to have duplicate aliases in your query, as long as they are at different depths.
For example we can run the following on postgres
postgres=# create table foo (a int);
CREATE TABLE
postgres=# insert into foo values (1);
INSERT 0 1
postgres=# insert into foo values (2);
INSERT 0 1
postgres=# select * from foo as f1 where a in (select a from foo as f1);
a
---
1
2
(2 rows)
On our system we get the same result as if we didn't specify any aliases
noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo as f1 where a in (select a from foo as f1);
a
---
1
1
2
2
(4 rows)
Which makes sense because we run into the same issue of duplicate aliases. It would be interesting to see how postgres handles this.
So it seems like the only thing we can do without rewriting the optimizer is to assign a unique alias to nested queries that don't have a unique alias (This means also potentially changing aliases due to the issue above). It might be possible to rewrite the optimizer to be able to differentiate between duplicate aliases somehow. Possibly include some additional metadata like what query depth each node came from, but this would be much more involved.
The next question is where should we assign the aliases?
The postgres parser doesn't maintain much state, so it would be difficult to track unique indexes there. When we're in the optimizer building the op tree, the table aliases have already propagated to multiple parts of the AST, making it a little difficult. A good candidate is the binder in BindNodeVisitor. Here the table aliases only exist in the TableRef, and part of the responsibility of this class is to propagate the table aliases to all the different parts of the AST. Additionally, we already maintain the table aliases for each depth in the BinderContext.
Some potential approaches are:
- Have a separate BinderContext for each depth of the query, each of which has a reference to the BinderContext one depth higher. When visiting the TableRef we can walk up the contexts and see if the table alias is unique; if not, we generate a new one. This can potentially be slow because we'll have to walk up the contexts both when checking for unique aliases and when generating new ones. See the sketch after this list.
- Store a set of all table aliases in the BinderSherpa (I think there's only one for all depths). When visiting a TableRef, check if its alias is unique; if not, generate a new one. This would cause some duplicate info between the sherpa and the context.
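As a rough illustration of the first approach, here is a minimal sketch. The method names (IsAliasTaken, RegisterAlias) and the suffix scheme are assumptions for illustration; the actual BinderContext API is different.

```cpp
// Minimal sketch of walking up a chain of binder contexts to deduplicate
// table aliases (hypothetical names, not the real BinderContext interface).
#include <string>
#include <unordered_set>

class BinderContext {
 public:
  explicit BinderContext(BinderContext *upper_context) : upper_context_(upper_context) {}

  // Walk up the chain of contexts to see whether an alias is already in use
  // at this depth or any depth above it.
  bool IsAliasTaken(const std::string &alias) const {
    for (const BinderContext *ctx = this; ctx != nullptr; ctx = ctx->upper_context_) {
      if (ctx->table_aliases_.count(alias) > 0) return true;
    }
    return false;
  }

  // Register the alias from a TableRef, generating a fresh one on collision
  // (e.g. a second occurrence of foo becomes foo_1).
  std::string RegisterAlias(const std::string &alias) {
    std::string unique_alias = alias;
    int suffix = 1;
    while (IsAliasTaken(unique_alias)) {
      unique_alias = alias + "_" + std::to_string(suffix++);
    }
    table_aliases_.insert(unique_alias);
    return unique_alias;
  }

 private:
  BinderContext *upper_context_;                   // context one depth higher
  std::unordered_set<std::string> table_aliases_;  // aliases bound at this depth
};
```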
This is great. I think we should discuss this at the upcoming Tuesday meeting.
Postgres actually has some documentation on subqueries in the optimizer https://github.com/postgres/postgres/blob/9fe649ea295f00baf6d0f0c1f9b0cb1298f64fb9/src/backend/optimizer/README#L272-L295 The difference is that we always pull up subqueries into a join, while they only pull up subqueries under certain circumstances. When the subquery isn't pulled up it's treated as a black box and executed separately (I think).
The actual planning code can be found here (though I haven't looked at it in depth): https://github.com/postgres/postgres/blob/1375422c7826a2bf387be29895e961614f69de4b/src/backend/optimizer/plan/planner.c
This commit generates unique table aliases for any duplicate table aliases. It breaks a couple of tests and has some issues still, but for the most part it works. All the nested queries above that don't have an aggregation in them work on this commit: https://github.com/jkosh44/noisepage/commit/75e9ec212d51cf01c10e1f8bef45ffcfc5412785
Sorry for the spam, but I just want to point out one more thing. Duplicate aliases don't always cause an issue; it seems that when a duplicate alias is in a nested join, that's fine, as the following example shows.
postgres=# CREATE TABLE shipment (sno int, pno int, qty int, item_name VARCHAR(32));
CREATE TABLE
postgres=# CREATE TABLE supplier (sno int, quota int);
CREATE TABLE
postgres=# CREATE TABLE part (pno int, amount int, part_name VARCHAR(32));
CREATE TABLE
postgres=# INSERT INTO supplier VALUES (2, 3);
INSERT 0 1
postgres=# INSERT INTO supplier VALUES (3, 100);
INSERT 0 1
postgres=# INSERT INTO supplier VALUES (4, 16);
INSERT 0 1
postgres=# INSERT INTO supplier VALUES (5, 5);
INSERT 0 1
postgres=# INSERT INTO supplier VALUES (6, 0);
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (1, 1, 1, 'item_11');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (2, 1, 1, 'item_11');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (3, 1, 1, 'item_11');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (2, 1, 2, 'item_12');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (3, 1, 3, 'item_13');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (3, 2, 2, 'item_21');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (2, 2, 2, 'item_22');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (2, 3, 1, 'item_31');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (3, 3, 10, 'item_32');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (4, 3, 10, 'item_33');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (2, 4, 4, 'item_41');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (4, 4, 6, 'item_42');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (5, 4, 5, 'item_43');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (5, 5, 100, 'item_51');
INSERT 0 1
postgres=# INSERT INTO part VALUES (1, 10, 'item_11');
INSERT 0 1
postgres=# INSERT INTO part VALUES (1, 5, 'item_12');
INSERT 0 1
postgres=# INSERT INTO part VALUES (1, 1, 'item_13');
INSERT 0 1
postgres=# INSERT INTO part VALUES (1, 11, 'item_14');
INSERT 0 1
postgres=# INSERT INTO part VALUES (2, 20, 'item_21');
INSERT 0 1
postgres=# INSERT INTO part VALUES (3, 5, 'item_31');
INSERT 0 1
postgres=# INSERT INTO part VALUES (3, 2, 'item_32');
INSERT 0 1
postgres=# INSERT INTO part VALUES (3, 15, 'item_33');
INSERT 0 1
postgres=# INSERT INTO part VALUES (4, 4, 'item_41');
INSERT 0 1
postgres=# INSERT INTO part VALUES (4, 10, 'item_42');
INSERT 0 1
postgres=# INSERT INTO part VALUES (4, 2, 'item_43');
INSERT 0 1
postgres=# INSERT INTO part VALUES (5, 1000, 'item_51');
INSERT 0 1
postgres=# SELECT quota FROM supplier WHERE sno in (SELECT s.sno FROM supplier INNER JOIN shipment s on supplier.sno = s.sno);
quota
-------
3
100
16
5
(4 rows)
noisepage=# CREATE TABLE shipment (sno int, pno int, qty int, item_name VARCHAR(32));
CREATE TABLE
noisepage=# CREATE TABLE supplier (sno int, quota int);
CREATE TABLE
noisepage=# CREATE TABLE part (pno int, amount int, part_name VARCHAR(32));
CREATE TABLE
noisepage=# INSERT INTO supplier VALUES (2, 3);
INSERT 0 1
noisepage=# INSERT INTO supplier VALUES (3, 100);
INSERT 0 1
noisepage=# INSERT INTO supplier VALUES (4, 16);
INSERT 0 1
noisepage=# INSERT INTO supplier VALUES (5, 5);
INSERT 0 1
noisepage=# INSERT INTO supplier VALUES (6, 0);
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (1, 1, 1, 'item_11');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (2, 1, 1, 'item_11');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (3, 1, 1, 'item_11');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (2, 1, 2, 'item_12');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (3, 1, 3, 'item_13');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (3, 2, 2, 'item_21');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (2, 2, 2, 'item_22');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (2, 3, 1, 'item_31');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (3, 3, 10, 'item_32');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (4, 3, 10, 'item_33');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (2, 4, 4, 'item_41');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (4, 4, 6, 'item_42');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (5, 4, 5, 'item_43');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (5, 5, 100, 'item_51');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (1, 10, 'item_11');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (1, 5, 'item_12');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (1, 1, 'item_13');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (1, 11, 'item_14');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (2, 20, 'item_21');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (3, 5, 'item_31');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (3, 2, 'item_32');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (3, 15, 'item_33');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (4, 4, 'item_41');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (4, 10, 'item_42');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (4, 2, 'item_43');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (5, 1000, 'item_51');
INSERT 0 1
noisepage=# SELECT quota FROM supplier WHERE sno in (SELECT s.sno FROM supplier INNER JOIN shipment s on supplier.sno = s.sno);
quota
-------
3
100
16
5
(4 rows)
If we replace all duplicate aliases, then even ones that aren't causing issues will be replaced.
So @tanujnay112 has a PR (https://github.com/cmu-db/noisepage/pull/1111) that adds support for duplicate aliases by creating an AliasType class that includes the alias name and a serial number. In order to fix this issue, the branch/PR needs to be extended to support duplicate aliases for table names, which it currently does not. A rough sketch of what such a class could look like follows these notes.
A couple of notes:
- TableRef will need to store a TableAlias instead of a string for the alias.
- Currently the std::string table_name_ field for ColumnValueExpression actually refers to the table alias and not the underlying table name. We'll need to figure out how to make this work once the table alias is a more complex class than a string.
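For illustration, here is a rough sketch of such an alias class; the field and method names are my own assumptions and not the actual AliasType from #1111.

```cpp
// Rough sketch of an alias type carrying a name plus a serial number
// (illustrative assumptions only).
#include <cstdint>
#include <functional>
#include <string>

class AliasType {
 public:
  AliasType(std::string name, uint64_t serial_no)
      : name_(std::move(name)), serial_no_(serial_no) {}

  // Two aliases with the same name but different serial numbers represent
  // distinct occurrences of the same table name (e.g. foo in the outer query
  // vs. foo in the nested query).
  bool operator==(const AliasType &other) const {
    return name_ == other.name_ && serial_no_ == other.serial_no_;
  }

  // Hashing only the name is one way to keep lookups keyed on the alias
  // string working while leaving the serial number out of the hash.
  std::size_t Hash() const { return std::hash<std::string>{}(name_); }

  const std::string &GetName() const { return name_; }
  uint64_t GetSerialNo() const { return serial_no_; }

 private:
  std::string name_;    // the user-visible alias or table name
  uint64_t serial_no_;  // disambiguates duplicate aliases
};
```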
The following branch and commit extend #1111 to fix this issue: https://github.com/jkosh44/noisepage/tree/unique-table-alias. It's a little messy and needs cleaning up, but it's probably not worth doing too much more work on it until #1111 gets updated and merged. So for now I'll block this issue pending the merge of #1111.
Things that need improving in my commit
- ~~There's probably a good amount of unnecessary copying due to my laziness around copy/move semantics and references. This should be fixed.~~ (I think I fixed this)
- ~~I added a couple self explanatory TODOs that should be resolved~~
- ~~The create/delete/update operators (logical and physical) don't really need to use AliasType for the table name since aliases aren't allowed in those statements. They can probably get away with just having a string for their table name. For this commit it was easier to give them an AliasType, but it might be better to try and switch back to a string.~~
- Right now I'm assigning all table aliases a unique serial number regardless of whether or not it's a duplicate alias. I think this is probably simpler than trying to figure out when a serial number is needed and only assigning one sometimes. Though it's probably worth thinking about this a little more.
  - I've thought about it, I don't think it's worth it. Feel free to disagree with me.
- ~~I need to update tests and make sure nothing was broken by this commit (i.e. run unit test and SQL traces).~~
- ~~Update commit message to be more descriptive and include example SQL query~~
- ~~Add some comments to explain things better~~
- ~~Classes that switched from having a string to AliasType and also implement Hash need to be updated to properly hash the AliasType. Also figure out how to not hash the serial number.~~
- #1111 has a handful of merge conflicts with master, so once those are resolved this commit will also need to be updated.
EDIT: This commit is ready for a PR, just waiting on the CTE PR to be merged, then I'll rebase it off of master and open a PR.
@jkosh44 To be clear, when you say fix this issue, you mean it fixes all of #1229 or only the renaming stuff?
@lmwnshn Sorry looks like I spoke too soon. It solves the renaming issue and all the other example queries I put in this issue. But it doesn't solve everything wrong with this issue.
The issue for this query now, seems to be right here: https://github.com/cmu-db/noisepage/blob/041fe8fb27a5d6db28fa037c20e7b9358e19a108/src/optimizer/optimizer.cpp#L151-L166
It gets stuck in an infinite loop where the task_stack grows indefinitely.
First the task stack goes through and completes a bunch of UNNEST_SUBQUERY tasks, then it goes through and completes a bunch of PREDICATE_PUSH_DOWN tasks, then it goes through an OPTIMIZE_GROUP and a bunch of DERIVE_STATS tasks. Then it gets stuck in a loop where the task_stack continues to grow and shrink, but overall it grows faster than it shrinks. It has the following pattern (0 is the bottom of the stack):
index | task type |
---|---|
0 | OPTIMIZE_INPUTS |
1 | APPLY_RULE |
2 | DERIVE_STATS |
3 | OPTIMIZE_EXPR |
4 | DERIVE_STATS |
5 | OPTIMIZE_EXPR |
6 | DERIVE_STATS |
7 | OPTIMIZE_EXPR |
... | OPTIMIZE_EXPR and DERIVE_STATS continue to alternate |
Occasionally there are some APPLY_RULE, EXPLORE_GROUP, and OPTIMIZE_INPUTS tasks throughout the task_stack, but eventually they're removed and replaced with the alternating pattern. The head of the stack is some combination of those three task types as well.
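For context, here is a hypothetical sketch of the general shape of such a task-stack loop. Only the name ExecuteTaskStack comes from this thread; the rest is an illustrative assumption and not the actual optimizer.cpp code.

```cpp
// Hypothetical sketch of a Cascades-style task stack loop, only to illustrate
// how the stack can grow faster than it shrinks.
#include <memory>
#include <stack>

class OptimizerTask {
 public:
  virtual ~OptimizerTask() = default;
  // Executing one task may push any number of follow-up tasks onto the stack.
  virtual void Execute(std::stack<std::unique_ptr<OptimizerTask>> *task_stack) = 0;
};

void ExecuteTaskStack(std::stack<std::unique_ptr<OptimizerTask>> *task_stack) {
  while (!task_stack->empty()) {
    std::unique_ptr<OptimizerTask> task = std::move(task_stack->top());
    task_stack->pop();
    // If a group keeps re-deriving stats and re-optimizing the same
    // expressions (e.g. duplicate work is never recognized as done), each
    // Execute() call pushes more DERIVE_STATS / OPTIMIZE_EXPR tasks than it
    // retires, and this loop never drains without a timeout or budget check.
    task->Execute(task_stack);
  }
}
```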
EDIT: Just a note, looking at the operator tree generated now, it looks fine. There are a lot of inner join nodes (7 of them), but that's how many are needed to combine all the tables. The one weird thing about the op tree is that the LogicalAggregateAndGroupBy node doesn't have anything in its columns_ field, but even simple queries such as SELECT MIN(l_orderkey) FROM lineitem; have an empty columns_ field as well.
#1404 does affect this query, though fixing #1404 doesn't completely fix it.
This query is also likely suffering from #1414. If you apply the fixes from https://github.com/jkosh44/noisepage/tree/unique-table-alias and #1404, and then run the following query (same as the original query without the LIMIT clause):
select
s_acctbal,
s_name,
n_name,
p_partkey,
p_mfgr,
s_address,
s_phone,
s_comment
from
part,
supplier,
partsupp,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and p_size = 15
and p_type like '%BRASS'
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'EUROPE'
and ps_supplycost = (
select
min(ps_supplycost)
from
partsupp,
supplier,
nation,
region
where
p_partkey = ps_partkey
and s_suppkey = ps_suppkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'EUROPE'
)
order by
s_acctbal desc,
n_name,
s_name,
p_partkey;
Then the query will time out and return a result. If we add the LIMIT back in, then it gets stuck in the ExecuteTaskStack loop and has the same symptoms as #1414.
This makes sense because the nested query gets pulled up into a join, so this is essentially just a large join.