noisepage icon indicating copy to clipboard operation
noisepage copied to clipboard

TPC-H : Generate incorrect operator tree for nested query

Open xinzhu-cai opened this issue 4 years ago • 22 comments

Bug Report

Summary

TPCH Query 2 does not stop executing. The transformer.convertToOpExpression() generates an incorrect op_tree that contains the InnerJoin operator repeatedly (InnerJoin -> InnerJoin -> InnerJoin -> ...). Thus, OptimizeGroup tasks never stop.

Environment

I don't think it's environment-specific. I run on macOS 10.14+ with Clang 8.0+

Steps to Reproduce

  • Compile with the following args: -DCMAKE_BUILD_TYPE=Release ..
  • Run terrier executable
  • Connect to the terrier terminal by executing : psql -h localhost -p 15721 -d terrier
  • Create the tables from "https://github.com/oltpbenchmark/oltpbench/blob/master/src/com/oltpbenchmark/benchmarks/tpch/ddls/tpch-postgres-ddl.sql" .
  • Run the query
-- TPC-H Query 2

select
        s_acctbal,
        s_name,
        n_name,
        p_partkey,
        p_mfgr,
        s_address,
        s_phone,
        s_comment
from
        part,
        supplier,
        partsupp,
        nation,
        region
where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and p_size = 15
        and p_type like '%BRASS'
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE'
        and ps_supplycost = (
                select
                        min(ps_supplycost)
                from
                        partsupp,
                        supplier,
                        nation,
                        region
                where
                        p_partkey = ps_partkey
                        and s_suppkey = ps_suppkey
                        and s_nationkey = n_nationkey
                        and n_regionkey = r_regionkey
                        and r_name = 'EUROPE'
        )
order by
        s_acctbal desc,
        n_name,
        s_name,
        p_partkey
limit 100;

This query never stops running. We expect it to return an empty result.

xinzhu-cai avatar Oct 09 '20 21:10 xinzhu-cai

2020-11-13: this query seems to run (though yes, I think it hits an optimizer timeout).

Patch IndexJoinTranslator to only initialize the table_pm_ if table columns are actually used:

diff --git a/src/execution/compiler/operator/index_join_translator.cpp b/src/execution/compiler/operator/index_join_translator.cpp
index 72277602b..ab8914745 100644
--- a/src/execution/compiler/operator/index_join_translator.cpp
+++ b/src/execution/compiler/operator/index_join_translator.cpp
@@ -21,7 +21,6 @@ IndexJoinTranslator::IndexJoinTranslator(const planner::IndexJoinPlanNode &plan,
     : OperatorTranslator(plan, compilation_context, pipeline, brain::ExecutionOperatingUnitType::DUMMY),
       input_oids_(plan.CollectInputOids()),
       table_schema_(GetCodeGen()->GetCatalogAccessor()->GetSchema(plan.GetTableOid())),
-      table_pm_(GetCodeGen()->GetCatalogAccessor()->GetTable(plan.GetTableOid())->ProjectionMapForOids(input_oids_)),
       index_schema_(GetCodeGen()->GetCatalogAccessor()->GetIndexSchema(plan.GetIndexOid())),
       index_pm_(GetCodeGen()->GetCatalogAccessor()->GetIndex(plan.GetIndexOid())->GetKeyOidToOffsetMap()),
       index_iter_(GetCodeGen()->MakeFreshIdentifier("index_iter")),
@@ -30,6 +29,10 @@ IndexJoinTranslator::IndexJoinTranslator(const planner::IndexJoinPlanNode &plan,
       hi_index_pr_(GetCodeGen()->MakeFreshIdentifier("hi_index_pr")),
       table_pr_(GetCodeGen()->MakeFreshIdentifier("table_pr")),
       slot_(GetCodeGen()->MakeFreshIdentifier("slot")) {
+  if (!input_oids_.empty()) {
+    table_pm_ = GetCodeGen()->GetCatalogAccessor()->GetTable(plan.GetTableOid())->ProjectionMapForOids(input_oids_);
+  }
+
   pipeline->RegisterSource(this, Pipeline::Parallelism::Serial);
   if (plan.GetJoinPredicate() != nullptr) {
     compilation_context->Prepare(*plan.GetJoinPredicate());

However, it chokes on the AGGREGATE_MIN.

terminate called after throwing an instance of 'noisepage::NotImplementedException'
  what():  Code generation for expression type 'AGGREGATE_MIN' not supported.

Here is an easier way to reproduce:

noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo where a = (select min(a) from foo);
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

I expect that fixing the AGGREGATE_MIN would allow the query to at least run.

lmwnshn avatar Nov 13 '20 11:11 lmwnshn

The following queries work fine


noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select MIN(a) from foo;
 ?column? 
----------
        1
(1 row)

So the error message

terminate called after throwing an instance of 'noisepage::NotImplementedException'
  what():  Code generation for expression type 'AGGREGATE_MIN' not supported.

is probably not accurate

jkosh44 avatar Nov 15 '20 21:11 jkosh44

If you alter the TPC-H query to take the aggregate out of the nested query, the query still never finishes executing. This suggests to me that there are actually two separate issues here.

  1. Something about nested queries with joins are broken
  2. Aggregates in a nested query are broken.

Altered query:

select
        s_acctbal,
        s_name,
        n_name,
        p_partkey,
        p_mfgr,
        s_address,
        s_phone,
        s_comment
from
        part,
        supplier,
        partsupp,
        nation,
        region
where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and p_size = 15
        and p_type like '%BRASS'
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE'
        and ps_supplycost IN (
                select
                        ps_supplycost
                from
                        partsupp,
                        supplier,
                        nation,
                        region
                where
                        p_partkey = ps_partkey
                        and s_suppkey = ps_suppkey
                        and s_nationkey = n_nationkey
                        and n_regionkey = r_regionkey
                        and r_name = 'EUROPE'
        )
order by
        s_acctbal desc,
        n_name,
        s_name,
        p_partkey
limit 100;

jkosh44 avatar Nov 15 '20 22:11 jkosh44

I actually am not sure if it's two separate issues. When taking a look at the physical plan generated by the optimizer for nested queries, there's a lot of things that don't exactly look right. Such as the entire query being converted ~~to an Inner Nested Loop Join with no predicate,~~ with one of the join children being a sequential scan with no output schema. The aggregate error is coming because the aggregate node is in a place where it doesn't really make sense to have an aggregate.

Also other nested queries are breaking, such as the following

noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo where a in (select a from foo);
 a 
---
 1
 2
 1
 2
(4 rows)

If we run the same query on postgres we get different results

postgres=# create table foo (a int);
CREATE TABLE
postgres=# insert into foo values(1);
INSERT 0 1
postgres=# insert into foo values(2);
INSERT 0 1
postgres=# select * from foo where a IN (select a from foo);
 a 
---
 1
 2
(2 rows)

jkosh44 avatar Nov 16 '20 16:11 jkosh44

Some more strange nested query behavior, I'm not sure if it's related or not.

The following query actually works (which I wouldn't necessarily suspect from my previous comment)

noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# create table foo2 (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo2 values (1);
INSERT 0 1
noisepage=# insert into foo2 values (2);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo where a in (select a from foo2);
 a 
---
 1
 2
(2 rows)

Also the following query segfaults

noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# create table foo2 (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo2 values (1);
INSERT 0 1
noisepage=# insert into foo2 values (2);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo where a in (select min(a) from foo2);
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

jkosh44 avatar Nov 17 '20 13:11 jkosh44

Yet another broken nested query Noisepage:

noisepage=# create table foo (a int, b int);
CREATE TABLE
noisepage=# insert into foo values (1, 2);
INSERT 0 1
noisepage=# insert into foo values (2, 1);
INSERT 0 1
noisepage=# select * from foo where a in (select b from foo);
 a | b 
---+---
(0 rows)

Postgres:

postgres=# create table foo (a int, b int);
CREATE TABLE
postgres=# insert into foo values (1, 2);
INSERT 0 1
postgres=# insert into foo values (2, 1);
INSERT 0 1
postgres=# select * from foo where a in (select b from foo);
 a | b 
---+---
 1 | 2
 2 | 1
(2 rows)

jkosh44 avatar Nov 18 '20 04:11 jkosh44

It seems definitely that when a nested query has the same table on the inside and outside something fails. What I think is happening is the following:

  • The nested query gets converted to a join on the outer query and inner query.
    • For example the query select * from foo where a in (select a from foo); will generate an op tree with a Filter as the root, with a single Mark Join child. The join will have no predicate and two children, both children being a Get from foo. (Check a previous comment to see how this query's output fails).
  • The two tables in the join are exactly the same AND they don't have any table aliases beyond the table name itself.
  • Much of the optimizer code makes the assumption for joins that if the two tables are the same, then at least they will have different aliases.
    • This affects things like how predicates get pushed down and others.
    • For example when we run the op tree mentioned above through the optimizer the predicate gets pushed down to the left side of the join from this code (recall that both sides of the join have the same table alias) https://github.com/cmu-db/noisepage/blob/de5ed288fed474d1f13242c5cce89982b30163de/src/optimizer/rules/rewrite_rules.cpp#L162-L185 That is not correct and at least partially explains why we get a cross product as the result instead of the correct answer (It doesn't matter how we filter the left table, there's no way we end up with the correct answer unless the predicate is moved to the join predicate).
  • Because this assumption isn't held, all sorts of stuff breaks in ways that I'm not quite sure yet.

So one solution might be to check if the two tables are the same and assign them random aliases. Seems a bit hacky but I'll try it out. Another potential solution could be to find everywhere that makes the assumption that joins will have different aliases and remove this assumption. This might arguably be a cleaner approach but will probably end up being much more complex and harder to implement.

jkosh44 avatar Nov 18 '20 05:11 jkosh44

To confirm this theory, we can just redo all the nested queries that failed earlier but this time give them aliases

noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo as f1 where a in (select a from foo as f2);
 a 
---
 1
 2
(2 rows)

That one now gives the correct answer

noisepage=# create table foo (a int, b int);
CREATE TABLE
noisepage=# insert into foo values (1, 2);
INSERT 0 1
noisepage=# insert into foo values (2, 1);
INSERT 0 1
noisepage=# select * from foo as f1 where a in (select b from foo as f2);
 a | b 
---+---
 2 | 1
 1 | 2
(2 rows)

This one also now gives the correct answer

noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo as f1 where a in (select min(a) from foo as f2);
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

This one segfaults, which is consistent with the behavior of the nested query on two different tables with an aggregation in the nested query. This is different than the error we would get without the aliases that @lmwnshn pointed out about AGGREGATE_MIN not being implemented. This suggests to me that there is probably a second issue with aggregates in nested queries.

And finally the TPC-H query that was altered to remove the aggregation (create statements omitted but you can find them in the original issue).

select
        s_acctbal,
        s_name,
        n_name,
        p_partkey,
        p_mfgr,
        s_address,
        s_phone,
        s_comment
from
        part as p1,
        supplier as s1,
        partsupp as pa1,
        nation as n1,
        region as r1
where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and p_size = 15
        and p_type like '%BRASS'
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE'
        and ps_supplycost IN (
                select
                        ps_supplycost
                from
                        partsupp as pa2,
                        supplier as s2,
                        nation as n2,
                        region as r2
                where
                        p_partkey = ps_partkey
                        and s_suppkey = ps_suppkey
                        and s_nationkey = n_nationkey
                        and n_regionkey = r_regionkey
                        and r_name = 'EUROPE'
        )
order by
        s_acctbal desc,
        n_name,
        s_name,
        p_partkey
limit 100;

 s_acctbal | s_name | n_name | p_partkey | p_mfgr | s_address | s_phone | s_comment 
-----------+--------+--------+-----------+--------+-----------+---------+-----------
(0 rows)

That is also the correct answer

jkosh44 avatar Nov 18 '20 14:11 jkosh44

Just thinking out loud here for a bit. With the query SELECT * from foo WHERE a in (SELECT a from foo); We end up with an Op Tree resembling the following (excuse the poor ASCII art)

                                    Logical filter
                                  /               \
                        predicates                |
                        /        \                |
                Compare In        table alias     |
                /        \          set           |
            Column    Column         |            |
            Value       Value       'foo'         |
          'a' from     'a' from                   |
          'foo'         'foo'                     |
                                            Logical Mark Join - join_predicate = null
                                            /           \    
                                        Logical       Logical
                                        Get           Get
                                        'foo'         'foo'

I don't think it's really possible to just remove the assumption that every table will have a different alias. Here it's impossible to connect the different sides of the Compare In to the different sides of the Join because they all have the same aliases. It's trivial in this example and might not matter because it's the same column, but if they were different columns then it would matter. Unless we just assume that left goes with left and right goes with right.

jkosh44 avatar Nov 19 '20 17:11 jkosh44

Something interesting is that it's perfectly valid syntax (though probably not a good idea) to have duplicate aliases in your query, as long as they are in different depths.

For example we can run the following on postgres

postgres=# create table foo (a int);
CREATE TABLE
postgres=# insert into foo values (1);
INSERT 0 1
postgres=# insert into foo values (2);
INSERT 0 1
postgres=# select * from foo as f1 where a in (select a from foo as f1);
 a 
---
 1
 2
(2 rows)

On our system we get the same result as if we didn't specify any aliases

noisepage=# create table foo (a int);
CREATE TABLE
noisepage=# insert into foo values (1);
INSERT 0 1
noisepage=# insert into foo values (2);
INSERT 0 1
noisepage=# select * from foo as f1 where a in (select a from foo as f1);
 a 
---
 1
 1
 2
 2
(4 rows)

Which makes sense because we run into the same issue of duplicate aliases. It would be interesting to see how postgres handles this.

jkosh44 avatar Nov 22 '20 01:11 jkosh44

So it seems like the only thing we can do without rewriting the optimizer is to assign a unique alias to nested queries that don't have a unique alias (This means also potentially changing aliases due to the issue above). It might be possible to rewrite the optimizer to be able to differentiate between duplicate aliases somehow. Possibly include some additional metadata like what query depth each node came from, but this would be much more involved.

The next question is where should we assign the aliases? The postgres parser doesn't maintain much state so it would be difficult to track unique indexes there. When we're in the optimizer building the op tree, the table aliases have already propagated to multiple parts of the AST making it a little difficult. A good candidate is the binder in BindNodeVisitor. Here the table aliases only exist in the TableRef, and part of the responsibility of this class is to propagate the table aliases to all the different parts of the AST. Additionally we already maintain the table aliases for each depth in the BinderContext

Some potential approaches are

  • There is a separate BinderContext for each depth of the query, each of which has a reference to the BinderContext of one depth higher. When visiting the TableRef we can walk up the contexts and see if the table alias is unique. If not we generate a new one. This can be potentially slow because we'll have to walk up the context when checking for unique aliases and when generating new ones.
  • Store a set of all table aliases in the BinderSherpa (I think there's only one for all depths). When visiting TableRef check if it's a unique alias. If not generate a new one. This would cause some duplicate info between the sherpa and the context.

jkosh44 avatar Nov 22 '20 01:11 jkosh44

This is great. I think we should discuss this at the upcoming Tuesday meeting.

lmwnshn avatar Nov 22 '20 03:11 lmwnshn

Postgres actually has some documentation on subqueries in the optimizer https://github.com/postgres/postgres/blob/9fe649ea295f00baf6d0f0c1f9b0cb1298f64fb9/src/backend/optimizer/README#L272-L295 The difference is that we always pull up subqueries in to a join, while they only pull up the subqueries under certain circumstances. When the subquery isn't pulled up it's treated as a black box and executed separately (I think).

The actual planning code can be found here (though I haven't looked at it in depth): https://github.com/postgres/postgres/blob/1375422c7826a2bf387be29895e961614f69de4b/src/backend/optimizer/plan/planner.c

jkosh44 avatar Nov 22 '20 14:11 jkosh44

This commit generates unique table aliases for any duplicate table aliases. It breaks a couple of tests and has some issues still, but for the most part it works. All the nested queries above that don't have an aggregation in them work on this commit: https://github.com/jkosh44/noisepage/commit/75e9ec212d51cf01c10e1f8bef45ffcfc5412785

jkosh44 avatar Nov 22 '20 19:11 jkosh44

Sorry for the spam, but I just want to point out one more thing. Duplicate aliases are not always causing an issue, it seems like when a duplicate alias is in a nested join then that's fine as the following example shows.

postgres=# CREATE TABLE shipment (sno int, pno int, qty int, item_name VARCHAR(32));
CREATE TABLE
postgres=# CREATE TABLE supplier (sno int, quota int);
CREATE TABLE
postgres=# CREATE TABLE part (pno int, amount int, part_name VARCHAR(32));
CREATE TABLE
postgres=# INSERT INTO supplier VALUES (2, 3);
INSERT 0 1
postgres=# INSERT INTO supplier VALUES (3, 100);
INSERT 0 1
postgres=# INSERT INTO supplier VALUES (4, 16);
INSERT 0 1
postgres=# INSERT INTO supplier VALUES (5, 5);
INSERT 0 1
postgres=# INSERT INTO supplier VALUES (6, 0);
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (1, 1, 1, 'item_11');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (2, 1, 1, 'item_11');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (3, 1, 1, 'item_11');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (2, 1, 2, 'item_12');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (3, 1, 3, 'item_13');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (3, 2, 2, 'item_21');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (2, 2, 2, 'item_22');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (2, 3, 1, 'item_31');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (3, 3, 10, 'item_32');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (4, 3, 10, 'item_33');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (2, 4, 4, 'item_41');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (4, 4, 6, 'item_42');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (5, 4, 5, 'item_43');
INSERT 0 1
postgres=# INSERT INTO shipment VALUES (5, 5, 100, 'item_51');
INSERT 0 1
postgres=# INSERT INTO part VALUES (1, 10, 'item_11');
INSERT 0 1
postgres=# INSERT INTO part VALUES (1, 5, 'item_12');
INSERT 0 1
postgres=# INSERT INTO part VALUES (1, 1, 'item_13');
INSERT 0 1
postgres=# INSERT INTO part VALUES (1, 11, 'item_14');
INSERT 0 1
postgres=# INSERT INTO part VALUES (2, 20, 'item_21');
INSERT 0 1
postgres=# INSERT INTO part VALUES (3, 5, 'item_31');
INSERT 0 1
postgres=# INSERT INTO part VALUES (3, 2, 'item_32');
INSERT 0 1
postgres=# INSERT INTO part VALUES (3, 15, 'item_33');
INSERT 0 1
postgres=# INSERT INTO part VALUES (4, 4, 'item_41');
INSERT 0 1
postgres=# INSERT INTO part VALUES (4, 10, 'item_42');
INSERT 0 1
postgres=# INSERT INTO part VALUES (4, 2, 'item_43');
INSERT 0 1
postgres=# INSERT INTO part VALUES (5, 1000, 'item_51');
INSERT 0 1
postgres=# SELECT quota FROM supplier WHERE sno in (SELECT s.sno FROM supplier INNER JOIN shipment s on supplier.sno = s.sno);
 quota 
-------
     3
   100
    16
     5
(4 rows)
noisepage=# CREATE TABLE shipment (sno int, pno int, qty int, item_name VARCHAR(32));
CREATE TABLE
noisepage=# CREATE TABLE supplier (sno int, quota int);
CREATE TABLE
noisepage=# CREATE TABLE part (pno int, amount int, part_name VARCHAR(32));
CREATE TABLE
noisepage=# INSERT INTO supplier VALUES (2, 3);
INSERT 0 1
noisepage=# INSERT INTO supplier VALUES (3, 100);
INSERT 0 1
noisepage=# INSERT INTO supplier VALUES (4, 16);
INSERT 0 1
noisepage=# INSERT INTO supplier VALUES (5, 5);
INSERT 0 1
noisepage=# INSERT INTO supplier VALUES (6, 0);
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (1, 1, 1, 'item_11');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (2, 1, 1, 'item_11');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (3, 1, 1, 'item_11');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (2, 1, 2, 'item_12');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (3, 1, 3, 'item_13');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (3, 2, 2, 'item_21');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (2, 2, 2, 'item_22');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (2, 3, 1, 'item_31');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (3, 3, 10, 'item_32');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (4, 3, 10, 'item_33');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (2, 4, 4, 'item_41');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (4, 4, 6, 'item_42');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (5, 4, 5, 'item_43');
INSERT 0 1
noisepage=# INSERT INTO shipment VALUES (5, 5, 100, 'item_51');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (1, 10, 'item_11');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (1, 5, 'item_12');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (1, 1, 'item_13');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (1, 11, 'item_14');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (2, 20, 'item_21');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (3, 5, 'item_31');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (3, 2, 'item_32');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (3, 15, 'item_33');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (4, 4, 'item_41');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (4, 10, 'item_42');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (4, 2, 'item_43');
INSERT 0 1
noisepage=# INSERT INTO part VALUES (5, 1000, 'item_51');
INSERT 0 1
noisepage=# SELECT quota FROM supplier WHERE sno in (SELECT s.sno FROM supplier INNER JOIN shipment s on supplier.sno = s.sno);
 quota 
-------
     3
   100
    16
     5
(4 rows)

If we replace all duplicate aliases, then even these which aren't causing issues will be replaced.

jkosh44 avatar Nov 22 '20 19:11 jkosh44

So @tanujnay112 has a PR (https://github.com/cmu-db/noisepage/pull/1111) that adds support for duplicate aliases by creating an AliasType class that includes the alias name and serial number. In order to fix this the branch/PR needs to be extended to support duplicate aliases for table names, which it currently does not.

A couple notes are:

  1. TableRef will need to store a TableAlias instead of a string for the alias.
  2. Currently the std::string table_name_ field for ColumnValueExpression actually refers to the table alias and not the underlying name. We'll need to figure out how to make this work once the table alias is a more complex class than a string.

jkosh44 avatar Nov 30 '20 03:11 jkosh44

The following branch and commit extends #1111 to fix this issue: https://github.com/jkosh44/noisepage/tree/unique-table-alias. It's a little messy and needs cleaning up, but it's probably not worth doing too much more work on it until #1111 gets updated and merged. So for now I'll block this issue pending the merge of #1111.

Things that need improving in my commit

  • ~~There's probably a good amount of unnecessary copying due to my laziness around copy/move semantics and references. This should be fixed.~~ (I think I fixed this)
  • ~~I added a couple self explanatory TODOs that should be resolved~~
  • ~~The create/delete/update operators (logical and physical) don't really need to use AliasType for the table name since aliases aren't allowed in those statements. They can probably get away with just having a string for their table name. For this commit it was easier to give them an AliasType, but it might be better to try and switch back to a string.~~
  • Right now I'm assigning all table aliases a unique serial number regardless of whether or not it's a duplicate alias. I think this is probably simpler than trying to figure out when a serial number is needed and only assigning one sometimes. Though it's probably worth thinking about this a little more.
    • I've thought about it, I don't think it's worth it. Feel free to disagree with me.
  • ~~I need to update tests and make sure nothing was broken by this commit (i.e. run unit test and SQL traces).~~
  • ~~Update commit message to be more descriptive and include example SQL query~~
  • ~~Add some comments to explain things better~~
  • ~~Classes that switched from having a string to AliasType and also implement Hash need to be updated to properly hash the AliasType. Also figure out how to not hash serial number.~~
  • #1111 has a handful of merge conflicts with master, so once those are resolved this commit will also need to be updated.

EDIT: This commit is ready for a PR, just waiting on the CTE PR to be merged, then I'll rebase it off of master and open a PR.

jkosh44 avatar Dec 05 '20 05:12 jkosh44

@jkosh44 To be clear, when you say fix this issue, you mean it fixes all of #1229 or only the renaming stuff?

lmwnshn avatar Dec 05 '20 12:12 lmwnshn

@lmwnshn Sorry looks like I spoke too soon. It solves the renaming issue and all the other example queries I put in this issue. But it doesn't solve everything wrong with this issue.

jkosh44 avatar Dec 05 '20 19:12 jkosh44

The issue for this query now, seems to be right here: https://github.com/cmu-db/noisepage/blob/041fe8fb27a5d6db28fa037c20e7b9358e19a108/src/optimizer/optimizer.cpp#L151-L166 It gets stuck in an infinite loop where the task_stack grows infinitely.

First the task stack goes through and completes a bunch of UNNEST_SUBQUERY tasks then it goes through and completes a bunch of PREDICATE_PUSH_DOWN tasks, then it goes through an OTPMIZE_GROUP and a bunch of DERIVE_STATS tasks. Then it gets stuck in a loop where the task_stack continues to grow and shrink, but overall it grows faster than it shrinks. It has the following pattern (0 is the bottom of the stack)

index task type
0 OPTIMIZE_INPUTS
1 APPLY_RULE
2 DERIVE_STATS
3 OPTIMIZE_EXPR
4 DERIVE_STATS
5 OPTIMIZE_EXPR
6 DERIVE_STATS
7 OPTIMIZE_EXPR
... OPTIMIZE_EXPR and DERIVE_STATS continue to alternate

Occasionally there's some 'APPLY_RULE, EXPLORE_GROUP, and 'OPTIMIZE_INPUTS throughout the task_stack but eventually they're removed and replaced with the alternating pattern. The head of the list is some combination of the three previously mentioned rules as well.

EDIT: Just a note, looking at the operator tree generated now it looks to be fine. There are a lot of inner joins nodes (7 of them), but that's how many are needed to combine all the tables. The one weird thing about the op tree is that the LogicalAggregateAndGroupBy node doesn't have anything in the columns_ field, but even simple queries such as SELECT MIN(l_orderkey) FROM lineitem; have an empty column_ field as well.

jkosh44 avatar Dec 20 '20 19:12 jkosh44

#1404 Does affect this query, though fixing #1404 doesn't completely fix this query.

jkosh44 avatar Dec 23 '20 19:12 jkosh44

This query is also likely suffering from #1414. If you apply the fixes from https://github.com/jkosh44/noisepage/tree/unique-table-alias and #1404, and then run the following query (Same as the original query without the LIMIT clause):

select
        s_acctbal,
        s_name,
        n_name,
        p_partkey,
        p_mfgr,
        s_address,
        s_phone,
        s_comment
from
        part,
        supplier,
        partsupp,
        nation,
        region
where
        p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and p_size = 15
        and p_type like '%BRASS'
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE'
        and ps_supplycost = (
                select
                        min(ps_supplycost)
                from
                        partsupp,
                        supplier,
                        nation,
                        region
                where
                        p_partkey = ps_partkey
                        and s_suppkey = ps_suppkey
                        and s_nationkey = n_nationkey
                        and n_regionkey = r_regionkey
                        and r_name = 'EUROPE'
        )
order by
        s_acctbal desc,
        n_name,
        s_name,
        p_partkey;

Then the query will time out and return a result. If we add the LIMIT back in then it get's stuck in the ExecuteTaskStack loop and has the same symptoms as #1414.

This makes sense because the nested query gets pulled up into a join, so this is essentially just a large join.

jkosh44 avatar Dec 24 '20 03:12 jkosh44