datafusion
datafusion copied to clipboard
TPC-H Query 15
Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11528
CREATE VIEW/multiple statement support: "The context currently only supports a single SQL statement"
{{Error: NotImplemented("The context currently only supports a single SQL statement")}}
Latest failure:
Error: NotImplemented("The context currently only supports a single SQL statement")
I'm looking into this issue and seeing a new error message:
Error: NotImplemented("Unsupported SQL statement: Some(\"CREATE VIEW revenue0 (supplier_no, total_revenue) AS SELECT l_suppkey, sum(l_extendedprice * (1 - l_discount)) FROM lineitem WHERE l_shipdate >= DATE '1996-01-01' AND l_shipdate < DATE '1996-01-01' + INTERVAL '3' MONTH GROUP BY l_suppkey\")")
This comes from missing a match case's if statement at datafusion/sql/src/planner.rs: line 192
. Query 15's defined aliases (supplier_no, total_revenue) are populating the columns, thus, columns.is_empty()
returns false. Removing the check for empty columns allows the view to be generated, but new issues arise when the second SQL statement in Q15 attempts to query the view via:
select
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
from
supplier,
revenue0
where 'total_revenue' is a column in 'revenue0' which is the view.
The table default_catalog.default_schema.revenue0 can't be found in session context via the code path:
datafusion/sql/src/planner.rs: line 684 -> datafusion/core/src/execution/context.rs: line 1539
This doesn't currently generate useful error output, but debugging led me here.
Before moving forward too much further, I'd like context on the columns.is_empty() check if possible. I feel I may be missing something here.
git blame https://github.com/apache/arrow-datafusion/blame/master/datafusion/sql/src/planner.rs#L185-L198
seems to show it came in via the initial support for views in https://github.com/apache/arrow-datafusion/pull/2279 by @matthewmturner
Removing the check for empty columns allows the view to be generated
@DaltonModlin if the check is removed, do the other tests still pass? If so, perhaps a new test (with aliases) and a new PR targeted at this one isolated change is a good next step. Folks like small PRs around here ;)
As mentioned in #3266 I believe there are currently 3 more issues to be solved for a successful Q15 run.
- The current failure in which the view created in the first part doesn't populate the context's table names hashmap before the select portion's logical plan is being built. I'll create a new issue and mention this issue for easier reference.
- #3266
- #3267
This runs now, but returns 0 results most of the time. The way views are executed still looks a little funny to me, so to exclude the possibility of some bug in the view code, I converted it to a with statement like below:
with revenue as (
select
l_suppkey as supplier_no,
sum(l_extendedprice * (1 - l_discount)) as total_revenue
from
lineitem
where
l_shipdate >= date '1996-01-01'
and l_shipdate < date '1996-01-01' + interval '3' month
group by
l_suppkey)
select
s_suppkey,
s_name,
s_address,
s_phone,
total_revenue
from
supplier,
revenue
where
s_suppkey = supplier_no
and total_revenue = (
select
max(total_revenue)
from
revenue
)
order by
s_suppkey;
Running this query back to back multiple times usually returns 0 results, but sometimes it correctly returns the top supplier as it's supposed to -- 1 result:
+-----------+--------------------+-------------------+-----------------+--------------------+
| s_suppkey | s_name | s_address | s_phone | total_revenue |
+-----------+--------------------+-------------------+-----------------+--------------------+
| 8449 | Supplier#000008449 | Wp34zim9qYFbVctdW | 20-469-856-8873 | 1772627.2086999998 |
+-----------+--------------------+-------------------+-----------------+--------------------+
1 row in set. Query took 3.363 seconds.
I extracted just the WITH section to see what it's returning. And 2 back to back runs of this query shows different results for total_revenue
:
with revenue as (
select
l_suppkey as supplier_no,
sum(l_extendedprice * (1 - l_discount)) as total_revenue
from
lineitem
where
l_shipdate >= date '1996-01-01'
and l_shipdate < date '1996-01-01' + interval '3' month
group by
l_suppkey) select * from revenue order by 2 desc limit 1;
+-------------+---------------+
| supplier_no | total_revenue |
+-------------+---------------+
| 8449 | 1772627.2087 |
+-------------+---------------+
1 row in set. Query took 2.959 seconds.
then
+-------------+--------------------+
| supplier_no | total_revenue |
+-------------+--------------------+
| 8449 | 1772627.2086999998 |
+-------------+--------------------+
1 row in set. Query took 2.554 seconds.
I understand floating point results are uncertain. Is that what's going on here?
I think for this to be correct we should switch to using a decimal
type, as using floating point will result in small rounding errors, because parallel execution does not have deterministic ordering.