datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

TPC-H Query 15

Open alamb opened this issue 3 years ago • 4 comments

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11528

CREATE VIEW/multiple statement support: "The context currently only supports a single SQL statement"

{{Error: NotImplemented("The context currently only supports a single SQL statement")}}

alamb avatar Apr 26 '21 13:04 alamb

Latest failure:

Error: NotImplemented("The context currently only supports a single SQL statement")

andygrove avatar May 10 '22 22:05 andygrove

I'm looking into this issue and seeing a new error message:

Error: NotImplemented("Unsupported SQL statement: Some(\"CREATE VIEW revenue0 (supplier_no, total_revenue) AS SELECT l_suppkey, sum(l_extendedprice * (1 - l_discount)) FROM lineitem WHERE l_shipdate >= DATE '1996-01-01' AND l_shipdate < DATE '1996-01-01' + INTERVAL '3' MONTH GROUP BY l_suppkey\")")

This comes from missing a match case's if statement at datafusion/sql/src/planner.rs: line 192. Query 15's defined aliases (supplier_no, total_revenue) are populating the columns, thus, columns.is_empty() returns false. Removing the check for empty columns allows the view to be generated, but new issues arise when the second SQL statement in Q15 attempts to query the view via:

select
	s_suppkey,
	s_name,
	s_address,
	s_phone,
	total_revenue
from
	supplier,
	revenue0

where 'total_revenue' is a column in 'revenue0' which is the view.

The table default_catalog.default_schema.revenue0 can't be found in session context via the code path: datafusion/sql/src/planner.rs: line 684 -> datafusion/core/src/execution/context.rs: line 1539

This doesn't currently generate useful error output, but debugging led me here.

Before moving forward too much further, I'd like context on the columns.is_empty() check if possible. I feel I may be missing something here.

DaltonModlin avatar Aug 09 '22 19:08 DaltonModlin

git blame https://github.com/apache/arrow-datafusion/blame/master/datafusion/sql/src/planner.rs#L185-L198

seems to show it came in via the initial support for views in https://github.com/apache/arrow-datafusion/pull/2279 by @matthewmturner

alamb avatar Aug 09 '22 19:08 alamb

Removing the check for empty columns allows the view to be generated

@DaltonModlin if the check is removed, do the other tests still pass? If so, perhaps a new test (with aliases) and a new PR targeted at this one isolated change is a good next step. Folks like small PRs around here ;)

avantgardnerio avatar Aug 11 '22 01:08 avantgardnerio

As mentioned in #3266 I believe there are currently 3 more issues to be solved for a successful Q15 run.

  1. The current failure in which the view created in the first part doesn't populate the context's table names hashmap before the select portion's logical plan is being built. I'll create a new issue and mention this issue for easier reference.
  2. #3266
  3. #3267

DaltonModlin avatar Aug 26 '22 16:08 DaltonModlin

This runs now, but returns 0 results most of the time. The way views are executed still looks a little funny to me, so to exclude the possibility of some bug in the view code, I converted it to a with statement like below:

with revenue as (
	select
		l_suppkey as supplier_no,
		sum(l_extendedprice * (1 - l_discount)) as total_revenue
	from
		lineitem
	where
		l_shipdate >= date '1996-01-01'
		and l_shipdate < date '1996-01-01' + interval '3' month
	group by
		l_suppkey)
select
	s_suppkey,
	s_name,
	s_address,
	s_phone,
	total_revenue
from
	supplier,
	revenue
where
	s_suppkey = supplier_no
	and total_revenue = (
		select
			max(total_revenue)
		from
			revenue
	)
order by
	s_suppkey;

Running this query back to back multiple times usually returns 0 results, but sometimes it correctly returns the top supplier as it's supposed to -- 1 result:

+-----------+--------------------+-------------------+-----------------+--------------------+
| s_suppkey | s_name             | s_address         | s_phone         | total_revenue      |
+-----------+--------------------+-------------------+-----------------+--------------------+
| 8449      | Supplier#000008449 | Wp34zim9qYFbVctdW | 20-469-856-8873 | 1772627.2086999998 |
+-----------+--------------------+-------------------+-----------------+--------------------+
1 row in set. Query took 3.363 seconds.

I extracted just the WITH section to see what it's returning. And 2 back to back runs of this query shows different results for total_revenue:

with revenue as (
        select
                l_suppkey as supplier_no,
                sum(l_extendedprice * (1 - l_discount)) as total_revenue
        from
                lineitem
        where
                l_shipdate >= date '1996-01-01'
                and l_shipdate < date '1996-01-01' + interval '3' month
        group by
                l_suppkey) select * from revenue order by 2 desc limit 1;
+-------------+---------------+
| supplier_no | total_revenue |
+-------------+---------------+
| 8449        | 1772627.2087  |
+-------------+---------------+
1 row in set. Query took 2.959 seconds.

then

+-------------+--------------------+
| supplier_no | total_revenue      |
+-------------+--------------------+
| 8449        | 1772627.2086999998 |
+-------------+--------------------+
1 row in set. Query took 2.554 seconds.

I understand floating point results are uncertain. Is that what's going on here?

kmitchener avatar Sep 05 '22 15:09 kmitchener

I think for this to be correct we should switch to using a decimal type, as using floating point will result in small rounding errors, because parallel execution does not have deterministic ordering.

Dandandan avatar Sep 05 '22 18:09 Dandandan