blazingsql icon indicating copy to clipboard operation
blazingsql copied to clipboard

[FEA] Support aliases in aggregation clauses

Open beckernick opened this issue 3 years ago • 4 comments

I'd like to be able to do a GROUP BY aggregation and then filter the resulting aggregated data based on a condition using the HAVING clause while including the aggregation column. In the 2020-11-09 nightlies, this looks like it's not supported. Based on the Calcite reference It looks like this could be possible.

One use case might be doing a group count and filtering to only groups with count greater than one.

from blazingsql import BlazingContext
import cudf
from dask.datasets import timeseries
import pandas as pd
​
# bc = BlazingContext()
​
df = pd.DataFrame({
    "a": ["a", "b","c","a","b","b"],
    "b": [10, 9, 5, 3, 2, 4],
})
​
bc.create_table("df", df)
​
query = """
SELECT
    COUNT(b) as CNT
FROM df
GROUP BY a
HAVING CNT > 1
"""
​
print(bc.explain(query))
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so in com.blazingdb.calcite.application.RelationalAlgebraGenerator.getRelationalAlgebraString()

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so in com.blazingdb.calcite.application.RelationalAlgebraGenerator.getRelationalAlgebra()

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so in com.blazingdb.calcite.application.RelationalAlgebraGenerator.getNonOptimizedRelationalAlgebra()

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so in com.blazingdb.calcite.application.RelationalAlgebraGenerator.validateQuery()

Exception: Java Exception

The above exception was the direct cause of the following exception:

com.blazingdb.calcite.application.SqlValidationExceptionTraceback (most recent call last)
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/pyblazing/apiv2/context.py in explain(self, sql)
   1711         try:
-> 1712             algebra = self.generator.getRelationalAlgebraString(sql)
   1713 

com.blazingdb.calcite.application.SqlValidationException: com.blazingdb.calcite.application.SqlValidationException: Column 'CNT' not found in any table

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-64-6f215191bda4> in <module>
     21 """
     22 
---> 23 print(bc.explain(query))
     24 print(bc.sql(query))

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/pyblazing/apiv2/context.py in explain(self, sql)
   1714         except SqlValidationExceptionClass as exception:
   1715             # jpype.JException as exception:
-> 1716             raise Exception(exception.message())
   1717             # algebra = ""
   1718             # print("SQL Parsing Error")

Exception: Column 'CNT' not found in any table
from pyspark.sql import SparkSession
​
spark = SparkSession.builder \
    .master("local") \
    .getOrCreate()
​
​sdf = spark.createDataFrame(df)
sdf.createTempView("sdf")
​
query = """
SELECT
    COUNT(b) as CNT
FROM sdf
GROUP BY a
HAVING CNT > 1
"""
​
spark.sql(query).show()
+---+
|CNT|
+---+
|  3|
|  2|
+---+
conda list | grep blazingsql
blazingsql                0.17.0a         cuda10.2_py37_130    blazingsql-nightly/label/cuda10.2

beckernick avatar Nov 10 '20 16:11 beckernick

To clarify, I can do this without "selecting" the aggregation column, but i'd like to be able to select and use it.

from blazingsql import BlazingContext
import cudf
from dask.datasets import timeseries
import pandas as pd
​
# bc = BlazingContext()
​
df = pd.DataFrame({
    "a": ["a", "b","c","a","b","b"],
    "b": [10, 9, 5, 3, 2, 4],
    "c": pd.date_range("2020-11-04", "2020-11-09")
})
​
bc.create_table("df", df)
​
query = """
SELECT
    a
FROM df
GROUP BY a
HAVING COUNT(b) > 1
"""
​
print(bc.explain(query))
print(bc.sql(query))
LogicalProject(a=[$0])
  LogicalFilter(condition=[>($1, 1)])
    LogicalAggregate(group=[{0}], agg#0=[COUNT($1)])
      BindableTableScan(table=[[main, df]], projects=[[0, 1]], aliases=[[a, b]])

   a
0  b
1  a

beckernick avatar Nov 10 '20 17:11 beckernick

We have observed that Calcite sometimes has issues with aliases. What you want to do instead of this:

SELECT
    COUNT(b) as CNT
FROM sdf
GROUP BY a
HAVING CNT > 1

is do this:

SELECT
    COUNT(b) as CNT
FROM sdf
GROUP BY a
HAVING COUNT(b) > 1

To resolve this issue we would need to dig deep into the logic in Calcite which is a little trickier. To improve the general user experience this is something that we definitely want to address in the long run, but since there is a workaround, we are considering lower priority

wmalpica avatar Nov 13 '20 19:11 wmalpica

Thanks for the context 👍

beckernick avatar Nov 13 '20 20:11 beckernick

We should spend up to 3-4 hours investigating how to fix this. Potentially asking in Apache Calcite message boards. If we cant arrive at a solution within that time, we will have to postpone this

wmalpica avatar Dec 10 '20 20:12 wmalpica