blazingsql [FEA] Support aliases in aggregation clauses

I'd like to be able to do a GROUP BY aggregation and then filter the resulting aggregated data based on a condition using the HAVING clause while including the aggregation column. In the 2020-11-09 nightlies, this looks like it's not supported. Based on the Calcite reference It looks like this could be possible.

One use case might be doing a group count and filtering to only groups with count greater than one.

from blazingsql import BlazingContext
import cudf
from dask.datasets import timeseries
import pandas as pd

# bc = BlazingContext()

df = pd.DataFrame({
    "a": ["a", "b","c","a","b","b"],
    "b": [10, 9, 5, 3, 2, 4],
})

bc.create_table("df", df)

query = """
SELECT
    COUNT(b) as CNT
FROM df
GROUP BY a
HAVING CNT > 1
"""

print(bc.explain(query))
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so in com.blazingdb.calcite.application.RelationalAlgebraGenerator.getRelationalAlgebraString()

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so in com.blazingdb.calcite.application.RelationalAlgebraGenerator.getRelationalAlgebra()

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so in com.blazingdb.calcite.application.RelationalAlgebraGenerator.getNonOptimizedRelationalAlgebra()

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so in com.blazingdb.calcite.application.RelationalAlgebraGenerator.validateQuery()

Exception: Java Exception

The above exception was the direct cause of the following exception:

com.blazingdb.calcite.application.SqlValidationExceptionTraceback (most recent call last)
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/pyblazing/apiv2/context.py in explain(self, sql)
   1711         try:
-> 1712             algebra = self.generator.getRelationalAlgebraString(sql)
   1713 

com.blazingdb.calcite.application.SqlValidationException: com.blazingdb.calcite.application.SqlValidationException: Column 'CNT' not found in any table

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-64-6f215191bda4> in <module>
     21 """
     22 
---> 23 print(bc.explain(query))
     24 print(bc.sql(query))

/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/pyblazing/apiv2/context.py in explain(self, sql)
   1714         except SqlValidationExceptionClass as exception:
   1715             # jpype.JException as exception:
-> 1716             raise Exception(exception.message())
   1717             # algebra = ""
   1718             # print("SQL Parsing Error")

Exception: Column 'CNT' not found in any table

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .getOrCreate()

sdf = spark.createDataFrame(df)
sdf.createTempView("sdf")

query = """
SELECT
    COUNT(b) as CNT
FROM sdf
GROUP BY a
HAVING CNT > 1
"""

spark.sql(query).show()
+---+
|CNT|
+---+
|  3|
|  2|
+---+

conda list | grep blazingsql
blazingsql                0.17.0a         cuda10.2_py37_130    blazingsql-nightly/label/cuda10.2

Nov 10 '20 16:11 beckernick

To clarify, I can do this without "selecting" the aggregation column, but i'd like to be able to select and use it.

from blazingsql import BlazingContext
import cudf
from dask.datasets import timeseries
import pandas as pd

# bc = BlazingContext()

df = pd.DataFrame({
    "a": ["a", "b","c","a","b","b"],
    "b": [10, 9, 5, 3, 2, 4],
    "c": pd.date_range("2020-11-04", "2020-11-09")
})

bc.create_table("df", df)

query = """
SELECT
    a
FROM df
GROUP BY a
HAVING COUNT(b) > 1
"""

print(bc.explain(query))
print(bc.sql(query))
LogicalProject(a=[$0])
  LogicalFilter(condition=[>($1, 1)])
    LogicalAggregate(group=[{0}], agg#0=[COUNT($1)])
      BindableTableScan(table=[[main, df]], projects=[[0, 1]], aliases=[[a, b]])

   a
0  b
1  a

Nov 10 '20 17:11 beckernick

We have observed that Calcite sometimes has issues with aliases. What you want to do instead of this:

SELECT
    COUNT(b) as CNT
FROM sdf
GROUP BY a
HAVING CNT > 1

is do this:

SELECT
    COUNT(b) as CNT
FROM sdf
GROUP BY a
HAVING COUNT(b) > 1

To resolve this issue we would need to dig deep into the logic in Calcite which is a little trickier. To improve the general user experience this is something that we definitely want to address in the long run, but since there is a workaround, we are considering lower priority

Nov 13 '20 19:11 wmalpica

Thanks for the context 👍

Nov 13 '20 20:11 beckernick

We should spend up to 3-4 hours investigating how to fix this. Potentially asking in Apache Calcite message boards. If we cant arrive at a solution within that time, we will have to postpone this

Dec 10 '20 20:12 wmalpica

blazingsql blazingsql copied to clipboard

[FEA] Support aliases in aggregation clauses

blazingsql
blazingsql copied to clipboard