blazingsql
blazingsql copied to clipboard
[FEA] Support aliases in aggregation clauses
I'd like to be able to do a GROUP BY aggregation and then filter the resulting aggregated data based on a condition using the HAVING clause while including the aggregation column. In the 2020-11-09 nightlies, this looks like it's not supported. Based on the Calcite reference It looks like this could be possible.
One use case might be doing a group count and filtering to only groups with count greater than one.
from blazingsql import BlazingContext
import cudf
from dask.datasets import timeseries
import pandas as pd
# bc = BlazingContext()
df = pd.DataFrame({
"a": ["a", "b","c","a","b","b"],
"b": [10, 9, 5, 3, 2, 4],
})
bc.create_table("df", df)
query = """
SELECT
COUNT(b) as CNT
FROM df
GROUP BY a
HAVING CNT > 1
"""
print(bc.explain(query))
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so in com.blazingdb.calcite.application.RelationalAlgebraGenerator.getRelationalAlgebraString()
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so in com.blazingdb.calcite.application.RelationalAlgebraGenerator.getRelationalAlgebra()
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so in com.blazingdb.calcite.application.RelationalAlgebraGenerator.getNonOptimizedRelationalAlgebra()
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so in com.blazingdb.calcite.application.RelationalAlgebraGenerator.validateQuery()
Exception: Java Exception
The above exception was the direct cause of the following exception:
com.blazingdb.calcite.application.SqlValidationExceptionTraceback (most recent call last)
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/pyblazing/apiv2/context.py in explain(self, sql)
1711 try:
-> 1712 algebra = self.generator.getRelationalAlgebraString(sql)
1713
com.blazingdb.calcite.application.SqlValidationException: com.blazingdb.calcite.application.SqlValidationException: Column 'CNT' not found in any table
During handling of the above exception, another exception occurred:
Exception Traceback (most recent call last)
<ipython-input-64-6f215191bda4> in <module>
21 """
22
---> 23 print(bc.explain(query))
24 print(bc.sql(query))
/raid/nicholasb/miniconda3/envs/rapids-tpcxbb-20201109/lib/python3.7/site-packages/pyblazing/apiv2/context.py in explain(self, sql)
1714 except SqlValidationExceptionClass as exception:
1715 # jpype.JException as exception:
-> 1716 raise Exception(exception.message())
1717 # algebra = ""
1718 # print("SQL Parsing Error")
Exception: Column 'CNT' not found in any table
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local") \
.getOrCreate()
sdf = spark.createDataFrame(df)
sdf.createTempView("sdf")
query = """
SELECT
COUNT(b) as CNT
FROM sdf
GROUP BY a
HAVING CNT > 1
"""
spark.sql(query).show()
+---+
|CNT|
+---+
| 3|
| 2|
+---+
conda list | grep blazingsql
blazingsql 0.17.0a cuda10.2_py37_130 blazingsql-nightly/label/cuda10.2
To clarify, I can do this without "selecting" the aggregation column, but i'd like to be able to select and use it.
from blazingsql import BlazingContext
import cudf
from dask.datasets import timeseries
import pandas as pd
# bc = BlazingContext()
df = pd.DataFrame({
"a": ["a", "b","c","a","b","b"],
"b": [10, 9, 5, 3, 2, 4],
"c": pd.date_range("2020-11-04", "2020-11-09")
})
bc.create_table("df", df)
query = """
SELECT
a
FROM df
GROUP BY a
HAVING COUNT(b) > 1
"""
print(bc.explain(query))
print(bc.sql(query))
LogicalProject(a=[$0])
LogicalFilter(condition=[>($1, 1)])
LogicalAggregate(group=[{0}], agg#0=[COUNT($1)])
BindableTableScan(table=[[main, df]], projects=[[0, 1]], aliases=[[a, b]])
a
0 b
1 a
We have observed that Calcite sometimes has issues with aliases. What you want to do instead of this:
SELECT
COUNT(b) as CNT
FROM sdf
GROUP BY a
HAVING CNT > 1
is do this:
SELECT
COUNT(b) as CNT
FROM sdf
GROUP BY a
HAVING COUNT(b) > 1
To resolve this issue we would need to dig deep into the logic in Calcite which is a little trickier. To improve the general user experience this is something that we definitely want to address in the long run, but since there is a workaround, we are considering lower priority
Thanks for the context 👍
We should spend up to 3-4 hours investigating how to fix this. Potentially asking in Apache Calcite message boards. If we cant arrive at a solution within that time, we will have to postpone this