druid
druid copied to clipboard
Incorrect result format for queries with single value filters and GROUPING SET on the same dimension
Affected Version
Tested on 0.22, 0.23 and 24.0
Description
Consider the following SQL query:
select sum(added) AS added_sum, countryName from wikipedia
where cityName = 'Cilacap'
and countryName IN ('Indonesia')
GROUP BY GROUPING SETS ( (FLOOR(__time TO MONTH), countryName), (FLOOR(__time TO MONTH)) )
The grouping set dimension countryName
is also included in the WHERE clause filtering on a single value. The result for this query ends up being:
added_sum | countryName
8 | Indonesia
8 | Indonesia
instead of:
added_sum | countryName
8 | Indonesia
8 | null
Since the second subtotal group doesn't include countryName
, the second result row should have this value as null. The aggregate results are correct either ways, but formatting the query results can be confusing since COALESCE(countryName, 'TOTAL')
as countryName does not work here.
Analysis
The native query plan for this query is:
[
{
"query": {
"queryType": "groupBy",
"dataSource": {
"type": "table",
"name": "wikipedia"
},
"intervals": {
"type": "intervals",
"intervals": [
"-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
]
},
"virtualColumns": [
{
"type": "expression",
"name": "v0",
"expression": "timestamp_floor(\"__time\",'P1M',null,'UTC')",
"outputType": "LONG"
}
],
"filter": {
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "cityName",
"value": "Cilacap"
},
{
"type": "selector",
"dimension": "countryName",
"value": "Indonesia"
}
]
},
"granularity": {
"type": "all"
},
"dimensions": [
{
"type": "default",
"dimension": "v0",
"outputName": "d0",
"outputType": "LONG"
}
],
"aggregations": [
{
"type": "longSum",
"name": "a0",
"fieldName": "added"
}
],
"postAggregations": [
{
"type": "expression",
"name": "p0",
"expression": "'Indonesia'"
}
],
"limitSpec": {
"type": "default",
"columns": [],
"limit": 1001
},
"subtotalsSpec": [
[
"d0"
],
[
"d0"
]
],
"context": {
"sqlOuterLimit": 1001,
"sqlQueryId": "52fce1a4-7d39-4c02-aad1-29cec2c2e3c3",
"timestampResultField": "d0",
"timestampResultFieldGranularity": "MONTH",
"timestampResultFieldInOriginalDimensions": 0
}
},
"signature": [
{
"name": "a0",
"type": "LONG"
},
{
"name": "p0",
"type": "STRING"
}
]
}
]
It can be seen that during the planning phase, the countryName
dimension was converted into a post aggregator and removed from the dimensions list. As a result, the subtotal grouping ends up grouping on __time
in both cases.
This issue does not arise if there are more than one values in the filter list for countryName
since this dimension can no longer be converted to a post agg and thus remains in the dimension list.
One possible fix that I'm currently evaluating is to skip dropping the dimension here if there is a subtotal spec existing for that dimension and modify the row signature accordingly. Unless there are any better ways to fix this issue, I can send out the PR after adequate testing.
Hmm. I'm not sure if the dummy-dim-dropping code is the problem. It's only dropping dummy dims: ones that don't appear at all in the output. But countryName
should appear, because it isn't always the same (due to the GROUPING SETS).
I feel like whatever replaced countryName
with a constant 'Indonesia'
in the post-grouping projection is the thing that got it wrong. What did that? Once this happened, then the dummy-dim-dropping code dropped it, which is totally reasonable.
Thanks for your insights @gianm
From the Calcite debug logs, I see the following activity within the HepPlanner
:
org.apache.calcite.plan.RelOptPlanner - call#0: Apply rule [ReduceExpressionsRule(Project)] to [rel#38:LogicalProject.NONE.[](input=HepRelVertex#37,EXPR$0=$2,countryName=$1)]
org.apache.calcite.plan.RelOptPlanner - call#0: Rule ReduceExpressionsRule(Project) arguments [rel#38:LogicalProject.NONE.[](input=HepRelVertex#37,EXPR$0=$2,countryName=$1)] produced LogicalProject#42
org.apache.calcite.plan.RelOptPlanner - call#1: Apply rule [ReduceExpressionsRule(Project)] to [rel#42:LogicalProject.NONE.[](input=HepRelVertex#37,EXPR$0=$2,countryName=CAST('Indonesia':VARCHAR):VARCHAR)]
org.apache.calcite.plan.RelOptPlanner - call#2: Apply rule [ReduceExpressionsRule(Project)] to [rel#34:LogicalProject.NONE.[](input=HepRelVertex#33,$f0=FLOOR($0, FLAG(MONTH)),countryName=$3,added=$1)]
org.apache.calcite.plan.RelOptPlanner - call#2: Rule ReduceExpressionsRule(Project) arguments [rel#34:LogicalProject.NONE.[](input=HepRelVertex#33,$f0=FLOOR($0, FLAG(MONTH)),countryName=$3,added=$1)] produced LogicalProject#44
org.apache.calcite.plan.RelOptPlanner - call#3: Apply rule [ReduceExpressionsRule(Project)] to [rel#44:LogicalProject.NONE.[](input=HepRelVertex#33,$f0=FLOOR($0, FLAG(MONTH)),countryName=CAST('Indonesia':VARCHAR):VARCHAR,added=$1)]
org.apache.calcite.plan.RelOptPlanner - call#4: Apply rule [ReduceExpressionsRule(Project)] to [rel#30:LogicalProject.NONE.[](input=HepRelVertex#29,__time=$0,added=$7,cityName=$18,countryName=$19)]
org.apache.calcite.plan.RelOptPlanner - call#5: Apply rule [ReduceExpressionsRule(Project)] to [rel#30:LogicalProject.NONE.[](input=HepRelVertex#29,__time=$0,added=$7,cityName=$18,countryName=$19)]
org.apache.calcite.plan.RelOptPlanner - call#6: Apply rule [ReduceExpressionsRule(Project)] to [rel#42:LogicalProject.NONE.[](input=HepRelVertex#37,EXPR$0=$2,countryName=CAST('Indonesia':VARCHAR):VARCHAR)]
org.apache.calcite.plan.RelOptPlanner - call#7: Apply rule [ReduceExpressionsRule(Project)] to [rel#44:LogicalProject.NONE.[](input=HepRelVertex#33,$f0=FLOOR($0, FLAG(MONTH)),countryName=CAST('Indonesia':VARCHAR):VARCHAR,added=$1)]
org.apache.calcite.plan.RelOptPlanner - call#8: Apply rule [ReduceExpressionsRule(Project)] to [rel#30:LogicalProject.NONE.[](input=HepRelVertex#29,__time=$0,added=$7,cityName=$18,countryName=$19)]
org.apache.calcite.plan.RelOptPlanner - call#9: Apply rule [ReduceExpressionsRule(Filter)] to [rel#32:LogicalFilter.NONE.[](input=HepRelVertex#31,condition=AND(=($2, 'Cilacap'), =($3, 'Indonesia')))]
org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#40:LogicalSort.NONE.[](input=HepRelVertex#39,fetch=1001)
org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#42:LogicalProject.NONE.[](input=HepRelVertex#37,EXPR$0=$2,countryName=CAST('Indonesia':VARCHAR):VARCHAR)
org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#36:LogicalAggregate.NONE.[](input=HepRelVertex#35,group={0, 1},groups=[{0, 1}, {0}],EXPR$0=SUM($2))
org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#44:LogicalProject.NONE.[](input=HepRelVertex#33,$f0=FLOOR($0, FLAG(MONTH)),countryName=CAST('Indonesia':VARCHAR):VARCHAR,added=$1)
org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#32:LogicalFilter.NONE.[](input=HepRelVertex#31,condition=AND(=($2, 'Cilacap'), =($3, 'Indonesia')))
org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#30:LogicalProject.NONE.[](input=HepRelVertex#29,__time=$0,added=$7,cityName=$18,countryName=$19)
org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#5:LogicalTableScan.NONE.[](table=[druid, wikipedia])
it looks like the countryName
dimension has been converted to CAST('Indonesia':VARCHAR):VARCHAR)
via the ProjectReduceExpressionsRule
during the HepPlanner
phase. As a result, during the post grouping projection phase, the selectProject already has countryName
as a constant which eventually converts it into a expression based post aggregation and then proceeds to drop the dimension.
Since the query has GROUPING SETS, shouldn't the calcite layer be avoiding this incorrect CAST?
does this seem like the right native plan for this query? (I replaced the "countryName" post-aggregator with a virtual column and then used that in the subTotalSpec) It generates the right output. We can agree on the native plan first and then see how to get to that native plan.
{
"queryType": "groupBy",
"dataSource": {
"type": "table",
"name": "wikipedia"
},
"intervals": {
"type": "intervals",
"intervals": [
"-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
]
},
"virtualColumns": [
{
"type": "expression",
"name": "v0",
"expression": "timestamp_floor(\"__time\",'P1M',null,'UTC')",
"outputType": "LONG"
},
{
"type": "expression",
"name": "p0",
"expression": "'Indonesia'"
}
],
"filter": {
"type": "and",
"fields": [
{
"type": "selector",
"dimension": "cityName",
"value": "Cilacap"
},
{
"type": "selector",
"dimension": "countryName",
"value": "Indonesia"
}
]
},
"granularity": {
"type": "all"
},
"dimensions": [
{
"type": "default",
"dimension": "v0",
"outputName": "d0",
"outputType": "LONG"
},
{
"type": "default",
"dimension": "p0",
"outputName": "d1"
}
],
"aggregations": [
{
"type": "longSum",
"name": "a0",
"fieldName": "added"
}
],
"postAggregations": [
],
"limitSpec": {
"type": "default",
"columns": [],
"limit": 1001
},
"subtotalsSpec": [
[
"d0", "d1"
],
[
"d0"
]
],
"context": {
"sqlOuterLimit": 1001,
"sqlQueryId": "bc0e203a-d555-4b22-99d2-74a0794da2b8",
"timestampResultField": "d0",
"timestampResultFieldGranularity": "MONTH",
"timestampResultFieldInOriginalDimensions": 0,
"useNativeQueryExplain": true
}
}