druid Incorrect result format for queries with single value filters and GROUPING SET on the same dimension

Affected Version

Tested on 0.22, 0.23 and 24.0

Description

Consider the following SQL query:

select sum(added) AS added_sum, countryName from wikipedia 
where cityName = 'Cilacap'
and countryName IN ('Indonesia')
GROUP BY GROUPING SETS ( (FLOOR(__time TO MONTH), countryName), (FLOOR(__time TO MONTH)) )

The grouping set dimension countryName is also included in the WHERE clause filtering on a single value. The result for this query ends up being:

added_sum | countryName
8         | Indonesia
8         | Indonesia

instead of:

added_sum | countryName
8         | Indonesia
8         | null

Since the second subtotal group doesn't include countryName , the second result row should have this value as null. The aggregate results are correct either ways, but formatting the query results can be confusing since COALESCE(countryName, 'TOTAL') as countryName does not work here.

Analysis

The native query plan for this query is:

[
    {
        "query": {
            "queryType": "groupBy",
            "dataSource": {
                "type": "table",
                "name": "wikipedia"
            },
            "intervals": {
                "type": "intervals",
                "intervals": [
                    "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
                ]
            },
            "virtualColumns": [
                {
                    "type": "expression",
                    "name": "v0",
                    "expression": "timestamp_floor(\"__time\",'P1M',null,'UTC')",
                    "outputType": "LONG"
                }
            ],
            "filter": {
                "type": "and",
                "fields": [
                    {
                        "type": "selector",
                        "dimension": "cityName",
                        "value": "Cilacap"
                    },
                    {
                        "type": "selector",
                        "dimension": "countryName",
                        "value": "Indonesia"
                    }
                ]
            },
            "granularity": {
                "type": "all"
            },
            "dimensions": [
                {
                    "type": "default",
                    "dimension": "v0",
                    "outputName": "d0",
                    "outputType": "LONG"
                }
            ],
            "aggregations": [
                {
                    "type": "longSum",
                    "name": "a0",
                    "fieldName": "added"
                }
            ],
            "postAggregations": [
                {
                    "type": "expression",
                    "name": "p0",
                    "expression": "'Indonesia'"
                }
            ],
            "limitSpec": {
                "type": "default",
                "columns": [],
                "limit": 1001
            },
            "subtotalsSpec": [
                [
                    "d0"
                ],
                [
                    "d0"
                ]
            ],
            "context": {
                "sqlOuterLimit": 1001,
                "sqlQueryId": "52fce1a4-7d39-4c02-aad1-29cec2c2e3c3",
                "timestampResultField": "d0",
                "timestampResultFieldGranularity": "MONTH",
                "timestampResultFieldInOriginalDimensions": 0
            }
        },
        "signature": [
            {
                "name": "a0",
                "type": "LONG"
            },
            {
                "name": "p0",
                "type": "STRING"
            }
        ]
    }
]

It can be seen that during the planning phase, the countryName dimension was converted into a post aggregator and removed from the dimensions list. As a result, the subtotal grouping ends up grouping on __time in both cases. This issue does not arise if there are more than one values in the filter list for countryName since this dimension can no longer be converted to a post agg and thus remains in the dimension list.

One possible fix that I'm currently evaluating is to skip dropping the dimension here if there is a subtotal spec existing for that dimension and modify the row signature accordingly. Unless there are any better ways to fix this issue, I can send out the PR after adequate testing.

Oct 10 '22 22:10 a2l007

Hmm. I'm not sure if the dummy-dim-dropping code is the problem. It's only dropping dummy dims: ones that don't appear at all in the output. But countryName should appear, because it isn't always the same (due to the GROUPING SETS).

I feel like whatever replaced countryName with a constant 'Indonesia' in the post-grouping projection is the thing that got it wrong. What did that? Once this happened, then the dummy-dim-dropping code dropped it, which is totally reasonable.

Oct 11 '22 04:10 gianm

Thanks for your insights @gianm From the Calcite debug logs, I see the following activity within the HepPlanner:

org.apache.calcite.plan.RelOptPlanner - call#0: Apply rule [ReduceExpressionsRule(Project)] to [rel#38:LogicalProject.NONE.[](input=HepRelVertex#37,EXPR$0=$2,countryName=$1)]
org.apache.calcite.plan.RelOptPlanner - call#0: Rule ReduceExpressionsRule(Project) arguments [rel#38:LogicalProject.NONE.[](input=HepRelVertex#37,EXPR$0=$2,countryName=$1)] produced LogicalProject#42
org.apache.calcite.plan.RelOptPlanner - call#1: Apply rule [ReduceExpressionsRule(Project)] to [rel#42:LogicalProject.NONE.[](input=HepRelVertex#37,EXPR$0=$2,countryName=CAST('Indonesia':VARCHAR):VARCHAR)]
org.apache.calcite.plan.RelOptPlanner - call#2: Apply rule [ReduceExpressionsRule(Project)] to [rel#34:LogicalProject.NONE.[](input=HepRelVertex#33,$f0=FLOOR($0, FLAG(MONTH)),countryName=$3,added=$1)]
org.apache.calcite.plan.RelOptPlanner - call#2: Rule ReduceExpressionsRule(Project) arguments [rel#34:LogicalProject.NONE.[](input=HepRelVertex#33,$f0=FLOOR($0, FLAG(MONTH)),countryName=$3,added=$1)] produced LogicalProject#44
org.apache.calcite.plan.RelOptPlanner - call#3: Apply rule [ReduceExpressionsRule(Project)] to [rel#44:LogicalProject.NONE.[](input=HepRelVertex#33,$f0=FLOOR($0, FLAG(MONTH)),countryName=CAST('Indonesia':VARCHAR):VARCHAR,added=$1)]
org.apache.calcite.plan.RelOptPlanner - call#4: Apply rule [ReduceExpressionsRule(Project)] to [rel#30:LogicalProject.NONE.[](input=HepRelVertex#29,__time=$0,added=$7,cityName=$18,countryName=$19)]
org.apache.calcite.plan.RelOptPlanner - call#5: Apply rule [ReduceExpressionsRule(Project)] to [rel#30:LogicalProject.NONE.[](input=HepRelVertex#29,__time=$0,added=$7,cityName=$18,countryName=$19)]
org.apache.calcite.plan.RelOptPlanner - call#6: Apply rule [ReduceExpressionsRule(Project)] to [rel#42:LogicalProject.NONE.[](input=HepRelVertex#37,EXPR$0=$2,countryName=CAST('Indonesia':VARCHAR):VARCHAR)]
org.apache.calcite.plan.RelOptPlanner - call#7: Apply rule [ReduceExpressionsRule(Project)] to [rel#44:LogicalProject.NONE.[](input=HepRelVertex#33,$f0=FLOOR($0, FLAG(MONTH)),countryName=CAST('Indonesia':VARCHAR):VARCHAR,added=$1)]
org.apache.calcite.plan.RelOptPlanner - call#8: Apply rule [ReduceExpressionsRule(Project)] to [rel#30:LogicalProject.NONE.[](input=HepRelVertex#29,__time=$0,added=$7,cityName=$18,countryName=$19)]
org.apache.calcite.plan.RelOptPlanner - call#9: Apply rule [ReduceExpressionsRule(Filter)] to [rel#32:LogicalFilter.NONE.[](input=HepRelVertex#31,condition=AND(=($2, 'Cilacap'), =($3, 'Indonesia')))]

org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#40:LogicalSort.NONE.[](input=HepRelVertex#39,fetch=1001)
org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#42:LogicalProject.NONE.[](input=HepRelVertex#37,EXPR$0=$2,countryName=CAST('Indonesia':VARCHAR):VARCHAR)
org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#36:LogicalAggregate.NONE.[](input=HepRelVertex#35,group={0, 1},groups=[{0, 1}, {0}],EXPR$0=SUM($2))
org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#44:LogicalProject.NONE.[](input=HepRelVertex#33,$f0=FLOOR($0, FLAG(MONTH)),countryName=CAST('Indonesia':VARCHAR):VARCHAR,added=$1)
org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#32:LogicalFilter.NONE.[](input=HepRelVertex#31,condition=AND(=($2, 'Cilacap'), =($3, 'Indonesia')))
org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#30:LogicalProject.NONE.[](input=HepRelVertex#29,__time=$0,added=$7,cityName=$18,countryName=$19)
org.apache.calcite.plan.RelOptPlanner - For final plan, using rel#5:LogicalTableScan.NONE.[](table=[druid, wikipedia])

it looks like the countryName dimension has been converted to CAST('Indonesia':VARCHAR):VARCHAR) via the ProjectReduceExpressionsRule during the HepPlanner phase. As a result, during the post grouping projection phase, the selectProject already has countryName as a constant which eventually converts it into a expression based post aggregation and then proceeds to drop the dimension. Since the query has GROUPING SETS, shouldn't the calcite layer be avoiding this incorrect CAST?

Oct 15 '22 00:10 a2l007

does this seem like the right native plan for this query? (I replaced the "countryName" post-aggregator with a virtual column and then used that in the subTotalSpec) It generates the right output. We can agree on the native plan first and then see how to get to that native plan.

{
  "queryType": "groupBy",
  "dataSource": {
    "type": "table",
    "name": "wikipedia"
  },
  "intervals": {
    "type": "intervals",
    "intervals": [
      "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
    ]
  },
  "virtualColumns": [
    {
      "type": "expression",
      "name": "v0",
      "expression": "timestamp_floor(\"__time\",'P1M',null,'UTC')",
      "outputType": "LONG"
    },
    {
      "type": "expression",
      "name": "p0",
      "expression": "'Indonesia'"
    }
  ],
  "filter": {
    "type": "and",
    "fields": [
      {
        "type": "selector",
        "dimension": "cityName",
        "value": "Cilacap"
      },
      {
        "type": "selector",
        "dimension": "countryName",
        "value": "Indonesia"
      }
    ]
  },
  "granularity": {
    "type": "all"
  },
  "dimensions": [
    {
      "type": "default",
      "dimension": "v0",
      "outputName": "d0",
      "outputType": "LONG"
    },
    {
      "type": "default",
      "dimension": "p0",
      "outputName": "d1"
    }
  ],
  "aggregations": [
    {
      "type": "longSum",
      "name": "a0",
      "fieldName": "added"
    }
  ],
  "postAggregations": [
   
  ],
  "limitSpec": {
    "type": "default",
    "columns": [],
    "limit": 1001
  },
  "subtotalsSpec": [
    [
      "d0", "d1"
    ],
    [
      "d0"
    ]
  ],
  "context": {
    "sqlOuterLimit": 1001,
    "sqlQueryId": "bc0e203a-d555-4b22-99d2-74a0794da2b8",
    "timestampResultField": "d0",
    "timestampResultFieldGranularity": "MONTH",
    "timestampResultFieldInOriginalDimensions": 0,
    "useNativeQueryExplain": true
  }
}

Oct 15 '22 05:10 abhishekagarwal87

druid druid copied to clipboard

Incorrect result format for queries with single value filters and GROUPING SET on the same dimension

Affected Version

Description

Analysis

druid
druid copied to clipboard