
bug: aggregate scoping got unexpected results

Open ZhiHanZ opened this issue 1 year ago • 4 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

Version

v1.2.208-nightly

What's Wrong?

I tried to replicate the tricky cases mentioned in the blog, and found that our aggregation behaves differently from DuckDB and CockroachDB.

Preparation:

CREATE TABLE aa (a int);
INSERT INTO aa VALUES (1), (2), (3);
CREATE TABLE xx (x int);
INSERT INTO xx VALUES (10), (20), (30);

SQL:

SELECT (SELECT sum(a+x) FROM xx LIMIT 1) FROM aa;

databend result (bendsql cli)

┌───────────────────────────────────────┐
│ (select sum((a + x)) from xx limit 1) │
│            Nullable(Int64)            │
├───────────────────────────────────────┤
│                                    63 │
│                                  NULL │
│                                  NULL │
└───────────────────────────────────────┘

duckdb result

┌───────────────────────────────────────┐
│ (SELECT sum((a + x)) FROM xx LIMIT 1) │
│                int128                 │
├───────────────────────────────────────┤
│                                    63 │
│                                    66 │
│                                    69 │
└───────────────────────────────────────┘
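DuckDB's per-row results follow from correlated-subquery semantics: for each outer row `a`, the subquery recomputes `sum(a + x)` over all of `xx`. A minimal Python sketch of those semantics, using the table values from the preparation above:

```python
# Outer table aa and inner table xx from the reproduction setup.
aa = [1, 2, 3]
xx = [10, 20, 30]

# For each outer row `a`, the correlated subquery re-evaluates
# sum(a + x) over every row of xx. LIMIT 1 is a no-op here because
# the aggregate already yields exactly one row per evaluation.
results = [sum(a + x for x in xx) for a in aa]
print(results)  # [63, 66, 69] -- matches the DuckDB output
```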
SELECT (SELECT sum(a) FROM xx LIMIT 1) FROM aa;

databend result error: APIError: ResponseError with 1006: Unable to get field named "0". Valid fields: []

duckdb result

┌─────────────────────────────────┐
│ (SELECT sum(a) FROM xx LIMIT 1) │
│             int128              │
├─────────────────────────────────┤
│                               6 │
└─────────────────────────────────┘
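DuckDB's single-row answer of 6 is consistent with the SQL aggregate-scoping rule (the "tricky case" the blog discusses): an aggregate whose arguments reference only outer columns belongs to the outer query's scope. So `sum(a)` aggregates over `aa`, collapsing the outer query to one row, even though it is written inside the subquery. A sketch of that reading, assuming this scoping rule:

```python
aa = [1, 2, 3]

# sum(a) references only the outer column `a`, so under standard SQL
# aggregate scoping it is evaluated in the OUTER query's scope:
# the whole query collapses to a single aggregated row.
result = sum(aa)
print(result)  # 6 -- one row, matching DuckDB
```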

The rest of the SQL statements:

SELECT (SELECT sum(1) FROM xx LIMIT 1) FROM aa;
SELECT (SELECT sum(x) FROM xx LIMIT 1) FROM aa;

produce the same results.
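These two subqueries are uncorrelated, so, assuming the usual scalar-subquery semantics, each evaluates once and its value repeats for every row of `aa`. A sketch of the expected results:

```python
aa = [1, 2, 3]
xx = [10, 20, 30]

# Neither subquery references the outer table, so each is evaluated
# once and its scalar value repeats for every outer row of aa.
sum_1 = [sum(1 for _ in xx) for _ in aa]  # SELECT sum(1) FROM xx -> 3
sum_x = [sum(xx) for _ in aa]             # SELECT sum(x) FROM xx -> 60
print(sum_1)  # [3, 3, 3]
print(sum_x)  # [60, 60, 60]
```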

How to Reproduce?

No response

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

ZhiHanZ avatar Nov 15 '23 00:11 ZhiHanZ

Hi, I'm new here. I'm interested in this issue, but I'm not yet familiar with the project. Apart from EXPLAIN, is there any other way to see the data flow of SQL execution?

suimenno3002 avatar Jan 31 '24 05:01 suimenno3002

@suimenno3002 Hello, and thank you for your interest in this issue.

To your question: EXPLAIN is the right tool for this. With the latest version (v1.2.132-nightly), the plan looks like this:

-[ EXPLAIN ]-----------------------------------
EvalScalar
├── output columns: [(select sum((a + x)) from xx limit 1) (#4)]
├── expressions: [scalar_subquery_3 (#3)]
├── estimated rows: 3.00
└── HashJoin
    ├── output columns: [sum((a + x)) (#3)]
    ├── join type: LEFT SINGLE
    ├── build keys: [a (#5)]
    ├── probe keys: [a (#0)]
    ├── filters: []
    ├── estimated rows: 3.00
    ├── Limit(Build)
    │   ├── output columns: [sum((a + x)) (#3), a (#5)]
    │   ├── limit: 1
    │   ├── offset: 0
    │   ├── estimated rows: 0.00
    │   └── AggregateFinal
    │       ├── output columns: [sum((a + x)) (#3), a (#5)]
    │       ├── group by: [a]
    │       ├── aggregate functions: [sum(sum_arg_0)]
    │       ├── limit: 1
    │       ├── estimated rows: 0.00
    │       └── AggregatePartial
    │           ├── output columns: [sum((a + x)) (#3), #_group_by_key]
    │           ├── group by: [a]
    │           ├── aggregate functions: [sum(sum_arg_0)]
    │           ├── estimated rows: 0.00
    │           └── EvalScalar
    │               ├── output columns: [a (#5), sum_arg_0 (#2)]
    │               ├── expressions: [a (#5) + xx.x (#1)]
    │               ├── estimated rows: 0.00
    │               └── HashJoin
    │                   ├── output columns: [xx.x (#1), a (#5)]
    │                   ├── join type: CROSS
    │                   ├── build keys: []
    │                   ├── probe keys: []
    │                   ├── filters: []
    │                   ├── estimated rows: 0.00
    │                   ├── AggregateFinal(Build)
    │                   │   ├── output columns: [a (#5)]
    │                   │   ├── group by: [a]
    │                   │   ├── aggregate functions: []
    │                   │   ├── estimated rows: 0.00
    │                   │   └── AggregatePartial
    │                   │       ├── output columns: [#_group_by_key]
    │                   │       ├── group by: [a]
    │                   │       ├── aggregate functions: []
    │                   │       ├── estimated rows: 0.00
    │                   │       └── TableScan
    │                   │           ├── table: default.checksb.aa
    │                   │           ├── output columns: [a (#5)]
    │                   │           ├── read rows: 3
    │                   │           ├── read bytes: 44
    │                   │           ├── partitions total: 1
    │                   │           ├── partitions scanned: 1
    │                   │           ├── pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 1 to 1, bloom pruning: 0 to 0>]
    │                   │           ├── push downs: [filters: [], limit: NONE]
    │                   │           └── estimated rows: 0.00
    │                   └── TableScan(Probe)
    │                       ├── table: default.checksb.xx
    │                       ├── output columns: [x (#1)]
    │                       ├── read rows: 3
    │                       ├── read bytes: 44
    │                       ├── partitions total: 1
    │                       ├── partitions scanned: 1
    │                       ├── pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 1 to 1, bloom pruning: 0 to 0>]
    │                       ├── push downs: [filters: [], limit: NONE]
    │                       └── estimated rows: 3.00
    └── TableScan(Probe)
        ├── table: default.checksb.aa
        ├── output columns: [a (#0)]
        ├── read rows: 3
        ├── read bytes: 44
        ├── partitions total: 1
        ├── partitions scanned: 1
        ├── pruning stats: [segments: <range pruning: 1 to 1>, blocks: <range pruning: 1 to 1, bloom pruning: 0 to 0>]
        ├── push downs: [filters: [], limit: NONE]
        └── estimated rows: 3.00

And the result is:

bohu@(askbend-small)/checksb> SELECT (SELECT sum(a+x) FROM xx LIMIT 1) FROM aa;

SELECT
  (
    SELECT
      sum(a + x)
    FROM
      xx
    LIMIT
      1
  )
FROM
  aa

┌───────────────────────────────────────┐
│ (select sum((a + x)) from xx limit 1) │
│            Nullable(Int64)            │
├───────────────────────────────────────┤
│                                    63 │
│                                  NULL │
│                                  NULL │
└───────────────────────────────────────┘

But Snowflake returns an error here (so it does not follow DuckDB or CockroachDB either):

Unsupported subquery type cannot be evaluated

BohuTANG avatar Jan 31 '24 06:01 BohuTANG

@BohuTANG

Thanks for your reply. I found the following statement in the Snowflake documentation:

The only type of subquery that allows a LIMIT / FETCH clause is an uncorrelated scalar subquery. Also, because an uncorrelated scalar subquery returns only 1 row, the LIMIT clause has little or no practical value inside a subquery.

So if we want to align with Snowflake's behavior in this situation, maybe we can return a clear error message like Snowflake does.

suimenno3002 avatar Jan 31 '24 08:01 suimenno3002

The issue involves a correlated subquery containing an aggregate. It can be a bit complex, especially for a beginner in Databend.

I need to evaluate it and decide how to resolve the issue (support it, or report an error).

Thanks for your interest @suimenno3002 !

xudong963 avatar Jan 31 '24 10:01 xudong963