openCypher icon indicating copy to clipboard operation
openCypher copied to clipboard

Grouping key selection for aggregating subqueries

Open jmarton opened this issue 7 years ago • 7 comments

CIR-2017-219

Grouping key selection for aggregating subqueries

Introduction

openCypher provides way to express aggregation queries by an implicit grouping key, and the Neo4j 3.1 documentation tells:

RETURN n, count(*)

We have two return expressions: n, and count(*). The first, n, is not an aggregate function, and so it will be the grouping key. The latter, count(*) is an aggregate expression.

As long as aggregate and non-aggregate expressions are clearly separated in the RETURN/WITH expression list, i.e. each comma-separated expression is either an aggregate or a non-aggregate expression, the situation seems clear. We have seen an example demonstrating the problem, and a proposed solution (#188, #218) is to force clear separation of aggregate and non-aggregate expressions.

In this CIR, I'd like to propose a solution that fits in the current syntax and it's semantics is still clear.

Options

  1. grouping key is the tuple built from all variables (node, relationship, their properties or variables chained from previous subquery) that appear outside of aggregate functions of a particular WITH/RETURN clause
  2. each item of the expression list in WITH/RETURN are forced to contain either (i) no aggregate function or a (ii) single aggregate function at the outermost level (this is the approach in #188, #218). The grouping key is the tuple built from items of type (i)

Discussion

Neither option 1 nor option 2 restricts the expressiveness of the language as the resultset of either approach can be expressed using the other approach by introducing subqueries.

Option 1: group by variables that appear outside of aggregate functions

The most notable problem with this approach is that it would change behaviour of Neo4j even when aggregating and non-aggregating return items are clearly separated.

On the other hand, grouping on the variables, i.e. the basic building blocks of an expression seems to be flexible yet still clear way in case of a complex expression.

Variables can stand for:

  • nodes
  • relationships
  • properties of nodes or relationships
  • expressions chained from previous subqueries

Option 2: group by expressions that has no aggregation

The most obvious benefit of this approach is that Neo4j currently works this way if each return item is either aggregating or non-aggregating.

However, counter-intuitive results come if mixed. Posing restrictions on creating complex by mixing aggregating and non-aggregating expressions is a safety net for beginners, but cumbersome for more complex queries.

Examples

Sample data

We initialize our database of 10 nodes and no relationships. 5 of them has a single label Foo, the other 5 has Bar, both partition having a single node for weights of -2, -1, 0, 1, 2.

unwind range(1, 5) as i
create (:Foo {id: i, weight: i-3})
;
unwind range(1, 5) as i
create (:Bar {id: 10+i, weight: i-3})
;

match (n)
return labels(n), n
order by n.id
;

Which leads to the dataset:

╒═══════════╤═════════════════════════╕
│"labels(n)"│"n"                      │
╞═══════════╪═════════════════════════╡
│["Foo"]    │{"weight":"-2","id":"1"} │
├───────────┼─────────────────────────┤
│["Foo"]    │{"weight":"-1","id":"2"} │
├───────────┼─────────────────────────┤
│["Foo"]    │{"weight":"0","id":"3"}  │
├───────────┼─────────────────────────┤
│["Foo"]    │{"weight":"1","id":"4"}  │
├───────────┼─────────────────────────┤
│["Foo"]    │{"weight":"2","id":"5"}  │
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"-2","id":"11"}│
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"-1","id":"12"}│
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"0","id":"13"} │
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"1","id":"14"} │
├───────────┼─────────────────────────┤
│["Bar"]    │{"weight":"2","id":"15"} │
└───────────┴─────────────────────────┘

Query1a: retrieving a variable and an aggregate: a clear situation

match (n)
return n.weight, count(*)

Both approaces should return the same as the non-aggregating return item is a single variable:

╒══════════╤══════════╕
│"n.weight"│"count(*)"│
╞══════════╪══════════╡
│"2"       │"2"       │
├──────────┼──────────┤
│"1"       │"2"       │
├──────────┼──────────┤
│"-2"      │"2"       │
├──────────┼──────────┤
│"0"       │"2"       │
├──────────┼──────────┤
│"-1"      │"2"       │
└──────────┴──────────┘

Query1b: retrieving a variable and a computation based on aggregate

match (n)
return n.weight as w, 2*count(*) as c2

This is still clear, however, using option 2, we have to rewrite this as:

match (n)
with n.weight as w, count(*) as c
return w, 2*c as c2

Query2: aggregating and non-aggregating return items separated, retrieving result of non-aggregating calculation

match (n)
return abs(n.weight) as abs, count(*) as count

Option 2

╒═════╤═══════╕
│"abs"│"count"│
╞═════╪═══════╡
│"2"  │"4"    │
├─────┼───────┤
│"1"  │"4"    │
├─────┼───────┤
│"0"  │"2"    │
└─────┴───────┘

When using Option 1, this result can be expressed like

match (n)
with abs(n.weight) as abs, n
return abs, count(*) as count

Option 1

╒═════╤═══════╕
│"abs"│"count"│
╞═════╪═══════╡
│"2"  │"2"    │
├─────┼───────┤
│"1"  │"2"    │
├─────┼───────┤
│"2"  │"2"    │
├─────┼───────┤
│"0"  │"2"    │
├─────┼───────┤
│"1"  │"2"    │
└─────┴───────┘

When using semantics of Option 2, this result can be achieved like

match (n)
with n.weight as w, count(*) as c
return abs(w) as abs, c as count

Query 3: returning complex, single item

match (n: Foo), (m:Bar)
where n.weight = - m.weight
return n.weight+m.weight+count(*) as expr
     , collect(n) as nc, collect(m) as mc

Option 2

Can't handle such a query as non-aggregating and aggregating expressions are mixed in 1st return item.

Option 1

Option 1 should return:

╒══════╤══════════════════════════╤═══════════════════════════╕
│"expr"│"nc"                      │"mc"                       │
╞══════╪══════════════════════════╪═══════════════════════════╡
│"1"   │[{"weight":"2","id":"5"}] │[{"weight":"-2","id":"11"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│"1"   │[{"weight":"1","id":"4"}] │[{"weight":"-1","id":"12"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│"1"   │[{"weight":"-1","id":"2"}]│[{"weight":"1","id":"14"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│"1"   │[{"weight":"0","id":"3"}] │[{"weight":"0","id":"13"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│"1"   │[{"weight":"-2","id":"1"}]│[{"weight":"2","id":"15"}] │
└──────┴──────────────────────────┴───────────────────────────┘

This result expressed when using syntax and semantics of option 2:

match (n: Foo), (m:Bar)
where n.weight = - m.weight
with n.weight as nw, m.weight as mw
   , count(*) as c
   , collect(n) as nc, collect(m) as mc
return nw+mw+c as expr, nc, mc

jmarton avatar Apr 17 '17 22:04 jmarton

Option 3

The solution in the option lies in replacing aggregation function call by the neutral element of the operator which binds aggregation. Replacement should be applied in mixed (aggregation and non-aggregation) expression only and if possible only. E.G. in the following expression:

e1 + count(*) + e2 * count(*) + e3

for designation of aggregation key aggregation function call should be replaced in following way:

e1 + 0 + e2 * 1 + e3

and that expression becomes an aggregation key component. In above example aggregation function call bound by arithmetic "+" was replaced by "0" (addition neutral element) and aggregation function call bound by arithmetic "*" was replaced by "1" (multiplication neutral element).

The general rule, for which expression replacement is possible is as the following: the shape of groups for evaluated grouping key is independent of the value of constant we place for aggregation function call. For example, let's consider expression

e1 + avg(e2)

where e1 multiset of values is {1,1,1,2,2,3}, what generates 3 groups of size 3, 2 and 1. Let's notice that shape of groups does not depend on what constant avg(e2) will be replaced by. E.g.: if we replaced avg(e2) by constant 10, multiset {11,11,11,12,12,13} determine identical set of 3 groups of sizes 3, 2 and 1.The example shows situation the replacement is possible.

Let's now consider expression:

list[e1..min(e2)]

In this case, value of constant that would replace min(e2) might influence evaluated value of expression and shape of groups, so we want to deny using such constructions with message „grouping expression is not independent of aggregating functions”.

So, the general rule of the possibility of replacement (and subsequent designation of aggregation key) could be formulated as follows:

Replacement of aggregation function call by a constant is possible if selection of the constant value does not change groups selected by the constructed key.

Existing examples

For such database:

create (:L {a:1,b:2,c:3})
create (:L {a:2,b:3,c:1})
create (:L {a:3,b:1,c:2})

there are two following (basically equivalent) queries that give two different results.

  • match (x:L) return x.a + count() + x.b + count() + x.c as e;

  • match (x:L) return x.a + x.b + x.c + count() + count() as e;

When the option 3 would be applied result of first changes and both queries would return:

+----------------- ----+
| e                    |
+------- --------------+
| 12                   |
+----------------------+

Following queries results are the same (and intuitive) as in current neo4j implementation:

match (n) return n.weight, count(*)

match (n) return n.weight as w, 2count() as c2

match (n) return abs(n.weight) as abs, count(*) as count

In the following query, replacement mechanism would be applied with full power and originally query

match (n: Foo), (m:Bar) where n.weight = - m.weight return n.weight+m.weight+count(*) as expr , collect(n) as nc, collect(m) as mc

would have the same semantic as transformed ("explicit" group by) version:

match (n: Foo), (m:Bar) where n.weight = - m.weight with n.weight + m.weight as as nwmw , count(*) as c , collect(n) as nc, collect(m) as mc return nwmw + c as expr, nc, mc

JanekPo avatar May 19 '17 12:05 JanekPo

@Mats-SX how to proceed, i.e. how to decide on the aggregation semantics/grouping key selection for openCypher?

jmarton avatar May 29 '17 10:05 jmarton

@jmarton That is for the oCIG to decide, which will have its first meeting in June. A schedule will be broadcasted shortly (website, google groups, Slack channels); you are of course all invited.

The exact format of the decision-making process within the meeting is not set in stone. I imagine something similar to having one or several propositions, followed by a discussion, and eventually the reaching of a consensual decision.

Mats-SX avatar May 29 '17 11:05 Mats-SX

@jmarton & @JanekPo Please be advised that this CIR (and others in the general area of aggregation and grouping) will be discussed at the first oCIG meeting, as detailed here: http://www.opencypher.org/ocig. In light of this, please can the two of you be ready to present - if necessary - this CIR and your options at the meeting?

Thanks!

( @JanekPo I have added your name to the agenda, but the build process is very delayed at the moment, which is why your name is not published at the time of writing)

petraselmer avatar May 31 '17 15:05 petraselmer

This topic came up at Neo4j recently, and we had a discussion about it in our internal Cypher group.

We concluded that the Neo4j opinon on the matter is that expressions containing an aggregation should not be allowed. Instead queries should clearly separate the grouping key from the aggregation.

We thus propose that the example queries above should be written as follows:

Query 1:

MATCH (n)
RETURN n.weight, count(*)

Here the grouping key is n.weight, and the resulting count will be the number of nodes with the same weight for each distinct weight in the graph:

╒══════════╤══════════╕
│"n.weight"│"count(*)"│
╞══════════╪══════════╡
│"2"       │"2"       │
├──────────┼──────────┤
│"1"       │"2"       │
├──────────┼──────────┤
│"-2"      │"2"       │
├──────────┼──────────┤
│"0"       │"2"       │
├──────────┼──────────┤
│"-1"      │"2"       │
└──────────┴──────────┘

Query 2:

MATCH (n)
RETURN abs(n.weight) AS abs, count(*) AS count

Here the grouping key is abs(n.weight), and the resulting count will be the number of nodes with the same absolute value for weight for each distinct absolute value of weight in the graph:

╒═════╤═══════╕
│"abs"│"count"│
╞═════╪═══════╡
│"2"  │"4"    │
├─────┼───────┤
│"1"  │"4"    │
├─────┼───────┤
│"0"  │"2"    │
└─────┴───────┘

Query 3:

Variant a:

MATCH (n: Foo), (m:Bar)
WHERE n.weight = - m.weight
WITH n.weight + m.weight AS combined_weight, count(*) AS c, collect(n) AS nc, collect(m) AS mc
RETURN combined_weight + c AS expr, nc, mc

Here the grouping key is n.weight + m.weight, and the resulting count will be the number of node pairs where the combined weight sums up to the same value (which will be zero because of the WHERE):

╒══════╤══════╤══════╕
│"expr"│ "nc" │ "mc" │
╞══════╪══════╪══════╡
│ 5    │ ...  │ ...  │
└──────┴──────┴──────┘

Variant b:

MATCH (n: Foo), (m:Bar)
WHERE n.weight = - m.weight
WITH n.weight AS nw, m.weight AS mw, count(*) AS c, collect(n) AS nc, collect(m) AS mc
RETURN nw + mw + c AS expr, nc, mc

Here the grouping key is n.weight, m.weight, and the resulting count will be the number of node pairs where the tuple of weights is the same:

╒══════╤══════════════════════════╤═══════════════════════════╕
│"expr"│"nc"                      │"mc"                       │
╞══════╪══════════════════════════╪═══════════════════════════╡
│ 1    │[{"weight":"2","id":"5"}] │[{"weight":"-2","id":"11"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│ 1    │[{"weight":"1","id":"4"}] │[{"weight":"-1","id":"12"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│ 1    │[{"weight":"-1","id":"2"}]│[{"weight":"1","id":"14"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│ 1    │[{"weight":"0","id":"3"}] │[{"weight":"0","id":"13"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│ 1    │[{"weight":"-2","id":"1"}]│[{"weight":"2","id":"15"}] │
└──────┴──────────────────────────┴───────────────────────────┘

thobe avatar Jan 27 '21 09:01 thobe

We have published a paper on it in 2019: https://dl.acm.org/doi/10.1145/3315507.3330196 I was sure there is an open access to the paper, but it seems it's not. Please let me know if anyone is interested in full version.

JanekPo avatar Jan 27 '21 20:01 JanekPo

@JanekPo I would be glad to receive a copy of the full paper. Are you okay with attaching a copy in a comment to this thread? Otherwise you can find my email in my Github profile.

Mats-SX avatar Jan 28 '21 08:01 Mats-SX