openCypher
openCypher copied to clipboard
Grouping key selection for aggregating subqueries
CIR-2017-219
Grouping key selection for aggregating subqueries
Introduction
openCypher provides way to express aggregation queries by an implicit grouping key, and the Neo4j 3.1 documentation tells:
RETURN n, count(*)
We have two return expressions:
n
, andcount(*)
. The first,n
, is not an aggregate function, and so it will be the grouping key. The latter,count(*)
is an aggregate expression.
As long as aggregate and non-aggregate expressions are clearly separated in the RETURN/WITH expression list, i.e. each comma-separated expression is either an aggregate or a non-aggregate expression, the situation seems clear. We have seen an example demonstrating the problem, and a proposed solution (#188, #218) is to force clear separation of aggregate and non-aggregate expressions.
In this CIR, I'd like to propose a solution that fits in the current syntax and it's semantics is still clear.
Options
- grouping key is the tuple built from all variables (node, relationship, their properties or variables chained from previous subquery) that appear outside of aggregate functions of a particular WITH/RETURN clause
- each item of the expression list in WITH/RETURN are forced to contain either
(i)
no aggregate function or a(ii)
single aggregate function at the outermost level (this is the approach in #188, #218). The grouping key is the tuple built from items of type(i)
Discussion
Neither option 1 nor option 2 restricts the expressiveness of the language as the resultset of either approach can be expressed using the other approach by introducing subqueries.
Option 1: group by variables that appear outside of aggregate functions
The most notable problem with this approach is that it would change behaviour of Neo4j even when aggregating and non-aggregating return items are clearly separated.
On the other hand, grouping on the variables, i.e. the basic building blocks of an expression seems to be flexible yet still clear way in case of a complex expression.
Variables can stand for:
- nodes
- relationships
- properties of nodes or relationships
- expressions chained from previous subqueries
Option 2: group by expressions that has no aggregation
The most obvious benefit of this approach is that Neo4j currently works this way if each return item is either aggregating or non-aggregating.
However, counter-intuitive results come if mixed. Posing restrictions on creating complex by mixing aggregating and non-aggregating expressions is a safety net for beginners, but cumbersome for more complex queries.
Examples
Sample data
We initialize our database of 10 nodes and no relationships. 5 of them has a single label Foo, the other 5 has Bar, both partition having a single node for weights of -2, -1, 0, 1, 2.
unwind range(1, 5) as i
create (:Foo {id: i, weight: i-3})
;
unwind range(1, 5) as i
create (:Bar {id: 10+i, weight: i-3})
;
match (n)
return labels(n), n
order by n.id
;
Which leads to the dataset:
╒═══════════╤═════════════════════════╕
│"labels(n)"│"n" │
╞═══════════╪═════════════════════════╡
│["Foo"] │{"weight":"-2","id":"1"} │
├───────────┼─────────────────────────┤
│["Foo"] │{"weight":"-1","id":"2"} │
├───────────┼─────────────────────────┤
│["Foo"] │{"weight":"0","id":"3"} │
├───────────┼─────────────────────────┤
│["Foo"] │{"weight":"1","id":"4"} │
├───────────┼─────────────────────────┤
│["Foo"] │{"weight":"2","id":"5"} │
├───────────┼─────────────────────────┤
│["Bar"] │{"weight":"-2","id":"11"}│
├───────────┼─────────────────────────┤
│["Bar"] │{"weight":"-1","id":"12"}│
├───────────┼─────────────────────────┤
│["Bar"] │{"weight":"0","id":"13"} │
├───────────┼─────────────────────────┤
│["Bar"] │{"weight":"1","id":"14"} │
├───────────┼─────────────────────────┤
│["Bar"] │{"weight":"2","id":"15"} │
└───────────┴─────────────────────────┘
Query1a: retrieving a variable and an aggregate: a clear situation
match (n)
return n.weight, count(*)
Both approaces should return the same as the non-aggregating return item is a single variable:
╒══════════╤══════════╕
│"n.weight"│"count(*)"│
╞══════════╪══════════╡
│"2" │"2" │
├──────────┼──────────┤
│"1" │"2" │
├──────────┼──────────┤
│"-2" │"2" │
├──────────┼──────────┤
│"0" │"2" │
├──────────┼──────────┤
│"-1" │"2" │
└──────────┴──────────┘
Query1b: retrieving a variable and a computation based on aggregate
match (n)
return n.weight as w, 2*count(*) as c2
This is still clear, however, using option 2, we have to rewrite this as:
match (n)
with n.weight as w, count(*) as c
return w, 2*c as c2
Query2: aggregating and non-aggregating return items separated, retrieving result of non-aggregating calculation
match (n)
return abs(n.weight) as abs, count(*) as count
Option 2
╒═════╤═══════╕
│"abs"│"count"│
╞═════╪═══════╡
│"2" │"4" │
├─────┼───────┤
│"1" │"4" │
├─────┼───────┤
│"0" │"2" │
└─────┴───────┘
When using Option 1, this result can be expressed like
match (n)
with abs(n.weight) as abs, n
return abs, count(*) as count
Option 1
╒═════╤═══════╕
│"abs"│"count"│
╞═════╪═══════╡
│"2" │"2" │
├─────┼───────┤
│"1" │"2" │
├─────┼───────┤
│"2" │"2" │
├─────┼───────┤
│"0" │"2" │
├─────┼───────┤
│"1" │"2" │
└─────┴───────┘
When using semantics of Option 2, this result can be achieved like
match (n)
with n.weight as w, count(*) as c
return abs(w) as abs, c as count
Query 3: returning complex, single item
match (n: Foo), (m:Bar)
where n.weight = - m.weight
return n.weight+m.weight+count(*) as expr
, collect(n) as nc, collect(m) as mc
Option 2
Can't handle such a query as non-aggregating and aggregating expressions are mixed in 1st return item.
Option 1
Option 1 should return:
╒══════╤══════════════════════════╤═══════════════════════════╕
│"expr"│"nc" │"mc" │
╞══════╪══════════════════════════╪═══════════════════════════╡
│"1" │[{"weight":"2","id":"5"}] │[{"weight":"-2","id":"11"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│"1" │[{"weight":"1","id":"4"}] │[{"weight":"-1","id":"12"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│"1" │[{"weight":"-1","id":"2"}]│[{"weight":"1","id":"14"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│"1" │[{"weight":"0","id":"3"}] │[{"weight":"0","id":"13"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│"1" │[{"weight":"-2","id":"1"}]│[{"weight":"2","id":"15"}] │
└──────┴──────────────────────────┴───────────────────────────┘
This result expressed when using syntax and semantics of option 2:
match (n: Foo), (m:Bar)
where n.weight = - m.weight
with n.weight as nw, m.weight as mw
, count(*) as c
, collect(n) as nc, collect(m) as mc
return nw+mw+c as expr, nc, mc
Option 3
The solution in the option lies in replacing aggregation function call by the neutral element of the operator which binds aggregation. Replacement should be applied in mixed (aggregation and non-aggregation) expression only and if possible only. E.G. in the following expression:
e1 + count(*) + e2 * count(*) + e3
for designation of aggregation key aggregation function call should be replaced in following way:
e1 + 0 + e2 * 1 + e3
and that expression becomes an aggregation key component. In above example aggregation function call bound by arithmetic "+" was replaced by "0" (addition neutral element) and aggregation function call bound by arithmetic "*" was replaced by "1" (multiplication neutral element).
The general rule, for which expression replacement is possible is as the following: the shape of groups for evaluated grouping key is independent of the value of constant we place for aggregation function call. For example, let's consider expression
e1 + avg(e2)
where e1 multiset of values is {1,1,1,2,2,3}, what generates 3 groups of size 3, 2 and 1. Let's notice that
shape of groups does not depend on what constant avg(e2)
will be replaced by. E.g.: if we replaced avg(e2)
by constant 10, multiset {11,11,11,12,12,13} determine identical set of 3 groups of sizes 3, 2 and 1.The example shows situation the replacement is possible.
Let's now consider expression:
list[e1..min(e2)]
In this case, value of constant that would replace min(e2)
might influence evaluated value of expression and shape of groups, so we want to deny using such constructions with message „grouping expression is not independent of aggregating functions”.
So, the general rule of the possibility of replacement (and subsequent designation of aggregation key) could be formulated as follows:
Replacement of aggregation function call by a constant is possible if selection of the constant value does not change groups selected by the constructed key.
Existing examples
For such database:
create (:L {a:1,b:2,c:3})
create (:L {a:2,b:3,c:1})
create (:L {a:3,b:1,c:2})
there are two following (basically equivalent) queries that give two different results.
-
match (x:L) return x.a + count() + x.b + count() + x.c as e;
-
match (x:L) return x.a + x.b + x.c + count() + count() as e;
When the option 3 would be applied result of first changes and both queries would return:
+----------------- ----+
| e |
+------- --------------+
| 12 |
+----------------------+
Following queries results are the same (and intuitive) as in current neo4j implementation:
match (n) return n.weight, count(*)
match (n) return n.weight as w, 2count() as c2
match (n) return abs(n.weight) as abs, count(*) as count
In the following query, replacement mechanism would be applied with full power and originally query
match (n: Foo), (m:Bar) where n.weight = - m.weight return n.weight+m.weight+count(*) as expr , collect(n) as nc, collect(m) as mc
would have the same semantic as transformed ("explicit" group by) version:
match (n: Foo), (m:Bar) where n.weight = - m.weight with n.weight + m.weight as as nwmw , count(*) as c , collect(n) as nc, collect(m) as mc return nwmw + c as expr, nc, mc
@Mats-SX how to proceed, i.e. how to decide on the aggregation semantics/grouping key selection for openCypher?
@jmarton That is for the oCIG to decide, which will have its first meeting in June. A schedule will be broadcasted shortly (website, google groups, Slack channels); you are of course all invited.
The exact format of the decision-making process within the meeting is not set in stone. I imagine something similar to having one or several propositions, followed by a discussion, and eventually the reaching of a consensual decision.
@jmarton & @JanekPo Please be advised that this CIR (and others in the general area of aggregation and grouping) will be discussed at the first oCIG meeting, as detailed here: http://www.opencypher.org/ocig. In light of this, please can the two of you be ready to present - if necessary - this CIR and your options at the meeting?
Thanks!
( @JanekPo I have added your name to the agenda, but the build process is very delayed at the moment, which is why your name is not published at the time of writing)
This topic came up at Neo4j recently, and we had a discussion about it in our internal Cypher group.
We concluded that the Neo4j opinon on the matter is that expressions containing an aggregation should not be allowed. Instead queries should clearly separate the grouping key from the aggregation.
We thus propose that the example queries above should be written as follows:
Query 1:
MATCH (n)
RETURN n.weight, count(*)
Here the grouping key is n.weight
, and the resulting count will be the number of nodes with the same weight for each distinct weight in the graph:
╒══════════╤══════════╕
│"n.weight"│"count(*)"│
╞══════════╪══════════╡
│"2" │"2" │
├──────────┼──────────┤
│"1" │"2" │
├──────────┼──────────┤
│"-2" │"2" │
├──────────┼──────────┤
│"0" │"2" │
├──────────┼──────────┤
│"-1" │"2" │
└──────────┴──────────┘
Query 2:
MATCH (n)
RETURN abs(n.weight) AS abs, count(*) AS count
Here the grouping key is abs(n.weight)
, and the resulting count will be the number of nodes with the same absolute value for weight for each distinct absolute value of weight in the graph:
╒═════╤═══════╕
│"abs"│"count"│
╞═════╪═══════╡
│"2" │"4" │
├─────┼───────┤
│"1" │"4" │
├─────┼───────┤
│"0" │"2" │
└─────┴───────┘
Query 3:
Variant a:
MATCH (n: Foo), (m:Bar)
WHERE n.weight = - m.weight
WITH n.weight + m.weight AS combined_weight, count(*) AS c, collect(n) AS nc, collect(m) AS mc
RETURN combined_weight + c AS expr, nc, mc
Here the grouping key is n.weight + m.weight
, and the resulting count will be the number of node pairs where the combined weight sums up to the same value (which will be zero because of the WHERE):
╒══════╤══════╤══════╕
│"expr"│ "nc" │ "mc" │
╞══════╪══════╪══════╡
│ 5 │ ... │ ... │
└──────┴──────┴──────┘
Variant b:
MATCH (n: Foo), (m:Bar)
WHERE n.weight = - m.weight
WITH n.weight AS nw, m.weight AS mw, count(*) AS c, collect(n) AS nc, collect(m) AS mc
RETURN nw + mw + c AS expr, nc, mc
Here the grouping key is n.weight, m.weight
, and the resulting count will be the number of node pairs where the tuple of weights is the same:
╒══════╤══════════════════════════╤═══════════════════════════╕
│"expr"│"nc" │"mc" │
╞══════╪══════════════════════════╪═══════════════════════════╡
│ 1 │[{"weight":"2","id":"5"}] │[{"weight":"-2","id":"11"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│ 1 │[{"weight":"1","id":"4"}] │[{"weight":"-1","id":"12"}]│
├──────┼──────────────────────────┼───────────────────────────┤
│ 1 │[{"weight":"-1","id":"2"}]│[{"weight":"1","id":"14"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│ 1 │[{"weight":"0","id":"3"}] │[{"weight":"0","id":"13"}] │
├──────┼──────────────────────────┼───────────────────────────┤
│ 1 │[{"weight":"-2","id":"1"}]│[{"weight":"2","id":"15"}] │
└──────┴──────────────────────────┴───────────────────────────┘
We have published a paper on it in 2019: https://dl.acm.org/doi/10.1145/3315507.3330196 I was sure there is an open access to the paper, but it seems it's not. Please let me know if anyone is interested in full version.
@JanekPo I would be glad to receive a copy of the full paper. Are you okay with attaching a copy in a comment to this thread? Otherwise you can find my email in my Github profile.