openCypher
openCypher copied to clipboard
Make OPTIONAL MATCH and WHERE interaction easier to understand
CIR-2017-191
The fact that WHERE does not filter rows when part of an OPTIONAL MATCH has confused many users. Semantically, OPTIONAL MATCH behaves like an outer join, and will only ever be additive (>= 0) to the cardinality of the preceding query. Consider the below example query, and note the comment:
MATCH (a)
OPTIONAL MATCH (a)-->(b)
WHERE false // does not filter, only makes b null
RETURN b
In contrast, moving the WHERE subclause to be attached to the MATCH clause, the query will return an empty result:
MATCH (a)
WHERE false // filters everything
OPTIONAL MATCH (a)-->(b)
RETURN b
It has not been clear to users that WHERE is intimately attached to its preceding clause, and that it only affects the cardinality of that clause, rather than the whole query (although this difference does not exist in the cases of WITH and MATCH). This CIR requests some way of making this more clear to the user.
Idea 1
Prohibit WHERE subclauses that do not depend on variables bound by its superclause. This would make the example queries in the description of this CIR invalid.
This idea could be 'soft', and allow predicate expressions that partially depend on the bound variables:
MATCH (a)
WHERE a.prop AND false
...
Or it could be 'strict', and only allow predicate expressions without constant parts.
Idea 2
Replace the OPTIONAL MATCH clause with a subquery clause; OPTIONAL. The first query from the description would be expressed like so:
MATCH (a)
OPTIONAL { // explicitly states the semantics of its sub-part
MATCH (a)-->(b)
WHERE false // normal filtering
RETURN b
}
RETURN b
Ideas 1 and 2 are not mutually exclusive.
+1 on this CIR. Although Idea 1 has the advantage of being far less drastic in terms of existing usage, I must say the synergy of Idea 2 and subqueries is very compelling indeed...
+1 for Idea 2, I like the clarity this provides.
I would add Idea 0: "Take no action" - as the semantic of WHERE tied to OPTIONAL MATCH is strict (and analogous to SQL outer join). Furthermore, everything might change after adding Regular Path Queries to Cypher (https://github.com/opencypher/openCypher/issues/179)
I stumbled upon this problem recently and just found this issue. I ran a couple of experiments to better understand the semantics of OPTIONAL MATCH.
I used the following toy graph:
CREATE
(:X {val: 1})-[:E1]->(:Y {val: 2})-[:E2]->(:Z {val: 3}),
(:X {val: 4})-[:E1]->(:Y {val: 5}),
(:X {val: 6})
Matching on a single edge is fairly simple:
MATCH (x:X)
OPTIONAL MATCH (x)-[:E1]->(y:Y)
WHERE x.val < y.val
RETURN x, y
╒═════════╤═════════╕
│"x" │"y" │
╞═════════╪═════════╡
│{"val":1}│{"val":2}│
├─────────┼─────────┤
│{"val":4}│{"val":5}│
├─────────┼─────────┤
│{"val":6}│null │
└─────────┴─────────┘

Things get more complicated with two edges, especially for the query plan, which contains an Apply and an Optional operator.
MATCH (x:X)
OPTIONAL MATCH (x)-[:E1]->(y:Y)-[:E2]->(z:Z)
WHERE x.val < z.val
RETURN x, y, z
╒═════════╤═════════╤═════════╕
│"x" │"y" │"z" │
╞═════════╪═════════╪═════════╡
│{"val":1}│{"val":2}│{"val":3}│
├─────────┼─────────┼─────────┤
│{"val":4}│null │null │
├─────────┼─────────┼─────────┤
│{"val":6}│null │null │
└─────────┴─────────┴─────────┘

Splitting the OPTIONAL MATCH into two clauses results in different semantics:
MATCH (x:X)
OPTIONAL MATCH (x)-[:E1]->(y:Y)
OPTIONAL MATCH (y)-[:E2]->(z:Z)
WHERE x.val < z.val
RETURN x, y, z
╒═════════╤═════════╤═════════╕
│"x" │"y" │"z" │
╞═════════╪═════════╪═════════╡
│{"val":1}│{"val":2}│{"val":3}│
├─────────┼─────────┼─────────┤
│{"val":4}│{"val":5}│null │
├─────────┼─────────┼─────────┤
│{"val":6}│null │null │
└─────────┴─────────┴─────────┘

Note that the TCK currently does not fully cover this issue. The only that have two edges in an OPTIONAL MATCH are the following - note that none have a WHERE clause:
$ ag 'OPTIONAL MATCH.*\(.*\(.*\(' -C 2
MatchAcceptance2.feature
316- """
317- MATCH (a {name: 'A'})
318: OPTIONAL MATCH (a)-[:KNOWS]->()-[:KNOWS]->(foo)
319- RETURN foo
320- """
--
354- """
355- MATCH (a {name: 'A'})
356: OPTIONAL MATCH p = (a)-->(b)-[*]->(c)
357- RETURN p
358- """
OptionalMatchAcceptance.feature
258- """
259- MATCH (a:Single), (c:C)
260: OPTIONAL MATCH (a)-->(b)-->(c)
261- RETURN b
262- """
--
270- """
271- MATCH (a:A), (c:C)
272: OPTIONAL MATCH (a)-->(b)-->(c)
273- RETURN b
274- """
We are happy to write some additional TCK tests if you agree that they'd be useful. The queries above might provide a good starting point for that.
Update: I checked the SIGMOD 2018 Cypher paper on the matter:

This, as expected, captures the semantics of the language correctly, but is rather difficult (but not impossible) to translate to standard relational algebra.
@szarnyasg Thanks for raising this, and please do write the additional TCK tests for this (much appreciated).
@szarnyasg Interesting analysis, and yes we would of course be interested in such a TCK contribution (as Petra already mentions).
You write that the TCK 'does not fully cover this issue' -- I wonder to which issue you are referring? The general issue to which this CIR is dedicated (OPT MATCH + WHERE giving allegedly surprising behaviour), or another issue in particular? Are you perhaps looking for being able to express predicates covering multiple OPTIONAL MATCH clauses, such as in your last query?
@Mats-SX I was referring to the OPTIONAL MATCH + WHERE constellation (which is the focus of this CIR). This code search reveals that there are a few tests for multiple OPTIONAL MATCH clauses:
$ ag 'OPTIONAL MATCH.*\n.*OPTIONAL MATCH' -C 2
MatchAcceptance2.feature
1646- """
1647- MATCH (a:A)
1648: OPTIONAL MATCH (a)-[:FOO]->(b:B)
1649: OPTIONAL MATCH (b)<-[:BAR*]-(c:B)
1650- RETURN a, b, c
1651- """
OptionalMatchAcceptance.feature
97- """
98- MATCH (a:Single)
99: OPTIONAL MATCH (a)-->(b:NonExistent)
100: OPTIONAL MATCH (a)-->(c:NonExistent)
101- WITH coalesce(b, c) AS x
102- MATCH (x)-->(d)
--
282- """
283- MATCH (a:A), (b:B)
284: OPTIONAL MATCH (a)-->(x)
285: OPTIONAL MATCH (x)-[r]->(b)
286- RETURN x, r
287- """
--
309- When executing query:
310- """
311: OPTIONAL MATCH (a:NotThere)
312: OPTIONAL MATCH (b:NotThere)
313- WITH a, b
314- OPTIONAL MATCH (b)-[r:NOR_THIS]->(a)
--
329- When executing query:
330- """
331: OPTIONAL MATCH (f:DoesExist)
332: OPTIONAL MATCH (n:DoesNotExist)
333- RETURN collect(DISTINCT n.property) AS a, collect(DISTINCT f.property) AS b
334- """
So the "issue" is the low level of TCK coverage for the feature? Then I understand.
I think our "issue" was that OPTIONAL MATCH + WHERE is unintuitive at first sight to many users, and the TCK does not cover some important sub-cases (such as OPTIONAL MATCH on two edges + WHERE). In the absence of these tests, implementations might get this wrong - which is exactly what happened to us when working on similar queries in ingraph.
I have checked queries you had posted - the results are expectedly accordant with Cypher's semantic, because of match clause definitions corresponding to 1-4 formulas that you had pasted:
eval_clause(Match,Graph,BindingTable,MatchBindingTable,Graph) :-
Match = match(MatchModifier,Pattern,Where),
findall(MatchBindingRow,
(
member(BindingRow,BindingTable),
eval_match_(MatchModifier,Graph,Pattern,Where,BindingRow,MatchBindingRow)
),
MatchBindingTable).
eval_match_(no_modifier,Graph,Pattern,Where,BindingRow,MatchBindingRow)
:-
eval_match(BindingRow,Graph,Pattern,MatchBindingRow),
eval_where(Graph,Where,MatchBindingRow).
eval_match_(optional,Graph,Pattern,Where,BindingRow,MatchBindingRow)
:-
findall(MatchBindingRow_,
(
eval_match(BindingRow,Graph,Pattern,MatchBindingRow_),
eval_where(Graph,Where,MatchBindingRow_)
),
MatchBindingRows_),
MatchBindingRows_ = [_|_],
!,
member(MatchBindingRow,MatchBindingRows_).
eval_match_(optional,_Graph,Pattern,_Where,BindingRow,MatchBindingRow)
:-
get_names(Pattern,Names),
list_to_set(Names,NamesSet),
maplist([Name,bind(Name,cypher_null)]>>true,NamesSet,Binds),
extend_binding_row(BindingRow,Binds,MatchBindingRow).
If you wish you can verify your respective query results in Cypher.PL Please be aware that - due to ambiguity handling - Cypher.PL console is currently running in stateless mode, it means that graph state is not preserved between query execution. Thus you have to create graph state in the same query e.g.
CREATE //graph state creation
(:X {val: 1})-[:E1]->(:Y {val: 2})-[:E2]->(:Z {val: 3}),
(:X {val: 4})-[:E1]->(:Y {val: 5}),
(:X {val: 6})
with count(*) as unused_variable_
MATCH (x:X) //tested query
OPTIONAL MATCH (x)-[:E1]->(y:Y)
WHERE x.val < y.val
RETURN x, y
Cypher.PL console is available at:
ssh [email protected] #password: cypherpl
In case of any incompatibilities with your results please let me know.
Thanks, @JanekPo for investigating this with your implementation. Neo4j returns the correct results indeed, but it is worth trying to make the interaction of OPTIONAL MATCH and WHERE more intuitive (and thus making the language more accessible). Simultaneously, the TCK should document such complex interactions of language constructs, which I tried to address with #333.