openCypher
openCypher copied to clipboard
Make reachability matching default for variable length path patterns
CIR-2017-08-01
Make reachability matching default for variable length path patterns (the star in edge patterns, and generalized in RPQs. Allow all paths matching as an option for variable length path queries.
Terminology
Reachability Matching
For a pattern (x)-<VariableLengthPathPattern>-(y)
, every pair (x, y) is a result if x matches (x)
, y matches (y)
, and there exists a path (at least one) between x and y that matches the <VariableLengthPathPattern>
.
The result set is a set of all pairs (x, y) — no duplicates.
Reachability matching queries for pairs of nodes that are connected by a path confirming to a given pattern.
Reachability matching at worst produces n^2 results on an graph with n nodes and evaluation is tractable, cf. NL (complexity)
All Paths Matching
For a pattern (x)-<VariableLengthPathPattern>-(y)
, every triple (x, p, y) is a result if x matches (x)
, y matches (y)
, and p is a path between x and y and matches the <VariableLengthPathPattern>
.
The result set is a set of all triples (x, p, y) — no duplicates.
All paths matching queries for all actual paths that confirm to a given pattern.
All paths matching easily becomes intractable, cf. Arenas et al. Counting beyond a Yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard. WWW, 2012
Current Situation
Let's illustrate the current situation with a simple example on the movie graph that ships with Neo4j.
MATCH (c {name: 'Clint Eastwood'})-[*]-(x) RETURN COUNT(x)
Intuitively, most users – particularly non-expert users – would expect that the query returns the number of nodes with some connection to Clint Eastwood (reachability matching).
The Movie Graph has 171 nodes, which makes at most 171 potential nodes that could qualify and in any case a single result record.
That should be easily manageable by a graph database system.
Actually however, it is hard to see this query returning.
It is a pretty bad user experience.
An apparently super simple query on a tiny graph does not return. (MATCH (y)-[*]-(x) RETURN COUNT(*)
would be an even harsher example.)
Let's try to make the query work somehow. With an upper bound the query returns after a couple of seconds (on a standard mac book).
MATCH (c {name: 'Clint Eastwood'})-[*..11]-(x) RETURN COUNT(x)
Result is 1843134 (all paths matching).
Using DISTINCT
finally gives the expected query result in milliseconds.
MATCH (c {name: 'Clint Eastwood'})-[*..11]-(x) WITH DISTINCT x RETURN COUNT(x)
It is clear why. Now the query explicitly states (in an odd way though) that we do not care about the actual paths but only about the reachable nodes. Note, that before that we stated that implicitly already. The query never really asked for paths. No path variable was ever given. (BTW: The query with DISTINCT and without bounds does not return quickly.)
Problem
All paths matching is a very sensitive thing. It can quickly become super expensive for topology reasons a user likely is not aware of. Hence, variable length path matching — as currently designed in openCypher, with all paths matching as default — makes it very simple for the user to accidentally write queries that are super expensive and never return even on tiny graphs.
Good language design should do the opposite. It should limit the use to cheap query that return in reasonable time. It should allow the user to run also expensive operations if the user explicitly asks for them. Super expensive operations — such as all paths matching is on most graphs — should be opt-in for the user. In openCypher it is opt-out.
On top of that, the opt out mechanism with the DISTINCT
does not working perfectly somehow, otherwise the query without bounds should also return in reasonable time.
The solution with DISTINCT
is a very indirect way to express that we do not care about the actual path but just about the reachability.
Likely, the system gets it wrong in some cases.
It is not the system that is to blame for that.
The query language design should simply be better and should not force the database system in such guessing about the user’s intention.
Request
Variable length path patterns (x)-<VariableLengthPathPattern>-(y)
should by default query for reachability, i.e. have reachability matching as their default semantics.
Only if the user explicitly asks for the system should switch to a more complex semantics.
This could be designed in a two-step escalation.
- Only if the user assigns a path variable
p=( (x)-<VariableLengthPathPattern>-(y) )
, a single simple shortest path is bound to that variable (hop distance, non-defined choice between equal length shortest paths). - Only if the user assigns a path variable and explicitly asks for all paths e.g.
ALL p=(x)- <VariableLengthPathPattern>-(y)
(syntax just for illustration), all paths should be returned. Here the potentially super expensive all paths matching is opt-in, so it is unlikely to be used accidentally.
This CIR argues, that the default should be the significantly less dangerous reachability matching than the current all paths matching. Give the pain only to the ones who ask for and save the innocent. That is the idea.
Possible Transition
A drawback of this proposal is, that it changing the default matching semantics of the start *
in edge patterns probably break existing user queries on (sub)graphs there the intractability of all paths matching is not a problem, e.g. trees.
With the considered integration of RPQs into openCypher, the star in edge patterns becomes a redundant functionality anyway. Hence, the following may be a possible transition: Design RPQs with reachability matching as default and all paths matching as opt in. Mark the star in edge patterns as deprecated to discourage users from using it and rewriting their queries to RPQs.
Not In Scope
This CIR is only concerned with variable length path matching (the star, and generalized RPQs).
Fixed-length pattern (aka conjunctive queries) are fine as they are.
Changing (a)--()--(b)
to reachability matching between (a)
and (b)
is likely not intuitive to most users.
But more importantly, the user cannot run so easily into complexity pit falls with conjunctive queries.
Thank you for this CIR, very interesting read and clearly explained. I have one question: do you think it is an important feature that the evidence path bound if a path variable is given is a shortest path?
Thank you for the CIR, it's an interesting proposal and the issue has come up at least for first-time users. To address this, we'd need to come up with a suitable migration strategy though for existing queries.