openCypher icon indicating copy to clipboard operation
openCypher copied to clipboard

Resolving MERGE being dependant on evaluation order

Open thobe opened this issue 7 years ago • 11 comments

CIR-2018-296

Currently, in Cypher 9, the result of a MERGE statement depends on the evaluation order - i.e. the order of rows coming in to the MERGE statement (through the "driving table" of the MERGE statement). This is an undesirable situation, because it means that formally defining the semantics of MERGE becomes more involved. It it also undesirable because in order to know the outcome the driving table needs to be ordered, which makes queries more complex and more expensive.

It would be beneficial if we could improve the semantics of MERGE so that the behaviour does not depend on the evaluation order of the driving table.

Any changes to the semantics of MERGE need to take care not to lose the ability of solving the cases MERGE was designed to solve:

  • Match or create a node such that there is only a single node in the graph that matches the given pattern.
  • Match or create a relationship between two nodes, such that there is only a single matching relationship between the two nodes.

MERGE extends these functionalities needed to support these use cases to larger patterns than single entities, and it is often - but not only - there that the dependency on evaluation order becomes apparent.

Example 1:

MERGE (a)-[:KNOWS]-(b) with driving table:

a b
Node[1] Node[2]
Node[2] Node[1]

The direction of the created relationship is currently defined to be "left to right" (from a to b in this case) for the first encountered row. While this will match the pattern regardless of which node is a and which is b, the result of the clause depends on the order of the driving table. With the order stated above, a relationship will be created from Node[1] to Node[2], but if the order had been the opposite a relationship from Node[2] to Node[1] would be created.

Example 2:

Given graph created by CREATE (:A {num:1})-[:R]->(:A {num:2}) Execute query MATCH (n:A) MERGE (n)-[:R]->()-[:R]->()

If n matches (:A{num:1}) first and (:A{num:2}) second, the result will be that MERGE will create the pattern for both input rows. If however n first matches (:A{num:2}) and (:A{num:1}) second, the creation of the pattern starting from (:A{num:2}) will create enough nodes and relationships for the pattern to match when n is (:A{num:1}).

Proposed solution

I first heard of the basics of this solution from the group at University of Edinburgh, working on defining the semantics of Cypher.

The basic idea is to not allow MERGE to read its own writes. So instead of MERGE directly creating entities, one can think of it as only making note of what to create (the rows for which there are no matches), and then once the entire driving table has been seen the creations are performed.

If that is all that is done, MERGE would not solve the initial use cases, since for example MERGE (:X{y:z}) would create more than one node given a driving table that contains z=3 more than once. The use case MERGE intends to solve is for such cases to only create a single node for all such rows.

In order to address this as well, I propose that the creations during the create-phase of MERGE would perform idempotent creations, such that creations with the same inputs would all be collapsed to the same entities.

If we consider Example 2 above under these semantics, we would see the pattern created for both nodes under any evaluation order.

In order to ensure only a single relationship being created in Example 1, and to ensure that the direction of that relationship is the same under any evaluation order, I further propose that the creation direction of an undirected pattern in MERGE be based on the order of the nodes (as in the order ORDER BY on a column containing the nodes bound to a and b would produce) instead of the lexical order of the pattern.

ON MATCH and ON CREATE

Still remaining as dependant on evaluation order is the ON MATCH and ON CREATE subclauses of MERGE. The redefinition of MERGE as acting in two phases: one matching or marking for creation, and one performing the creations with the creations based on equal inputs being merged to create the same entities, means that multiple rows now get evaluated as having created the entity, even if it is the same entity for those rows. The ON CREATE branch would thus be taken for multiple rows that create the same entity. This might seem confusing, but actually has some benefit since it means that the ON MATCH and ON CREATE are no longer as dependant on evaluation order.

thobe avatar Feb 01 '18 22:02 thobe

This improvement request might be a bit more involved than a typical CIR, by specifying an actual proposal. But since these thoughts are not on a mature enough level of being able to write a full improvement proposal yet, I thought this was an appropriate start.

thobe avatar Feb 01 '18 22:02 thobe

There is a further question relating to ON MATCH and ON CREATE - should they be allowed to read the writes of other ON MATCH and ON CREATE rows. There might be some use case for that. Such as MERGE (n:X{y:z}) ON MATCH n.m = coalesce(n.m,0)+1 ON CREATE n.c = coalesce(n.c,0)+1.

This needs to be resolved, but I think the above proposal might be a good start.

thobe avatar Feb 01 '18 22:02 thobe

Thanks for the very clear write up. This is a good start.

Regarding Example 1, this feels like it could also be resolved by removing the other source of nondeterminism and insisting that anydirected relationships are always created deterministically (e.g. from smaller to larger node id).

Regarding the collapsing behavior of MERGE suggested here, I'd like to understand better to which degree this still allows streaming processing, especially if created items are bound.

Imagine

MATCH (n) MERGE (n)-[:X]->(m {prop: n.prop})

I think this can still be processed in a streaming fashion (when we decide to create an m we can already determine the node id and bind it, even though this is not visible inside the merge, and even though later records might still reach the same conclusion) but it would probably be helpful to think through more cases and also better understand the implied memory pressure.

boggle avatar Feb 01 '18 22:02 boggle

@boggle

Regarding Example 1, this feels like it could also be resolved by removing the other source of nondeterminism and insisting that anydirected relationships are always created deterministically (e.g. from smaller to larger node id).

Yes, that is exactly what I propose for solving that case, read the paragraph that starts with In order to ensure only a single relationship being created in Example 1.

Regarding the collapsing behavior of MERGE suggested here, I'd like to understand better to which degree this still allows streaming processing, especially if created items are bound.

Imagine

MATCH (n) MERGE (n)-[:X]->(m {prop: n.prop})

I think this can still be processed in a streaming fashion (when we decide to create an m we can already determine the node id and bind it, even though this is not visible inside the merge, and even though later records might still reach the same conclusion) but it would probably be helpful to think through more cases and also better understand the implied memory pressure.

I think so too. The cases where an implementation cannot stream data through MERGE would be the same as for other write operators: When read clauses that come later might read the data written by MERGE, in order for that to work all writes need to be performed before the first read can start. That is what I like the most about this proposal: it makes MERGE the same as all other writing clauses in every way (except of course this one thing that makes it special of "collapsing" equivalent writes).

thobe avatar Feb 02 '18 09:02 thobe

That makes me wonder if it might be (conceptually possible) to express MERGE using subquery operators like this:

MATCH (a)
MATCH {
  MATCH {
    MATCH (a)-[-[:X]->(m {prop: n.prop})
    RETURN n, m
  }
  OTHERWISE // query-level xor that has been discussed in the past
  MATCH {
    CREATE (a)-[:X]->(m {:prop: n.prop}
    RETURN n, m
  }
}

This shows where the problem is: It still would create duplicates and the only way to emulate MERGE that I could see would be a graph-level squashing operation of similarly looking entities. Event that would still not be the same, giving a real argument why MERGE is a core feature.

boggle avatar Feb 02 '18 09:02 boggle

In the Cypher 9 draft (still not merged) we decided to not include ON CREATE and ON MATCH. In the light of that, perhaps it is useful for this discussion to not consider them.

Mats-SX avatar Feb 02 '18 12:02 Mats-SX

Wasn't the main reason for that the order dependence? If we can fix that, wouldn't it be great to make them a proper part of the language?

boggle avatar Feb 04 '18 09:02 boggle

It might be orthogonal to this debate, but these subclauses energise another source of non-determinism in Cypher 3.3 queries; those stemming from SET:

Consider this scenario:

Given an empty graph
  And having executed: CREATE (:A {p: 0}), (:A {p: 1})
When executing query: 
  """
  MATCH (a:A) 
  MERGE (b:B) 
    ON CREATE SET b.value = a.p 
  RETURN b.value
  """
Then the result should be:
  | b.value  |
  | ??? |
  | ??? |

Classic race condition: will the created :B node have a property value of 0 or 1? The row that creates will decide. How would this be handled idempotently?

One idea could be not not allow dynamic (row-dependent) writes in the subclauses, but it seems like a very limiting factor.

Mats-SX avatar Feb 05 '18 08:02 Mats-SX

@Mats-SX it would indeed be interesting not to allow it - fail with an error saying ON CREATE evaluated more than one time for the same pattern with different outcome.

But you are right that it is orthogonal to MERGE, since the following illustrates a similar problem:

Given an empty graph
  And having executed: CREATE (:A{x:1}), (:A{x:2}),(:B)
When executing query:
  """
  MATCH (a:A), (b:B)
  SET b.x = a.x
  """
The result of executing the query:
  """
  MATCH (b:B)
  RETURN b.x
  """
Should be:
  | b.x |
  | ??? |

thobe avatar Feb 06 '18 13:02 thobe

There is an opportunity with @thobe's proposal to fix a current asymmetry.

CREATE currently demands that the direction of a relationship be specified. The fix suggested here causes a "consistently arbitrary" selection of direction to be made, when the implied CREATE within MERGE is based on a pattern that is "direction-neutral" or "direction-indifferent". (The selection of direction is necessary because Cypher does not currently support a mixed graph data model, and does not have directionless | undirected | bidirectional relationships).

The same approach could be taken in the CREATE clause, allowing a direction-neutral pattern (with no arrow-head to indicate direction) to operate successfully. This would allow statements like

CREATE (:Person {name:'Alastair'})-[:KNOWS]-(:Person {name:'Tobias'})

which currently fail in the Neo4j implementation of Cypher 9 with the following error:

Only directed relationships are supported in CREATE

This behaviour would be forward-compatible with a future extension to support undirected/bidirected relationships, in which case an implementation would no longer have to select an arbitrary direction, but the syntax of such a query would not have to alter.

alastai avatar Feb 08 '18 08:02 alastai

The ambiguity sensitive mode has been switched on in Cypher.PL executable specification. As the source of ambiguity in openCyper is generally reducing semantics of unordered set (of matches) to semantics of ordered list, now in Cypher.PL set of matches has no order.

While examining the correctness of the mode new class of ambiguity has been discovered. The class contains of writing queries with pattern comprehension expression on "right" side, which is mixing reading and writing in a single clause (similarly to MERGE, where read/write mixing is in definition clause). This causes race of reading and writing actions with result dependent on the order in driving table. We have two analogous examples for SET and CREATE clauses (where maximum function returns the greatest value on list).

Sample for SET clause:

create (a {num: 1})-[r1:T]->(b {num : 2})-[r2:T]->(c {num : 3})  // preparing database
 match (x)-[z]->(y) // calculating driving table 
 set x.num = maximum([(k)-->(l) | l.num]) + 10 // clause with both reading and writing
 return x.num as xnum  //ambiugous result 
+--------+
|  xnum  |
+--------+
|   23   |
+--------+
|   13   |
+--------+
+--------+
|  xnum  |
+--------+
|   13   |
+--------+
|   13   |
+--------+

The same in case of CREATE clause:

create (a {num: 1})-[r1:T]->(b {num : 2})<-[r2:T]-(c {num : 3}) 
match (x)<-[z]-(y) 
create (y)<-[:T]-(w {num: maximum([(k)-->(l) | l.num]) + 10}) 
return w.num as wnum
+--------+
|  wnum  |
+--------+
|   12   |
+--------+
|   13   |
+--------+
+--------+
|  wnum  |
+--------+
|   12   |
+--------+
|   12   |
+--------+

You can test your examples in Cypher.PL console console info Please note, that due to ambiguity sensitivity the console is now stateless - does not implement Schrödinger's database. We have plans to examine all TCK queries for being ambiguous to detect other classes of ambiguity.

JanekPo avatar Apr 11 '18 08:04 JanekPo