neo4j.exceptions.ClientError: No write operations are allowed directly on this database. Writes must pass through the leader. The role of this server is: FOLLOWER

Open robertlagrant opened this issue 6 years ago • 12 comments

A funny one:

  • We have a 3-node causal-clustered neo4j setup
  • I've changed the routing protocol to be bolt+routing
  • We're using Neomodel with @db.transaction

We're getting intermittent errors as per the issue title, i.e. it's trying to write to a follower node, and presumably bolt+routing isn't sending the transaction to the leader. Am I missing something? If the first interaction with the database is a read, does it open the transaction on a follower node? Can I force every transaction to go to the leader?

robertlagrant avatar May 15 '18 08:05 robertlagrant

We are still getting this issue, even when forcing a write transaction.

I've created a repro case: https://github.com/robertlagrant/neo4j-cluster-failure. Please test.

robertlagrant avatar Jan 04 '19 22:01 robertlagrant

@robertlagrant Would it be possible to share a little bit more information on your cluster configuration? Is that supposed to be 3 CORE servers? There are some conditions where what you describe might be the intended behaviour at least as far as RAFT is concerned (i.e. see this). I am trying to see how much of this can be dealt with at the level of neomodel and how much of this is external to it.

aanastasiou avatar Jan 09 '19 10:01 aanastasiou

Please see https://neo4j.com/docs/ogm-manual/current/reference/ (section 3.14.1.6. Retry mechanisms).

For critical applications, these failures have to be anticipated, and also managed at the architecture or application level. Even if the driver handles some low level retries, it is not always enough in case of instability, as an application may involve complex business logic, and require coarse grained units of work.

In other words, the driver does not deal with higher level failures (such as cluster disconnects). In our use cases we have worked around this by adding custom retry logic to our business logic. See the very basic example below (adding jitter and exponential backoff is obviously highly recommended).

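import time

# session, do_write and _MAX_RETRY_SECONDS are assumed to be defined elsewhere
sts = time.time()  # start timestamp
last_exception = None  # kept outside the loop so the timeout branch can re-raise it
while True:
    cts = time.time()

    # give up once the retry budget is spent, surfacing the last failure
    if cts - sts > _MAX_RETRY_SECONDS:
        raise last_exception

    try:
        # pass the function itself; the driver calls it with a transaction
        session.write_transaction(do_write)
        break
    except Exception as e:
        last_exception = e
        time.sleep(1)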
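
The jitter and exponential backoff mentioned above could look roughly like the sketch below. It is only an illustration and not part of neomodel or the Neo4j driver: `retry_with_backoff` and all of its parameters are hypothetical names, and the default values are arbitrary assumptions. The `sleep` and `clock` arguments are injectable only so the helper can be exercised without real waiting.

```python
import random
import time


def retry_with_backoff(operation, max_seconds=30.0, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or max_seconds have elapsed.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:  # in real code, narrow this to transient errors
            if clock() - start > max_seconds:
                raise  # retry budget spent: surface the last failure
            # Exponential backoff capped at max_delay, with full jitter
            # (a random wait between 0 and the current cap).
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))
            attempt += 1
```

With the `session` and `do_write` names from the snippet above, a call might be written as `retry_with_backoff(lambda: session.write_transaction(do_write))`.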

mvanderkroon avatar Jan 09 '19 10:01 mvanderkroon

@mvanderkroon Thank you very much, sounds like a modification is required at this point (?).

aanastasiou avatar Jan 09 '19 12:01 aanastasiou

@aanastasiou I believe so. I have forked the repo, made the necessary changes and would be quite happy to issue a pull request. Should I point it to your master branch?

mvanderkroon avatar Jan 09 '19 13:01 mvanderkroon

@mvanderkroon Thank you very much and I do not see why not. It should be sent as a pull request to the main neomodel repo. All the best.

aanastasiou avatar Jan 09 '19 13:01 aanastasiou

@aanastasiou sure - it's a 3 core server cluster. There are also 2 read replicas, but they don't really feature in this situation as far as I'm aware.

robertlagrant avatar Jan 14 '19 00:01 robertlagrant

@robertlagrant Thank you for your response, I think that the discussion with @mvanderkroon on the pull request was very informative about the specifics.

aanastasiou avatar Jan 14 '19 15:01 aanastasiou

Why can't a follower accept writes?

kant111 avatar Aug 02 '19 07:08 kant111

@kant111 because that's not how Neo4J works.

robertlagrant avatar Mar 11 '20 09:03 robertlagrant

A connection URL of bolt+routing:// makes the session cluster aware, whereas bolt:// knows nothing about the other members of the cluster. However, the bolt+routing:// connection URL is only half the story. The other half is the use of session.readTransaction() and session.writeTransaction(), each of which lets you pass the Cypher to be executed.

If you send a Cypher statement through session.writeTransaction() and the connection URL is bolt+routing://, then regardless of which member you connected to, the statement will be routed to the LEADER. So if you connect to bolt+routing://<followerIP> and call session.writeTransaction(), the transaction is declared as a write and is automatically routed to the LEADER.

It is important to note that Neo4j does not parse the Cypher statement to detect whether it is a read or a write. You could therefore issue session.readTransaction("create (n:Person {id:1})"), and because it is declared as a 'readTransaction' it would be routed to a FOLLOWER and then fail, since only the LEADER can perform writes.
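
The routing rule described in this comment can be illustrated with a toy model. This is not the driver's implementation and it talks to no database: `Member`, `route` and `run` are hypothetical names, and the `cypher_writes` flag stands in for "what the statement actually does", which the routing step never looks at.

```python
from dataclasses import dataclass


@dataclass
class Member:
    address: str
    role: str  # "LEADER" or "FOLLOWER"


def route(members, access_mode):
    """Pick a member from the declared access mode only ("READ" or "WRITE")."""
    wanted = "LEADER" if access_mode == "WRITE" else "FOLLOWER"
    for member in members:
        if member.role == wanted:
            return member
    raise LookupError("no member with role " + wanted)


def run(members, access_mode, cypher_writes):
    """Route first, then let the chosen server accept or reject the statement."""
    target = route(members, access_mode)
    if cypher_writes and target.role != "LEADER":
        raise PermissionError(
            "No write operations are allowed directly on this database. "
            "The role of this server is: " + target.role
        )
    return target.address


cluster = [Member("core-0", "FOLLOWER"), Member("core-1", "LEADER"),
           Member("core-2", "FOLLOWER")]
print(run(cluster, "WRITE", cypher_writes=True))   # routed to the leader
print(run(cluster, "READ", cypher_writes=False))   # routed to a follower
```

Calling `run(cluster, "READ", cypher_writes=True)` in this model raises the error, which mirrors a write statement sent through a read transaction.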

ayoubelmimouni avatar Dec 13 '20 16:12 ayoubelmimouni

Fun fact (tested on Neo4J 4.0.7)

Adding a trigger can only be done on the node in the cluster that is the LEADER of both the DB you are adding the trigger to AND the system database (might need the neo4j DB as well, wasn't sure, but we don't use it).

The example below is me trying to add a trigger whilst connected to the node neo4j-core-2 via the bolt connector

neo4j@nextvoice> call dbms.cluster.overview();
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                                     | addresses                                                                                                                | databases                                                      | groups |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "53f95bdf-0c86-4826-8244-4ad4f7963592" | ["bolt://neo4j-core-2.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-2.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "LEADER", neo4j: "FOLLOWER", system: "FOLLOWER"}   | []     |
| "6b74a7fa-626d-4994-af32-1432b9e8b0c4" | ["bolt://neo4j-core-0.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-0.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "FOLLOWER", neo4j: "LEADER", system: "LEADER"}     | []     |
| "775b45fe-3ae3-466d-9ad2-7b8e5ae82e0b" | ["bolt://neo4j-core-1.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-1.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "FOLLOWER", neo4j: "FOLLOWER", system: "FOLLOWER"} | []     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

3 rows available after 6 ms, consumed after another 1 ms
neo4j@nextvoice> CALL apoc.trigger.add(
                 "assertExtensionNumberValidNumericalString",
                 "WITH '^([0-9]{2,5})$' AS extNumStrRegex
                 MATCH (e:Extension)
                 CALL apoc.util.validate((NOT e.number =~ extNumStrRegex), '%s not a valid extension number', [e.number])
                 RETURN NULL",
                 { phase: 'before' }
                 );
No write operations are allowed directly on this database. Writes must pass through the leader. The role of this server is: FOLLOWER

After killing a bunch of nodes and waiting for them to come back in the desired state, and now connected to neo4j-core-0 via the bolt connector:

neo4j@nextvoice> call dbms.cluster.overview();
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                                     | addresses                                                                                                                | databases                                                      | groups |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "53f95bdf-0c86-4826-8244-4ad4f7963592" | ["bolt://neo4j-core-2.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-2.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "FOLLOWER", neo4j: "FOLLOWER", system: "FOLLOWER"} | []     |
| "6b74a7fa-626d-4994-af32-1432b9e8b0c4" | ["bolt://neo4j-core-0.neo4j.default.svc.cluster.local:7687", "http://neo4j-core-0.neo4j.default.svc.cluster.local:7474"] | {nextvoice: "LEADER", neo4j: "LEADER", system: "LEADER"}       | []     |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

2 rows available after 0 ms, consumed after another 1 ms
neo4j@nextvoice> CALL apoc.trigger.add(
                 "assertExtensionNumberValidNumericalString",
                 "WITH '^([0-9]{2,5})$' AS extNumStrRegex
                 MATCH (e:Extension)
                 CALL apoc.util.validate((NOT e.number =~ extNumStrRegex), '%s not a valid extension number', [e.number])
                 RETURN NULL",
                 { phase: 'before' }
                 );
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| name                                        | query                                                                                                                                                                              | selector          | params | installed | paused |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "assertExtensionNumberValidNumericalString" | "WITH '^([0-9]{2,5})$' AS extNumStrRegex
MATCH (e:Extension)
CALL apoc.util.validate((NOT e.number =~ extNumStrRegex), '%s not a valid extension number', [e.number])
RETURN NULL" | {phase: "before"} | {}     | TRUE      | FALSE  |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

1 row available after 10 ms, consumed after another 30 ms
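
To check up front which member satisfies the condition described above, the rows of `dbms.cluster.overview()` can be filtered for a member that is LEADER of every required database. The sketch below is only an illustration: `members_leading` is a hypothetical helper, and representing each row as a dict with `id` and `databases` keys is an assumption that mirrors the columns shown in the output. The sample data is copied from the two overviews above.

```python
def members_leading(overview, databases):
    """Return the ids of members that are LEADER for every database given."""
    return [
        row["id"]
        for row in overview
        if all(row["databases"].get(db) == "LEADER" for db in databases)
    ]


# First overview: leadership is split across two members.
before = [
    {"id": "53f95bdf-0c86-4826-8244-4ad4f7963592",
     "databases": {"nextvoice": "LEADER", "neo4j": "FOLLOWER", "system": "FOLLOWER"}},
    {"id": "6b74a7fa-626d-4994-af32-1432b9e8b0c4",
     "databases": {"nextvoice": "FOLLOWER", "neo4j": "LEADER", "system": "LEADER"}},
    {"id": "775b45fe-3ae3-466d-9ad2-7b8e5ae82e0b",
     "databases": {"nextvoice": "FOLLOWER", "neo4j": "FOLLOWER", "system": "FOLLOWER"}},
]

# Second overview: one member leads everything.
after = [
    {"id": "53f95bdf-0c86-4826-8244-4ad4f7963592",
     "databases": {"nextvoice": "FOLLOWER", "neo4j": "FOLLOWER", "system": "FOLLOWER"}},
    {"id": "6b74a7fa-626d-4994-af32-1432b9e8b0c4",
     "databases": {"nextvoice": "LEADER", "neo4j": "LEADER", "system": "LEADER"}},
]

print(members_leading(before, ["nextvoice", "system"]))  # []
print(members_leading(after, ["nextvoice", "system"]))   # ['6b74a7fa-626d-4994-af32-1432b9e8b0c4']
```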

gwvandesteeg avatar Dec 01 '22 08:12 gwvandesteeg