kubernetes-neo4j

ERROR Failed to start Neo4j

Open mabushey opened this issue 6 years ago • 13 comments

Running in the neo4j namespace, and using storageClassName: rook-ceph-block:

$ kubectl -n neo4j get pvc
NAME                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
datadir-neo4j-core-0   Bound    pvc-0912d181-d8a7-11e8-926a-069c6b71a13e   25Gi       RWO            rook-ceph-block   18m
datadir-neo4j-core-1   Bound    pvc-119a3765-d8a7-11e8-926a-069c6b71a13e   25Gi       RWO            rook-ceph-block   18m
datadir-neo4j-core-2   Bound    pvc-1da11b36-d8a7-11e8-926a-069c6b71a13e   25Gi       RWO            rook-ceph-block   18m
$ kubectl -n neo4j get pods
NAME           READY   STATUS    RESTARTS   AGE
neo4j-core-0   2/2     Running   2          12m
neo4j-core-1   2/2     Running   2          12m
neo4j-core-2   2/2     Running   2          12m
Starting Neo4j.
2018-10-25 22:53:15.784+0000 INFO  ======== Neo4j 3.3.6 ========
2018-10-25 22:53:15.817+0000 INFO  Starting...
2018-10-25 22:53:17.325+0000 INFO  Bolt enabled on 0.0.0.0:7687.
2018-10-25 22:53:17.335+0000 INFO  Initiating metrics...
2018-10-25 22:53:17.493+0000 INFO  Resolved initial host 'neo4j.default.svc.cluster.local:5000' to []
2018-10-25 22:53:17.521+0000 INFO  My connection info: [
        Discovery:   listen=0.0.0.0:5000, advertised=neo4j-core-2.neo4j.neo4j.svc.cluster.local:5000,
        Transaction: listen=0.0.0.0:6000, advertised=neo4j-core-2.neo4j.neo4j.svc.cluster.local:6000, 
        Raft:        listen=0.0.0.0:7000, advertised=neo4j-core-2.neo4j.neo4j.svc.cluster.local:7000, 
        Client Connector Addresses: bolt://neo4j-core-2.neo4j.neo4j.svc.cluster.local:7687,http://neo4j-core-2.neo4j.neo4j.svc.cluster.local:7474,https://neo4j-core-2.neo4j.neo4j.svc.cluster.local:7473
]
2018-10-25 22:53:17.522+0000 INFO  Discovering cluster with initial members: [neo4j.default.svc.cluster.local:5000]
2018-10-25 22:53:17.522+0000 INFO  Attempting to connect to the other cluster members before continuing...
2018-10-25 22:58:49.904+0000 ERROR Failed to start Neo4j: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@53499d85' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.". Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@53499d85' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.".
org.neo4j.server.ServerStartupException: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@53499d85' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.".
        at org.neo4j.server.exception.ServerStartupErrors.translateToServerStartupError(ServerStartupErrors.java:68)
        at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:220)
        at org.neo4j.server.ServerBootstrapper.start(ServerBootstrapper.java:111)
        at org.neo4j.server.ServerBootstrapper.start(ServerBootstrapper.java:79)
        at com.neo4j.server.enterprise.CommercialEntryPoint.main(CommercialEntryPoint.java:22)
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.server.database.LifecycleManagingDatabase@53499d85' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.".
        at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:466)
        at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:107)
        at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:212)
        ... 3 more
Caused by: java.lang.RuntimeException: Error starting org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory, /var/lib/neo4j/data/databases/graph.db
        at org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory.initFacade(GraphDatabaseFacadeFactory.java:211)
        at com.neo4j.causalclustering.core.CommercialCoreGraphDatabase.<init>(CommercialCoreGraphDatabase.java:35)
        at com.neo4j.causalclustering.core.CommercialCoreGraphDatabase.<init>(CommercialCoreGraphDatabase.java:26)
        at com.neo4j.server.enterprise.CommercialNeoServer.lambda$static$0(CommercialNeoServer.java:29)
        at org.neo4j.server.database.LifecycleManagingDatabase.start(LifecycleManagingDatabase.java:88)
        at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:445)
        ... 5 more
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.causalclustering.core.state.CoreLife@56e07a08' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.".
        at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:466)
        at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:107)
        at org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory.initFacade(GraphDatabaseFacadeFactory.java:207)
        ... 10 more
Caused by: java.util.concurrent.TimeoutException: Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.
        at org.neo4j.causalclustering.identity.ClusterBinder.bindToCluster(ClusterBinder.java:110)
        at org.neo4j.causalclustering.core.state.CoreLife.start0(CoreLife.java:70)
        at org.neo4j.kernel.lifecycle.SafeLifecycle.transition(SafeLifecycle.java:124)
        at org.neo4j.kernel.lifecycle.SafeLifecycle.start(SafeLifecycle.java:138)
        at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:445)
        ... 12 more
2018-10-25 22:58:49.910+0000 INFO  Neo4j Server shutdown initiated by request

mabushey avatar Oct 25 '18 23:10 mabushey

Resolved initial host 'neo4j.default.svc.cluster.local:5000' to [] appears to be the problem; it should be 'neo4j.neo4j.svc.cluster.local:5000'.

Changed: value: "neo4j.default.svc.cluster.local:5000" -> value: "neo4j.neo4j.svc.cluster.local:5000"

2018-10-25 23:06:00.908+0000 INFO Resolved initial host 'neo4j.neo4j.svc.cluster.local:5000' to [100.96.5.28:5000, 100.96.4.41:5000, 100.96.2.24:5000]

mabushey avatar Oct 25 '18 23:10 mabushey

Still a no go:

Starting Neo4j.
2018-10-25 23:05:59.322+0000 INFO  ======== Neo4j 3.3.6 ========
2018-10-25 23:05:59.355+0000 INFO  Starting...
2018-10-25 23:06:00.763+0000 INFO  Bolt enabled on 0.0.0.0:7687.
2018-10-25 23:06:00.772+0000 INFO  Initiating metrics...
2018-10-25 23:06:00.908+0000 INFO  Resolved initial host 'neo4j.neo4j.svc.cluster.local:5000' to [100.96.5.28:5000, 100.96.4.41:5000, 100.96.2.24:5000]
2018-10-25 23:06:00.935+0000 INFO  My connection info: [
        Discovery:   listen=0.0.0.0:5000, advertised=neo4j-core-2.neo4j.neo4j.svc.cluster.local:5000,
        Transaction: listen=0.0.0.0:6000, advertised=neo4j-core-2.neo4j.neo4j.svc.cluster.local:6000, 
        Raft:        listen=0.0.0.0:7000, advertised=neo4j-core-2.neo4j.neo4j.svc.cluster.local:7000, 
        Client Connector Addresses: bolt://neo4j-core-2.neo4j.neo4j.svc.cluster.local:7687,http://neo4j-core-2.neo4j.neo4j.svc.cluster.local:7474,https://neo4j-core-2.neo4j.neo4j.svc.cluster.local:7473
]
2018-10-25 23:06:00.936+0000 INFO  Discovering cluster with initial members: [neo4j.neo4j.svc.cluster.local:5000]
2018-10-25 23:06:00.936+0000 INFO  Attempting to connect to the other cluster members before continuing...
2018-10-25 23:11:33.326+0000 ERROR Failed to start Neo4j: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@53499d85' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.". Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@53499d85' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.".
org.neo4j.server.ServerStartupException: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@53499d85' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.".
        at org.neo4j.server.exception.ServerStartupErrors.translateToServerStartupError(ServerStartupErrors.java:68)
        at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:220)
        at org.neo4j.server.ServerBootstrapper.start(ServerBootstrapper.java:111)
        at org.neo4j.server.ServerBootstrapper.start(ServerBootstrapper.java:79)
        at com.neo4j.server.enterprise.CommercialEntryPoint.main(CommercialEntryPoint.java:22)
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.server.database.LifecycleManagingDatabase@53499d85' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.".
        at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:466)
        at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:107)
        at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:212)
        ... 3 more
Caused by: java.lang.RuntimeException: Error starting org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory, /var/lib/neo4j/data/databases/graph.db
        at org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory.initFacade(GraphDatabaseFacadeFactory.java:211)
        at com.neo4j.causalclustering.core.CommercialCoreGraphDatabase.<init>(CommercialCoreGraphDatabase.java:35)
        at com.neo4j.causalclustering.core.CommercialCoreGraphDatabase.<init>(CommercialCoreGraphDatabase.java:26)
        at com.neo4j.server.enterprise.CommercialNeoServer.lambda$static$0(CommercialNeoServer.java:29)
        at org.neo4j.server.database.LifecycleManagingDatabase.start(LifecycleManagingDatabase.java:88)
        at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:445)
        ... 5 more
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.causalclustering.core.state.CoreLife@56e07a08' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.".
        at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:466)
        at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:107)
        at org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory.initFacade(GraphDatabaseFacadeFactory.java:207)
        ... 10 more
Caused by: java.util.concurrent.TimeoutException: Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.
        at org.neo4j.causalclustering.identity.ClusterBinder.bindToCluster(ClusterBinder.java:110)
        at org.neo4j.causalclustering.core.state.CoreLife.start0(CoreLife.java:70)
        at org.neo4j.kernel.lifecycle.SafeLifecycle.transition(SafeLifecycle.java:124)
        at org.neo4j.kernel.lifecycle.SafeLifecycle.start(SafeLifecycle.java:138)
        at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:445)
        ... 12 more
2018-10-25 23:11:33.332+0000 INFO  Neo4j Server shutdown initiated by request

mabushey avatar Oct 25 '18 23:10 mabushey

This is mostly a duplicate of #7; however, I have no clue what or where this refers to: "I did forget to replace `neo4j-core-0.neo4j.default.svc.cluster.local` by `neo4j-core-0.neo4j.neo4j.svc.cluster.local`".

Line 28 of statefulset.yaml contains value: "neo4j.default.svc.cluster.local:5000", and I replaced the default namespace with neo4j.

mabushey avatar Oct 25 '18 23:10 mabushey

2018-10-25 23:22:38.654+0000 INFO  ======== Neo4j 3.3.6 ========
2018-10-25 23:22:38.687+0000 INFO  Starting...
2018-10-25 23:22:40.154+0000 INFO  Bolt enabled on 0.0.0.0:7687.
2018-10-25 23:22:40.164+0000 INFO  Initiating metrics...
2018-10-25 23:22:40.291+0000 INFO  Resolved initial host 'neo4j-core-0.neo4j.neo4j.svc.cluster.local:5000' to [100.96.4.43:5000]
2018-10-25 23:22:40.317+0000 INFO  My connection info: [
        Discovery:   listen=0.0.0.0:5000, advertised=neo4j-core-2.neo4j.neo4j.svc.cluster.local:5000,
        Transaction: listen=0.0.0.0:6000, advertised=neo4j-core-2.neo4j.neo4j.svc.cluster.local:6000, 
        Raft:        listen=0.0.0.0:7000, advertised=neo4j-core-2.neo4j.neo4j.svc.cluster.local:7000, 
        Client Connector Addresses: bolt://neo4j-core-2.neo4j.neo4j.svc.cluster.local:7687,http://neo4j-core-2.neo4j.neo4j.svc.cluster.local:7474,https://neo4j-core-2.neo4j.neo4j.svc.cluster.local:7473
]
2018-10-25 23:22:40.318+0000 INFO  Discovering cluster with initial members: [neo4j-core-0.neo4j.neo4j.svc.cluster.local:5000]
2018-10-25 23:22:40.318+0000 INFO  Attempting to connect to the other cluster members before continuing...
2018-10-25 23:28:12.707+0000 ERROR Failed to start Neo4j: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@53499d85' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.". Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@53499d85' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.".
org.neo4j.server.ServerStartupException: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@53499d85' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join a cluster with members {clusterId=null, bootstrappable=false, coreMembers={}}. Another member should have published a clusterId but none was detected. Please restart the cluster.".

mabushey avatar Oct 25 '18 23:10 mabushey

Deleted the PVCs in case the old data was messing it up:

$ kubectl -n neo4j delete pvc datadir-neo4j-core-0
persistentvolumeclaim "datadir-neo4j-core-0" deleted
$ kubectl -n neo4j delete pvc datadir-neo4j-core-1
persistentvolumeclaim "datadir-neo4j-core-1" deleted
$ kubectl -n neo4j delete pvc datadir-neo4j-core-2
persistentvolumeclaim "datadir-neo4j-core-2" deleted

That wasn't it...

mabushey avatar Oct 25 '18 23:10 mabushey

I suspect I'm running into an Istio routing issue. I added an http- prefix to all the port names... Still a no go.

mabushey avatar Oct 25 '18 23:10 mabushey

$ kubectl exec -it neo4j-core-0 bash  -n neo4j
Defaulting container name to neo4j.

bash-4.4# ping neo4j-core-0.neo4j.neo4j.svc.cluster.local
PING neo4j-core-0.neo4j.neo4j.svc.cluster.local (100.96.2.29): 56 data bytes

bash-4.4# ping neo4j-core-1.neo4j.neo4j.svc.cluster.local
PING neo4j-core-1.neo4j.neo4j.svc.cluster.local (100.96.4.46): 56 data bytes

bash-4.4# ping neo4j-core-2.neo4j.neo4j.svc.cluster.local
PING neo4j-core-2.neo4j.neo4j.svc.cluster.local (100.96.5.33): 56 data bytes

Added an Istio egress rule for dl-cdn.alpinelinux.org and installed curl on neo4j-core-0:

bash-4.4# curl 127.0.0.1:5000
curl: (7) Failed to connect to 127.0.0.1 port 5000: Connection refused
bash-4.4# curl neo4j-core-2.neo4j.neo4j.svc.cluster.local:5000
curl: (56) Recv failure: Connection reset by peer
bash-4.4# curl neo4j-core-1.neo4j.neo4j.svc.cluster.local:5000
curl: (56) Recv failure: Connection reset by peer
bash-4.4# curl neo4j-core-0.neo4j.neo4j.svc.cluster.local:5000
curl: (56) Recv failure: Connection reset by peer

mabushey avatar Oct 25 '18 23:10 mabushey

I pretty much followed https://neo4j.com/developer/kb/a-light-weight-approach-to-validating-network-port-connectivity/ - the containers can reach each other on port 5000, but they don't sync.
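
A plain TCP reachability check of the kind described above can be sketched in Python as follows. This is a minimal sketch, not the method from the linked article: the helper name `can_connect` and the 3-second timeout are assumptions made for illustration, and it needs a machine or pod with Python available.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it only tests the TCP handshake.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call, using a host name from this thread:
# can_connect("neo4j-core-1.neo4j.neo4j.svc.cluster.local", 5000)
```

A `True` result only shows that something accepted the connection on that port. It does not show that the discovery protocol itself works end to end, which would be consistent with pods that can reach each other but still do not form a cluster.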

mabushey avatar Oct 26 '18 21:10 mabushey

It appears that NEO4J_causal__clustering_discovery__type is the old style; it has been replaced with NEO4J_causalClustering_initialDiscoveryMembers.

Setting:

          - name: NEO4J_causalClustering_initialDiscoveryMembers
            value: neo4j-core-0.neo4j.neo4j.svc.cluster.local:5000, neo4j-core-1.neo4j.neo4j.svc.cluster.local:5000, neo4j-core-2.neo4j.neo4j.svc.cluster.local:5000

gives this in the log file of neo4j-core-2:

2018-10-26 22:06:57.188+0000 INFO  Resolved initial host 'neo4j-core-0.neo4j.neo4j.svc.cluster.local:5000' to []
2018-10-26 22:06:57.189+0000 INFO  Resolved initial host 'neo4j-core-1.neo4j.neo4j.svc.cluster.local:5000' to []
2018-10-26 22:06:57.189+0000 INFO  Resolved initial host 'neo4j-core-2.neo4j.neo4j.svc.cluster.local:5000' to [100.96.2.51:5000]

If I restart neo4j-core-2 after the other two are up:

2018-10-26 22:09:37.355+0000 INFO  Resolved initial host 'neo4j-core-0.neo4j.neo4j.svc.cluster.local:5000' to [100.96.4.53:5000]
2018-10-26 22:09:37.356+0000 INFO  Resolved initial host 'neo4j-core-1.neo4j.neo4j.svc.cluster.local:5000' to [100.96.5.52:5000]
2018-10-26 22:09:37.356+0000 INFO  Resolved initial host 'neo4j-core-2.neo4j.neo4j.svc.cluster.local:5000' to [100.96.2.52:5000]

So it looks like I'm using a newer version of Neo4j (3.3.6), which has entirely different env var names, i.e. NEO4J_causal__clustering_initial__discovery__members -> NEO4J_causalClustering_initialDiscoveryMembers. Can someone make a version of this conf that works for newer Neo4j versions? It is nice to see the names with double underscores get fixed; that was really dumb.
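
For comparison, the double-underscore style follows the naming scheme documented for the official Neo4j Docker image: the `NEO4J_` prefix is dropped, a single underscore stands for a period, and a double underscore stands for a literal underscore. The sketch below applies that mapping; the helper name `env_to_conf_key` is hypothetical, and whether a given image version accepts a particular variable is an assumption to verify against that image's documentation.

```python
PREFIX = "NEO4J_"


def env_to_conf_key(env_name):
    """Map a NEO4J_* environment variable name to a neo4j.conf key.

    Hypothetical helper; assumes the Docker image naming scheme in which
    "_" encodes "." and "__" encodes "_".
    """
    if not env_name.startswith(PREFIX):
        raise ValueError(f"expected a name starting with {PREFIX!r}: {env_name!r}")
    body = env_name[len(PREFIX):]
    placeholder = "\0"  # temporary marker that cannot occur in a variable name
    # Protect double underscores first, then turn single underscores into dots.
    return body.replace("__", placeholder).replace("_", ".").replace(placeholder, "_")


print(env_to_conf_key("NEO4J_causal__clustering_discovery__type"))
# causal_clustering.discovery_type
```

Under this scheme, NEO4J_causal__clustering_initial__discovery__members corresponds to the setting causal_clustering.initial_discovery_members.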

mabushey avatar Oct 26 '18 22:10 mabushey

Apparently CORE mode doesn't work in Kubernetes. It's working just fine in SINGLE mode.

mabushey avatar Oct 26 '18 22:10 mabushey

I'm having the same problem using the official Neo4j Helm chart. No success. Isn't the Helm chart tested somehow?

JannikZed avatar Jan 27 '19 21:01 JannikZed

Both the cluster discovery type and the initial cluster members are needed in order to start the DB in mode=CORE (causal clustering).

The discovery type tells Neo4j how to discover the other nodes; the initial discovery members setting is the address used for that discovery.

See here for an example of configuration from a different repo which uses a variant of this chart: https://github.com/neo-technology/neo4j-google-k8s-marketplace/blob/3.5/chart/templates/core-statefulset.yaml#L39

You'll notice that discovery type is set to DNS, meaning that Neo4j expects to find a DNS name that resolves to multiple A records. The initial discovery members setting is a service address that points to the core stateful set. This service address will resolve to multiple A records, one per pod launched.
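
To see which addresses a service name currently resolves to, a small lookup along the following lines can help. It is a sketch under assumptions: the helper name `resolve_a_records` is hypothetical, it uses only the standard `socket` module, and it has to run somewhere that uses the cluster's DNS (for example inside a pod with Python available).

```python
import socket


def resolve_a_records(host, port):
    """Return the sorted, de-duplicated IPv4 addresses that host resolves to.

    Hypothetical helper for illustration; returns an empty list when the
    name does not resolve.
    """
    try:
        infos = socket.getaddrinfo(
            host, port, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


# Example call, using the service name from this thread:
# resolve_a_records("neo4j.neo4j.svc.cluster.local", 5000)
```

An empty list corresponds to the "Resolved initial host ... to []" log lines earlier in the thread, while one address per core pod corresponds to the case where the service name resolved to three addresses.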

This repo is unfortunately a bit out of date, but you can find a generic Helm chart, which has to go through various tests for acceptance, here: https://github.com/helm/charts/tree/master/stable/neo4j. The Google Kubernetes Marketplace repo is here: https://github.com/neo-technology/neo4j-google-k8s-marketplace/

moxious avatar Jan 28 '19 14:01 moxious

Windows 10, neo4j-community-5.10.0-windows

Start command: neo4j console

Starting Neo4j.
Error occurred during initialization of VM
Too small maximum heap
Neo4j web server failed to start. See log for more info.
Run with '--verbose' for a more detailed error message.


duoduodady avatar Aug 04 '23 04:08 duoduodady