ergo icon indicating copy to clipboard operation
ergo copied to clipboard

Node crash after node restart

Open bekman1 opened this issue 3 years ago • 11 comments

Hi guys, i set a Ergo node on my MacOS BigSur 11.4 openjdk version "11.0.11" 2021-04-20 OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9) OpenJDK 64-Bit Server VM AdoptOpenJDK-11.0.11+9 (build 11.0.11+9, mixed mode) Ergo Protocol Reference Client 4.0.12

By first node start all works ok, but if i stop the node and run again, the node start to sync, get some peer connection, but after short time i get my computer overloaded and this java error:

Uncaught error from thread [ergoref-akka.actor.default-dispatcher-5]: Java heap space, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[ergoref] java.lang.OutOfMemoryError: Java heap space at java.base/java.util.Arrays.copyOf(Unknown Source) at java.base/java.util.ArrayList.grow(Unknown Source) at java.base/java.util.ArrayList.grow(Unknown Source) at java.base/java.util.ArrayList.add(Unknown Source) at java.base/java.util.ArrayList.add(Unknown Source) at org.iq80.leveldb.impl.WriteBatchImpl.delete(WriteBatchImpl.java:70) at org.iq80.leveldb.impl.WriteBatchImpl.delete(WriteBatchImpl.java:31) at scorex.db.LDBVersionedStore.$anonfun$cleanStart$1(LDBVersionedStore.scala:254) at scorex.db.LDBVersionedStore.$anonfun$cleanStart$1$adapted(LDBVersionedStore.scala:253) at scorex.db.LDBVersionedStore$$Lambda$2638/0x00000008008f8840.apply(Unknown Source) at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:74) at scorex.db.LDBVersionedStore.cleanStart(LDBVersionedStore.scala:253) at scorex.db.LDBVersionedStore.update(LDBVersionedStore.scala:228) at scorex.crypto.authds.avltree.batch.VersionedLDBAVLStorage.$anonfun$update$1(VersionedLDBAVLStorage.scala:60) at scorex.crypto.authds.avltree.batch.VersionedLDBAVLStorage$$Lambda$2632/0x00000008008b8840.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.util.Try$.apply(Try.scala:213) at scorex.crypto.authds.avltree.batch.VersionedLDBAVLStorage.update(VersionedLDBAVLStorage.scala:52) at scorex.crypto.authds.avltree.batch.PersistentBatchAVLProver.generateProofAndUpdateStorage(PersistentBatchAVLProver.scala:25) at org.ergoplatform.nodeView.state.UtxoState.$anonfun$applyModifier$2(UtxoState.scala:113) at org.ergoplatform.nodeView.state.UtxoState$$Lambda$2617/0x000000080090d840.apply(Unknown Source) at scala.util.Success.$anonfun$map$1(Try.scala:255) at scala.util.Success.map(Try.scala:213) at org.ergoplatform.nodeView.state.UtxoState.$anonfun$applyModifier$1(UtxoState.scala:110) at org.ergoplatform.nodeView.state.UtxoState$$Lambda$1973/0x0000000800ba1040.apply(Unknown Source) at scala.util.Success.flatMap(Try.scala:251) at org.ergoplatform.nodeView.state.UtxoState.applyModifier(UtxoState.scala:109) at org.ergoplatform.nodeView.state.UtxoState.applyModifier(UtxoState.scala:34) at scorex.core.NodeViewHolder.$anonfun$applyState$1(NodeViewHolder.scala:249) at scorex.core.NodeViewHolder$$Lambda$1929/0x0000000800b8e040.apply(Unknown Source) at scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162)

Unlock wallet is also imposible after node restart. Whats wrong? Regards

bekman1 avatar Jun 13 '21 10:06 bekman1

Give it more memory, ie. run the java command with parameter java -Xmx1024m.

pragmaxim avatar Jun 14 '21 10:06 pragmaxim

1024m not help(( attempt with 2048m and 4096m also not help. Looks like a memory leak in code. Programm on start use 1.36-1.39 GB RAM, but after a short time usage jump to 4.35 GB and i got same error "Out of memory"

18:44:36.924 INFO [tor.default-dispatcher-23] s.c.n.peer.PeerManager - /159.65.11.55:9030 penalized, penalty: NonDeliveryPenalty 18:44:36.924 INFO [tor.default-dispatcher-23] s.c.n.peer.PeerManager - /159.65.11.55:9030 penalized, penalty: NonDeliveryPenalty 18:44:37.148 INFO [tor.default-dispatcher-19] o.e.n.ErgoNodeViewSynchronizer - Peer ConnectedPeer(ConnectionId(remote=/159.65.11.55:9030, local=/192.168.178.91:56812, direction=Outgoing)) has not delivered asked modifier 5985b0ac13c7f89a2c8b08f3ebfce3f5c783eb915bde3bad2c340b08ad86fd92 on time 18:44:37.148 INFO [tor.default-dispatcher-23] s.c.n.peer.PeerManager - /159.65.11.55:9030 penalized, penalty: NonDeliveryPenalty 18:44:39.066 INFO [tor.default-dispatcher-22] s.c.n.NetworkController - Failed to connect to : /104.186.109.55:9030 akka.io.TcpOutgoingConnection$$anon$2: Connect timeout of Some(1 second) expired 18:44:42.524 INFO [tor.default-dispatcher-11] s.c.n.NetworkController - Connecting to peer: PeerInfo(PeerSpec(ergoref,Version(4,0,10),legatumpool-node-1,Some(/63.250.54.96:9030),Vector(ModeFeature(utxo,true,None,-1))),0,None) Uncaught error from thread [ergoref-akka.actor.default-dispatcher-15]: Java heap space, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[ergoref] java.lang.OutOfMemoryError: Java heap space at java.base/java.util.Arrays.copyOf(Unknown Source) at java.base/java.util.ArrayList.grow(Unknown Source) at java.base/java.util.ArrayList.grow(Unknown Source) at java.base/java.util.ArrayList.add(Unknown Source) at java.base/java.util.ArrayList.add(Unknown Source) at org.iq80.leveldb.impl.WriteBatchImpl.delete(WriteBatchImpl.java:70) at org.iq80.leveldb.impl.WriteBatchImpl.delete(WriteBatchImpl.java:31) at scorex.db.LDBVersionedStore.$anonfun$cleanStart$1(LDBVersionedStore.scala:254) at scorex.db.LDBVersionedStore.$anonfun$cleanStart$1$adapted(LDBVersionedStore.scala:253) at scorex.db.LDBVersionedStore$$Lambda$2641/0x0000000800913840.apply(Unknown Source) at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:74) at scorex.db.LDBVersionedStore.cleanStart(LDBVersionedStore.scala:253) at scorex.db.LDBVersionedStore.update(LDBVersionedStore.scala:228) at scorex.crypto.authds.avltree.batch.VersionedLDBAVLStorage.$anonfun$update$1(VersionedLDBAVLStorage.scala:60) at scorex.crypto.authds.avltree.batch.VersionedLDBAVLStorage$$Lambda$2635/0x00000008008f3c40.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.util.Try$.apply(Try.scala:213) at scorex.crypto.authds.avltree.batch.VersionedLDBAVLStorage.update(VersionedLDBAVLStorage.scala:52) at scorex.crypto.authds.avltree.batch.PersistentBatchAVLProver.generateProofAndUpdateStorage(PersistentBatchAVLProver.scala:25) at org.ergoplatform.nodeView.state.UtxoState.$anonfun$applyModifier$2(UtxoState.scala:113) at org.ergoplatform.nodeView.state.UtxoState$$Lambda$2620/0x0000000800eb5040.apply(Unknown Source) at scala.util.Success.$anonfun$map$1(Try.scala:255) at scala.util.Success.map(Try.scala:213) at org.ergoplatform.nodeView.state.UtxoState.$anonfun$applyModifier$1(UtxoState.scala:110) at org.ergoplatform.nodeView.state.UtxoState$$Lambda$1956/0x0000000800ba2840.apply(Unknown Source) at scala.util.Success.flatMap(Try.scala:251) at org.ergoplatform.nodeView.state.UtxoState.applyModifier(UtxoState.scala:109) at org.ergoplatform.nodeView.state.UtxoState.applyModifier(UtxoState.scala:34) at scorex.core.NodeViewHolder.$anonfun$applyState$1(NodeViewHolder.scala:249) at scorex.core.NodeViewHolder$$Lambda$1913/0x0000000800b90040.apply(Unknown Source) at scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162) at scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:162) 18:44:42.531 INFO [tor.default-dispatcher-12] o.e.n.UtxoNodeViewHolder - Apply modifier 3028384d4201f18befa4479ed8607d388e945fd63930764d854d85812f2b22d5 of type 108 to nodeViewHolder [ERROR] [SECURITY][06/14/2021 18:44:42.533] [ergoref-akka.actor.default-dispatcher-15] [akka.actor.ActorSystemImpl(ergoref)] Uncaught error from thread [ergoref-akka.actor.default-dispatcher-15]: Java heap space, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[ergoref] error always at the same place, after ConnectedPeer has not delivered asked modifier

bekman1 avatar Jun 14 '21 16:06 bekman1

@bekman1 @pragmaxim on MacOS, pure-Java port of LevelDB is used (as native LevelDB via JNI had problems with crashing all the time).

If error is about the same (and having scorex.db.LDBVersionedStore.cleanStart(LDBVersionedStore.scala:253 in its stacktrace) all the time,you can try following setting

ergo { ... node { ... keepVersions = 20 } }

kushti avatar Jun 15 '21 22:06 kushti

added in ergo.conf keepVersions = 20 , started java with 4GB heap, but also got OutOfMemory error. But this time error come at this place:

10:16:54.223 INFO [tor.default-dispatcher-18] s.c.n.NetworkController - Connecting to peer: PeerInfo(PeerSpec(unknown,Version(0,0,1),unknown-/209.217.206.254:9030,Some(/209.217.206.254:9030),List()),0,None) 10:16:54.223 WARN [tor.default-dispatcher-18] s.c.n.NetworkController - Connection to peer /209.217.206.254:9030 is already established 10:16:59.414 INFO [ctor.default-dispatcher-4] s.c.n.NetworkController - Connecting to peer: PeerInfo(PeerSpec(ergoref,Version(4,0,10),ergo-mainnet-4.0.0,Some(/159.65.11.55:9030),Vector(ModeFeature(utxo,true,None,-1), SessionIdPeerFeature([B@4c09a27,7884669804957321718))),1623914123569,Some(Outgoing)) 10:16:59.414 WARN [ctor.default-dispatcher-4] s.c.n.NetworkController - Connection to peer /159.65.11.55:9030 is already established

Uncaught error from thread [ergoref-akka.actor.default-dispatcher-17]: Java heap space, shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[ergoref] java.lang.OutOfMemoryError: Java heap space at java.base/java.util.Arrays.copyOf(Unknown Source) 10:16:59.416 INFO [ctor.default-dispatcher-4] o.e.n.UtxoNodeViewHolder - Apply modifier c08642ef85ae81042338563e50909bde9edb4592832d9ea78e95cfcf50a1fc40 of type 108 to nodeViewHolder at java.base/java.util.ArrayList.grow(Unknown Source) at java.base/java.util.ArrayList.grow(Unknown Source) at java.base/java.util.ArrayList.add(Unknown Source) at java.base/java.util.ArrayList.add(Unknown Source) at org.iq80.leveldb.impl.WriteBatchImpl.delete(WriteBatchImpl.java:70) at org.iq80.leveldb.impl.WriteBatchImpl.delete(WriteBatchImpl.java:31) at scorex.db.LDBVersionedStore.$anonfun$cleanStart$1(LDBVersionedStore.scala:254) at scorex.db.LDBVersionedStore.$anonfun$cleanStart$1$adapted(LDBVersionedStore.scala:253) at scorex.db.LDBVersionedStore$$Lambda$2617/0x00000008009f4040.apply(Unknown Source) at scala.collection.immutable.NumericRange.foreach(NumericRange.scala:74) at scorex.db.LDBVersionedStore.cleanStart(LDBVersionedStore.scala:253) at scorex.db.LDBVersionedStore.update(LDBVersionedStore.scala:228) at scorex.crypto.authds.avltree.batch.VersionedLDBAVLStorage.$anonfun$update$1(VersionedLDBAVLStorage.scala:60) at scorex.crypto.authds.avltree.batch.VersionedLDBAVLStorage$$Lambda$2611/0x00000008009fdc40.apply$mcV$sp(Unknown Source) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at scala.util.Try$.apply(Try.scala:213) at scorex.crypto.authds.avltree.batch.VersionedLDBAVLStorage.update(VersionedLDBAVLStorage.scala:52) at scorex.crypto.authds.avltree.batch.PersistentBatchAVLProver.generateProofAndUpdateStorage(PersistentBatchAVLProver.scala:25) at org.ergoplatform.nodeView.state.UtxoState.$anonfun$applyModifier$2(UtxoState.scala:113) at org.ergoplatform.nodeView.state.UtxoState$$Lambda$2596/0x0000000800a10040.apply(Unknown Source) at scala.util.Success.$anonfun$map$1(Try.scala:255) at scala.util.Success.map(Try.scala:213) at org.ergoplatform.nodeView.state.UtxoState.$anonfun$applyModifier$1(UtxoState.scala:110) at org.ergoplatform.nodeView.state.UtxoState$$Lambda$1958/0x0000000800ba6840.apply(Unknown Source) at scala.util.Success.flatMap(Try.scala:251) at org.ergoplatform.nodeView.state.UtxoState.applyModifier(UtxoState.scala:109) at org.ergoplatform.nodeView.state.UtxoState.applyModifier(UtxoState.scala:34) at scorex.core.NodeViewHolder.$anonfun$applyState$1(NodeViewHolder.scala:249) at scorex.core.NodeViewHolder$$Lambda$1914/0x0000000800b93840.apply(Unknown Source) at scala.collection.TraversableOnce.$anonfun$foldLeft$1(TraversableOnce.scala:162) at scala.collection.TraversableOnce.$anonfun$foldLeft$1$adapted(TraversableOnce.scala:162)

bekman1 avatar Jun 17 '21 07:06 bekman1

The mac database has been replaced, please open a new issue if the problem persists in the latest node release.

glasgowm148 avatar Feb 02 '22 11:02 glasgowm148

I seem to have this same or similar problem using ARM in particular. x86 machines with similar specs (2core, 4gb ram) can sync and run just fine. I stuck to x86, but figured I should at least document the instance:

ergo.log

dreams-money avatar Feb 08 '22 23:02 dreams-money

The problem is in pure-Java LevelDB implementation which is nightmare it seems

kushti avatar Feb 09 '22 13:02 kushti

The Github repo does seem to have a seemingly similar 2020 open issue. Would it be appropriate to mention this?

dreams-money avatar Feb 12 '22 09:02 dreams-money

Yes, could be related, thanks!

kushti avatar Feb 14 '22 00:02 kushti

I know nothing of ergo, but intrigued with the issue been reported to Java Leveldb.

I made my some testing on ergo docker image cdd2cdc38420 to reproduce the issue.

To produce the OutOfMemoryError I did the following:

  1. Removed leveldb native implementation from META-INF\native folder in ergo.jar to force usage of Java implementation.
  2. Used docker run --rm -p 9030:9030 -p 127.0.0.1:9053:9053 -v /localdatafolder:/home/ergo/.ergo --entrypoint "java" ergoplatform/ergo -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/localdatafolder -jar /home/ergo/.ergo/modifiedergo.jar --mainnet --config /home/ergo/.ergo/ergo.conf
  3. Wait for out of memory issue. (same with restarting the container)

By inspecting the heap dump, 98% of the space used by a leveldb object WriteBatchImpl, retained by scorex.db.LDBVersionedStore$$Lambda$1189, from a ForkJoin thread: ergoref-akka.actor.default-dispatcher-5. So this is not a memory leak, but how Ergo code operate. In my case, I see a single WriteBatchImpl, being created by LDBVersionedStore with almost 37300000 Entries at the moment of failure.

Leveldb jni and pure Java have different memory model/consumption so using native implementation could also produce memory issues, just eventually later.

pcmind avatar Apr 18 '22 14:04 pcmind

I know nothing of ergo, but intrigued with the issue been reported to Java Leveldb.

I made my some testing on ergo docker image cdd2cdc38420 to reproduce the issue.

To produce the OutOfMemoryError I did the following:

  1. Removed leveldb native implementation from META-INF\native folder in ergo.jar to force usage of Java implementation.
  2. Used docker run --rm -p 9030:9030 -p 127.0.0.1:9053:9053 -v /localdatafolder:/home/ergo/.ergo --entrypoint "java" ergoplatform/ergo -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/localdatafolder -jar /home/ergo/.ergo/modifiedergo.jar --mainnet --config /home/ergo/.ergo/ergo.conf
  3. Wait for out of memory issue. (same with restarting the container)

By inspecting the heap dump, 98% of the space used by a leveldb object WriteBatchImpl, retained by scorex.db.LDBVersionedStore$$Lambda$1189, from a ForkJoin thread: ergoref-akka.actor.default-dispatcher-5. So this is not a memory leak, but how Ergo code operate. In my case, I see a single WriteBatchImpl, being created by LDBVersionedStore with almost 37300000 Entries at the moment of failure.

Leveldb jni and pure Java have different memory model/consumption so using native implementation could also produce memory issues, just eventually later.

@kushti @pragmaxim Is there a way to curve what @pcmind found? - I'm wondering if this would solve the sync issues I've been hearing about (mostly PI syncing issues).

dreams-money avatar Apr 30 '22 12:04 dreams-money