
Memory usage of non-validator nodes grows indefinitely, leading to OOM and unclean shutdown

Open hhsel opened this issue 11 months ago • 1 comments

Expected behaviour

When running a QBFT cluster, memory usage should stay within a moderate range as long as the cluster is not busy.

Actual behaviour

Memory usage of QBFT non-validator nodes grows over time at roughly 50 MB/day when, for example, the cluster keeps producing empty blocks every second. The non-validator nodes are eventually killed by the OOM killer. I have seen this on nodes with 2 GB and 4 GB of memory, which took about 1 and 2 months respectively to be killed.
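As a rough consistency check on the figures above, here is a minimal sketch that assumes strictly linear growth and treats MB/GB as binary units. The function names and the `baseline_mb` parameter are hypothetical and only for illustration. The per-block figure also assumes that the growth is spread evenly over blocks, which has not been verified.

```python
def days_until_oom(limit_mb: float, growth_mb_per_day: float,
                   baseline_mb: float = 0.0) -> float:
    """Days until memory reaches limit_mb, assuming linear growth.

    baseline_mb is the (assumed) memory in use right after startup.
    """
    if growth_mb_per_day <= 0:
        raise ValueError("growth_mb_per_day must be positive")
    return (limit_mb - baseline_mb) / growth_mb_per_day


def bytes_per_block(growth_mb_per_day: float, block_period_s: float) -> float:
    """Average bytes retained per block, if growth is spread evenly over blocks."""
    blocks_per_day = 86_400 / block_period_s
    return growth_mb_per_day * 1024 * 1024 / blocks_per_day


if __name__ == "__main__":
    for limit_gb in (2, 4):
        days = days_until_oom(limit_gb * 1024, 50)
        print(f"{limit_gb} GB node: about {days:.0f} days")
    print(f"about {bytes_per_block(50, 1.0):.0f} bytes retained per block")
```

With a zero baseline this gives roughly 41 and 82 days, which lines up with the observed 1 and 2 months. A real node starts with a non-zero baseline, so the actual time to OOM would be somewhat shorter.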

OOM causes an unclean shutdown, which means the node loses any intermediate state that has not been persisted to disk. Memory usage grows indefinitely even when the cluster is producing only empty blocks and doing almost nothing on the chain. In my case, all 8 non-validator nodes in the cluster show the same behaviour.

Validator nodes, on the other hand, show a similar tendency, but I have observed several sudden drops in their memory usage (the drops are neither regular nor predictable, but occur about once every 1-2 weeks).

As a result, I must watch the memory usage of non-validator nodes closely and, when it gets high, take the nodes out of the load balancer and restart them to avoid OOM.
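The manual check described above could be automated along these lines. This is only a sketch under several assumptions: the node runs on Linux, its resident memory is read from the `VmRSS` line of `/proc/<pid>/status`, and the 80% threshold is an arbitrary illustrative value. The helper names are hypothetical, and the actual draining from the load balancer and restarting is deployment-specific and left out.

```python
import re


def parse_vm_rss_kb(status_text: str) -> int:
    """Extract the resident set size (kB) from /proc/<pid>/status content."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS line not found")
    return int(match.group(1))


def read_rss_kb(pid: int) -> int:
    """Read the current RSS of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kb(f.read())


def should_restart(rss_kb: int, limit_kb: int, threshold: float = 0.8) -> bool:
    """True once RSS reaches the given fraction of the memory limit."""
    return rss_kb >= limit_kb * threshold


if __name__ == "__main__":
    # Example with a captured status snippet instead of a live process.
    sample = "Name:\tgeth\nVmRSS:\t 1700000 kB\nThreads:\t12\n"
    limit_kb = 2 * 1024 * 1024  # a 2 GB node
    if should_restart(parse_vm_rss_kb(sample), limit_kb):
        print("memory high: drain from load balancer and restart")
```

Run periodically (for example from cron) with `read_rss_kb(pid)`, this only flags the node before the OOM killer acts. It works around the growth and does not address its cause.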

Steps to reproduce the behaviour

Start a QBFT cluster with an arbitrary number of non-validator nodes, and let the cluster produce empty blocks. Memory usage of non-validator nodes grows indefinitely, causing OOM after some months.

hhsel, Jul 21 '23 23:07