
[BUG] Out of memory exception

Open · timofeevmd opened this issue 9 months ago · 1 comment

OS and Environment

Linux, AWS, k8s

GIT commit hash

f348b9a8

Minimum working example / Steps to reproduce

The perf report includes the full list of preconditions.

#### 0. Test objective

  • Apply the load at the required volume.
  • Apply the load over an extended period of time.

#### 1. Infrastructure

  • iroha version: version="2.0.0-rc.1" git_commit_sha="f348b9a8"
  • java sdk version: commit efeb5a233e
  • iroha2-perf version: commit 041736f
  • 5 peers

### SPECIAL CONDITIONS FOR TEST STAND PREPARATION

  • We increased the disks for the Longlive (longevity) environment to 20 GB.

  • Ingress was initially given more resources (scaled 2x horizontally) at the start of the load test to handle the peak load during the sudden ramp-up.

  • Kubernetes relocates pods within the cluster based on priorities. By default, Iroha has priority over other applications, but services like NGINX have a higher priority, which makes sense. For this test, I increased the priority of our iroha2-test pods.

#### 2. Images/config

### PREPARATION OF THE LONGEVITY ENV

Access to standard monitoring tools

On the perf generator:

```bash
git clone https://github.com/soramitsu/iroha2-perf.git &&
# enter the cloned repo before switching branches (this step appears to be missing in the original)
cd iroha2-perf &&
git checkout iroha/2_0_0-rc_1/keypair &&
cd performance-generator/ &&
mvn -N io.takari:maven:wrapper &&
./mvnw gatling:test -Dgatling.simulationClass=simulation.transactions.rampConstant.TransferAssetSimulation -DtargetURL=  -DremoteLogin=  -DremotePassword= -DstartLevelUsers=0 -DendLevelUsers=234 -DrampDuring=4500 -DstageDuration=86400 -DmaxDuration=86401
```

Actual result

Out of memory exception.

iroha2 logs: OpenSearch (link)

Kubernetes logs (link)

OOM (image)

Resource utilization (image)

Performance metrics (image)

Expected result

The load is applied evenly throughout the entire test, with no abnormal CPU or memory utilization.

Logs

```
Mar 20 01:23:31 ip-10-1-124-86 containerd: time="2025-03-20T01:23:31.467644696Z" level=info msg="TaskOOM event container_id:\"c59c9b3d5c12272d6c37de5d0d068ddb936b74a48e396ffab002bbeffd0a98a0\""
```

Who can help to reproduce?

@timofeevmd @RamilMus

Notes

No response

timofeevmd · Mar 21 '25

The issue is that Iroha consumes ~6GB of memory after 20 million transactions.

This matches current implementation (https://github.com/hyperledger-iroha/iroha/issues/5083#issuecomment-2379804636).

80%+ of the memory is consumed by State::transactions, which contains transaction hashes mapped to the block height where they are stored (basically a Map<Hash, usize>). State::transactions is a multi-version map with transactional behaviour; currently we use the mv crate. Memory usage could potentially be improved by using a specialized implementation for the transactions map. Here is a comparison of memory usage for various Map<Hash, usize> implementations:

| Map | Potential memory usage, bytes per transaction |
|-----|-----------------------------------------------|
| mv::Storage | 270 |
| mv::Storage with HashMap | 286 |
| rpds::RedBlackTreeMapSync | 112 |
| rpds::HashTrieMapSync | 168 |
| dashmap::DashMap | 69 |
| chashmap::CHashMap | 88 |
| concurrent_map::ConcurrentMap | 64 |
| std::collections::BTreeMap | 64 |
| std::collections::HashMap | 69 |
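For reference, per-entry numbers like these can be approximated with a counting global allocator. Here is a minimal sketch for std::collections::HashMap (the 32-byte hash size and the one-million-entry count are my assumptions, not the setup used to produce the table above):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Global allocator that tracks net allocated bytes.
struct CountingAlloc;

static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

fn main() {
    const N: usize = 1_000_000; // assumed entry count
    let before = ALLOCATED.load(Ordering::Relaxed);

    // Model a transaction hash as 32 raw bytes mapped to a block height.
    let mut map: HashMap<[u8; 32], usize> = HashMap::new();
    for height in 0..N {
        let mut hash = [0u8; 32];
        hash[..8].copy_from_slice(&(height as u64).to_le_bytes());
        map.insert(hash, height);
    }

    let after = ALLOCATED.load(Ordering::Relaxed);
    println!("~{} bytes per entry", (after - before) / N);
}
```

The result includes HashMap's amortized capacity overhead, which is why it lands above the 40 payload bytes per entry.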

Only the mv maps can be used directly for our needs. Maps from the rpds crate would give roughly a 2x memory improvement and can be adapted to our use case relatively easily, since they provide persistent behaviour. The other maps could potentially give a ~3x memory improvement, but they require a custom implementation of the multi-version and transactional logic.
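To illustrate why persistent maps adapt easily: each insert returns a new map while old snapshots remain readable, which is essentially the multi-version behaviour we need. A minimal sketch (the hash is again modelled as raw 32 bytes):

```rust
// Assumes rpds as a dependency.
use rpds::HashTrieMapSync;

fn main() {
    // Version 1: committed state before the new block.
    let v1: HashTrieMapSync<[u8; 32], usize> = HashTrieMapSync::new_sync();

    // Version 2: a transaction hash recorded at block height 7.
    // `insert` returns a new map; `v1` stays valid as an old snapshot.
    let hash = [1u8; 32];
    let v2 = v1.insert(hash, 7);

    assert_eq!(v1.get(&hash), None);
    assert_eq!(v2.get(&hash), Some(&7));
}
```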

So the plan is to implement a custom solution based on some concurrent map with low memory usage (I think dashmap::DashMap is a good choice), and if that turns out not to be possible, implement a simpler solution using rpds.
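A rough sketch of the shape such a custom solution could take (the names and the staging approach are illustrative assumptions on my part, not a final design): committed entries live in a DashMap, writes for the block being applied are staged in a buffer that becomes visible only on commit, and dropping the buffer without committing acts as a rollback.

```rust
// Assumes dashmap as a dependency. All names here are illustrative.
use dashmap::DashMap;

type Hash = [u8; 32];
type Height = usize;

/// Committed transaction hashes, shared between threads.
struct TransactionsMap {
    committed: DashMap<Hash, Height>,
}

/// Staged writes for the block currently being applied.
struct BlockTransaction<'a> {
    base: &'a TransactionsMap,
    staged: Vec<(Hash, Height)>,
}

impl TransactionsMap {
    fn new() -> Self {
        Self { committed: DashMap::new() }
    }

    fn get(&self, hash: &Hash) -> Option<Height> {
        self.committed.get(hash).map(|entry| *entry)
    }

    fn block(&self) -> BlockTransaction<'_> {
        BlockTransaction { base: self, staged: Vec::new() }
    }
}

impl BlockTransaction<'_> {
    fn insert(&mut self, hash: Hash, height: Height) {
        self.staged.push((hash, height));
    }

    /// Reads see staged writes first, then the committed state.
    fn get(&self, hash: &Hash) -> Option<Height> {
        self.staged
            .iter()
            .rev()
            .find(|(h, _)| h == hash)
            .map(|(_, height)| *height)
            .or_else(|| self.base.get(hash))
    }

    /// Publish staged writes; dropping without commit discards them.
    fn commit(self) {
        for (hash, height) in self.staged {
            self.base.committed.insert(hash, height);
        }
    }
}

fn main() {
    let state = TransactionsMap::new();
    let mut block = state.block();
    block.insert([1u8; 32], 42);

    // Staged write is visible inside the block, but not committed yet.
    assert_eq!(block.get(&[1u8; 32]), Some(42));
    assert_eq!(state.get(&[1u8; 32]), None);

    block.commit();
    assert_eq!(state.get(&[1u8; 32]), Some(42));
}
```

This keeps only one committed version plus the staged delta, which is where the memory savings over a fully multi-version structure would come from.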

dima74 · Apr 14 '25