
Abnormal memory consumption in private cluster

[Open] KirillLykov opened this issue 1 year ago · 6 comments

Problem

While running a private cluster with 2 nodes and one client on master, I've discovered the following memory trend:

[Screenshot 2022-08-29 at 16:10:02]

What happened at 18:20: the bench-tps client was restarted, because it is configured to run for 7500s in the net scripts. The two plots correspond to the two nodes.

There are 3 problems with this memory plot:

  1. Memory consumption grows over time. (For some reason this growth was triggered by relaunching bench-tps.)
  2. Memory consumption is spiky, which signals that a lot of allocations/deallocations are happening.
  3. The width of the spikes is growing.

Problems (2) and (3) are present for both nodes, while problem (1) affects only the blue one.

Memory profiling

I've profiled with the dhat tool (special thanks to @alessandrod for showing me this tool). Profiling worked only on a one-node cluster, and it also had severe performance consequences. Note that for the run shown below I've replaced into_par_iter with into_iter here to get a more informative call stack in the profile output.
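For illustration, a minimal sketch (plain Rust with rayon, not the actual accounts_db code) of the kind of swap used for the profiling run: replacing the parallel iterator with a sequential one keeps the whole scan on one thread, so dhat reports a single readable call stack instead of stacks that bottom out in rayon worker threads.

// Illustrative only, not the actual accounts_db code: the same kind of swap,
// replacing a rayon parallel iterator with a sequential one so that dhat sees
// one readable call stack instead of stacks ending in rayon worker threads.
use rayon::prelude::*;

fn process(chunk: &[u8]) -> usize {
    chunk.len()
}

// Production-style parallel scan over chunks.
fn scan_chunks_parallel(chunks: Vec<Vec<u8>>) -> usize {
    chunks.into_par_iter().map(|c| process(&c)).sum()
}

// Sequential variant used only for the profiling run.
fn scan_chunks_sequential(chunks: Vec<Vec<u8>>) -> usize {
    chunks.into_iter().map(|c| process(&c)).sum()
}

fn main() {
    let chunks = vec![vec![0u8; 1024]; 8];
    assert_eq!(
        scan_chunks_parallel(chunks.clone()),
        scan_chunks_sequential(chunks),
    );
}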

For this run with memory profiling we observe the following:

[Screenshot 2022-09-02 at 14:37:38]

One important observation is that the spike event is at ~13:40.

In the profile output, the leaf call is the following:

[Screenshot 2022-09-02 at 09:51:30]

So the suspect is solana_runtime::accounts_db::AppendVecScan::init_accum (runtime/src/accounts_db.rs:1817:13).

The call to init_accum grows a Vec<Vec<_>>. The size of this vector is ~32640. From the plot above we observe that num_chunks grows, and this growth starts at 13:41. The log shows the same: [Screenshot 2022-09-02 at 10:32:35]

The dhat report can be found in this folder and can be opened with https://nnethercote.github.io/dh_view/dh_view.html

Checking out v1.11.10

To check whether the problem is present in v1.11.10, we did another run of the private cluster. This time the memory profile looks different; in particular, it lacks (3): the memory spikes do not grow, and the slot count is stable: [Screenshot 2022-09-02 at 15:03:40]

Hypothesis

After looking into this data with @alessandrod, we came up with the hypothesis that the memory spikes are due to scan_account_storage_no_bank.

I think the amplification of the spikes is not present in v1.11.10 because this commit was not included in that version.

The spikes themselves are due to memory allocations during chunk scanning.
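For illustration, a minimal sketch of the suspected allocation pattern (hypothetical code, not the real AppendVecScan implementation): if every scanned chunk initializes its own accumulator with ~32640 bins, one scan allocates roughly num_chunks * bins vectors and frees them all when it finishes, which would produce exactly these spikes, with their width growing as num_chunks grows.

// Hypothetical sketch of the suspected pattern, not the real implementation.
const BINS: usize = 32_640; // approximate Vec<Vec<_>> size seen in the dhat profile

// Stand-in for AppendVecScan::init_accum: one bucket per bin.
fn init_accum(bins: usize) -> Vec<Vec<u64>> {
    vec![Vec::new(); bins]
}

fn scan_storage(num_chunks: usize) -> usize {
    let mut total_buckets = 0;
    for _chunk in 0..num_chunks {
        // One accumulator per chunk: this is where the per-scan memory spike
        // comes from, and why it widens as num_chunks grows.
        let accum = init_accum(BINS);
        total_buckets += accum.len();
        // `accum` is dropped here, producing the matching deallocation.
    }
    total_buckets
}

fn main() {
    println!("buckets touched: {}", scan_storage(4));
}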

Proposed Solution

KirillLykov commented on Sep 2, 2022

Similar behavior is now observed for branch v1.11 (e6a9c7e47c26f753f791509c77f6e1e39b8c5cbc): [Screenshot 2022-09-13 at 16:40:49]

It contains changes which contribute to the amplification of memory in the scanner, but it doesn't show the memory growth seen for master (6684c62280d824a03e789f0e100a78c634bb1f01) above, which might suggest that these are different problems.

KirillLykov commented on Sep 13, 2022

Current master (afa2ae2986cd704c0d7cb22bfa7238cd9e4372cc) shows a different trend: [Screenshot 2022-09-13 at 16:45:40]

Looks like the "fur" problem (the dense allocation/deallocation spikes) has been resolved by this PR: https://github.com/solana-labs/solana/pull/27461. The problem with memory growth is still there.

I've profiled the moment of growth with the memleak.py eBPF-based tool, which collects information about outstanding long-lived allocations:

   11448320 bytes in 48 allocations from stack
        0x00005649d792c13f  memmap2::os::MmapInner::map_mut::h103aef6f6c352286+0x9f [solana-validator]
        0x00005649d723be23  memmap2::MmapMut::map_mut::hc878c70e96a171e1+0x53 [solana-validator]
        0x00005649d704eae2  solana_runtime::append_vec::AppendVec::new::h037be5d3956f93e1+0x182 [solana-validator]
        0x00005649d71aca25  solana_runtime::accounts_db::AccountStorageEntry::new::hd6792b2f457b2a6e+0x115 [solana-validator]
        0x00005649d71c5912  solana_runtime::accounts_db::AccountsDb::create_store::h90136ccd567e2d1b+0xa2 [solana-validator]
        0x00005649d71bc863  solana_runtime::accounts_db::AccountsDb::get_store_for_shrink::hc059c51f5d663fba+0x1a3 [solana-validator]
        0x00005649d719d2f2  core::ops::function::impls::_$LT$impl$u20$core..ops..function..FnMut$LT$A$GT$$u20$for$u20$$RF$F$GT$::call_mut::h236c9325c20e4fd9+0xaf2 [solana-validator]
        0x00005649d71106a6  _$LT$core..iter..adapters..map..Map$LT$I$C$F$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$::fold::hd7f27a87336308e5+0x76 [solana-validator]
        0x00005649d726c303  rayon::iter::plumbing::bridge_producer_consumer::helper::h846dac16a50cb7e8+0x63 [solana-validator]
        0x00005649d725f1d3  rayon_core::registry::in_worker::he184540d33da092c+0x2b3 [solana-validator]
        0x00005649d726c3bf  rayon::iter::plumbing::bridge_producer_consumer::helper::h846dac16a50cb7e8+0x11f [solana-validator]
        0x00005649d725f1d3  rayon_core::registry::in_worker::he184540d33da092c+0x2b3 [solana-validator]
        0x00005649d726c3bf  rayon::iter::plumbing::bridge_producer_consumer::helper::h846dac16a50cb7e8+0x11f [solana-validator]
        0x00005649d725f1d3  rayon_core::registry::in_worker::he184540d33da092c+0x2b3 [solana-validator]
        0x00005649d726c3bf  rayon::iter::plumbing::bridge_producer_consumer::helper::h846dac16a50cb7e8+0x11f [solana-validator]
        0x00005649d703f3a2  _$LT$rayon..vec..IntoIter$LT$T$GT$$u20$as$u20$rayon..iter..IndexedParallelIterator$GT$::with_producer::h7be1d112c1b44ff8+0xc2 [solana-validator]
        0x00005649d70ebf2d  _$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::hd5c2c25b22f16984+0xed [solana-validator]
        0x00005649d709bdf6  _ZN83_$LT$rayon_core..job..StackJob$LT$L$C$F$C$R$GT$$u20$as$u20$rayon_core..job..Job$GT$7execute17h3645fdebe8435b28E.llvm.7304792509247438073+0x56 [solana-validator]
        0x00005649d5cef648  rayon_core::registry::WorkerThread::wait_until_cold::hf719c0333e1c469c+0x308 [solana-validator]
        0x00005649d7891a30  rayon_core::registry::ThreadBuilder::run::h218b6e6a0796ae65+0x310 [solana-validator]
        0x00005649d7894a31  _ZN3std10sys_common9backtrace28__rust_begin_short_backtrace17h98bf7e27e7f462daE.llvm.4672532708118862921+0x41 [solana-validator]
        0x00005649d788e2d0  core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h1b17b96de63b704e+0xb0 [solana-validator]
        0x00005649d7b109b3  std::sys::unix::thread::Thread::new::thread_start::h5d4e11bbda4161c8+0x23 [solana-validator]
   107479040 bytes in 8 allocations from stack
        0x00005649d7ac6293  _rjem_je_pages_map+0x63 [solana-validator]
    657244160 bytes in 400 allocations from stack
        0x00005649d792c13f  memmap2::os::MmapInner::map_mut::h103aef6f6c352286+0x9f [solana-validator]
        0x00005649d723be23  memmap2::MmapMut::map_mut::hc878c70e96a171e1+0x53 [solana-validator]
        0x00005649d704eae2  solana_runtime::append_vec::AppendVec::new::h037be5d3956f93e1+0x182 [solana-validator]
        0x00005649d71aca25  solana_runtime::accounts_db::AccountStorageEntry::new::hd6792b2f457b2a6e+0x115 [solana-validator]
        0x00005649d71c5912  solana_runtime::accounts_db::AccountsDb::create_store::h90136ccd567e2d1b+0xa2 [solana-validator]
        0x00005649d71cb343  solana_runtime::accounts_db::AccountsDb::flush_rooted_accounts_cache::h4e80df79ac468be7+0x813 [solana-validator]
        0x00005649d71c9403  solana_runtime::accounts_db::AccountsDb::flush_accounts_cache::h1297d04cdc823b70+0xc3 [solana-validator]
        0x00005649d704c0a5  solana_runtime::accounts_background_service::SnapshotRequestHandler::handle_snapshot_requests::h3095f05990a42be2+0x2b5 [solana-validator]
        0x00005649d7077033  std::sys_common::backtrace::__rust_begin_short_backtrace::h1903f68ea30a084b+0x393 [solana-validator]
        0x00005649d71e3378  core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h78829cde464f478b+0x98 [solana-validator]
        0x00005649d7b109b3  std::sys::unix::thread::Thread::new::thread_start::h5d4e11bbda4161c8+0x23 [solana-validator]

The most interesting allocation is the last one. It looks like several functions in the call chain were inlined; the probable call stack is the following:

create_store
create_and_insert_store_with_paths
create_and_insert_store
do_flush_slot_cache
flush_slot_cache_with_clean
flush_rooted_accounts_cache
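For illustration, a minimal sketch (hypothetical code, not the real AppendVec/create_store implementation) of why each create_store shows up as a large outstanding allocation in the memleak output: the store's backing file is created at its full capacity and mapped writable in one go, so the whole mapping is attributed to the memmap2::MmapMut::map_mut frame for as long as the store lives, regardless of how many accounts are actually written into it.

// Hypothetical sketch, not the real AppendVec/create_store code.
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

fn create_store_file(path: &Path, capacity: u64) -> io::Result<memmap2::MmapMut> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(path)?;
    // Reserve the full store capacity up front.
    file.set_len(capacity)?;
    // Map the whole file writable; this corresponds to the map_mut frame in the
    // stack traces above. The mapping stays outstanding for the store's lifetime.
    unsafe { memmap2::MmapMut::map_mut(&file) }
}

fn main() -> io::Result<()> {
    let mmap = create_store_file(Path::new("/tmp/example_store.bin"), 16 * 1024 * 1024)?;
    println!("mapped {} bytes", mmap.len());
    Ok(())
}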

KirillLykov commented on Sep 13, 2022

Another question is why memory grows on one machine but not on the other. This is unclear.

KirillLykov commented on Sep 13, 2022

To rule out that the problem is caused by an unreachable supermajority, I've tried the same with 5 nodes:

cd net/
./gce.sh create -n 4 -c 1
./init-metrics.sh <login>
./net.sh start --profile 

Looks the same as for 2 nodes: [Screenshot 2022-09-15 at 21:16:14]

As before, there are 2 problems:

  1. Frequent ~10 GB allocations/deallocations
  2. Memory grows on one of the nodes

The cause of problem (1) has been identified with high certainty (see above). The cause of problem (2) is not clear.

Observations:

  1. The memory grows only on one of the nodes (Cujx...)

  2. All the traffic is received by this node

  3. This node has a very low number of QUIC streams compared to the others: 277 vs 380K. [Screenshot 2022-09-15 at 21:24:36]

  4. Checking outstanding allocations on this node points to the create_store call within flush_rooted_accounts_cache (see the comments above), but when I fixed and checked the corresponding metric, I didn't observe an increase in the number of calls. It might be that these store objects are kept somewhere and never released; this hasn't been checked yet, I guess. (A minimal counter sketch follows this list.) [Screenshot 2022-09-15 at 22:47:34]
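A hypothetical sketch of the kind of counter used to check the create_store call rate; the validator reports such counts through its metrics pipeline, but for illustration this just bumps a plain atomic counter and prints it.

// Hypothetical illustration, not the actual solana-metrics instrumentation.
use std::sync::atomic::{AtomicU64, Ordering};

static CREATE_STORE_COUNT: AtomicU64 = AtomicU64::new(0);

fn create_store() {
    // Count every call so the rate can be compared against the memory peaks.
    CREATE_STORE_COUNT.fetch_add(1, Ordering::Relaxed);
    // ... allocate the backing file and mmap it ...
}

fn report_create_store_count() {
    // In the validator this would be submitted as a metrics datapoint;
    // here we just print the running total.
    println!(
        "create_store calls so far: {}",
        CREATE_STORE_COUNT.load(Ordering::Relaxed)
    );
}

fn main() {
    for _ in 0..3 {
        create_store();
    }
    report_create_store_count();
}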

KirillLykov commented on Sep 15, 2022

A note about create_store and memory peaks: if we look at the times of the memory peaks and at the create_store count, they coincide: [Screenshot 2022-09-16 at 15:36:38]

KirillLykov commented on Sep 16, 2022

Using TpuClient instead of ThinClient leads to a uniform traffic distribution among the nodes, and this lowers the memory growth trend (though it is still present). Surprisingly, it also shows an increase in the amplitude of the allocations/deallocations:

[Screenshot 2022-09-20 at 17:57:18]

This run was for cccade42b3ea73dc8be39c659d1b5e8f950b341b

The holes in the traffic are due to client restarts, which happen when the client detects that the blockhash hasn't been updated for more than 120s.
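For reference, a rough sketch of the client-side difference mentioned above, assuming the solana-client TpuClient API of that era (TpuClient::new, TpuClientConfig, send_transaction; exact signatures may differ between versions): instead of pushing everything to a single node's TPU the way ThinClient does, TpuClient fans each transaction out to the upcoming leaders, which is why the traffic becomes uniform across the cluster.

// Sketch only; assumes the solana-client TpuClient API of the 1.11/1.14 era.
use std::sync::Arc;

use solana_client::rpc_client::RpcClient;
use solana_client::tpu_client::{TpuClient, TpuClientConfig};
use solana_sdk::transaction::Transaction;

fn send_via_tpu(
    rpc_url: &str,
    websocket_url: &str,
    transactions: &[Transaction],
) -> Result<(), Box<dyn std::error::Error>> {
    let rpc_client = Arc::new(RpcClient::new(rpc_url.to_string()));
    // TpuClientConfig controls fanout_slots, i.e. how many upcoming leaders
    // each transaction is forwarded to; the default is used here.
    let tpu_client = TpuClient::new(rpc_client, websocket_url, TpuClientConfig::default())?;
    for tx in transactions {
        // Best-effort send to the current and upcoming leaders' TPU ports.
        tpu_client.send_transaction(tx);
    }
    Ok(())
}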

KirillLykov commented on Sep 20, 2022