
validator-engine memory leak (when validating)

Open Skydev0h opened this issue 4 years ago • 10 comments

When validating, validator-engine memory usage rises gradually over time until the process gets reaped by the OOM killer.

I have tried to collect some evidence with jemalloc (kudos to Arseny for the idea). For now I suspect that some ValidatorSessionDescriptionImpl instances are unnecessarily immortal and accumulate over time.

Another thing that caught my eye is that ValidatorSessionImpl has a start_up method but no tear_down method. It should be noted that half of those Descs are spawned in ValidatorSessionImpl::start_up calling ValidatorSessionState::move_to_persistent, and the other half are created by create_actor.
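For readers unfamiliar with the actor framework, the lifecycle hooks in question look roughly like this (a minimal hypothetical sketch, not the actual ValidatorSessionImpl code; the class, the member, and the allocation are made up purely for illustration):

```cpp
#include "td/actor/actor.h"

// Sketch of an actor that allocates state in start_up() but never releases it,
// because no tear_down() override is provided. The real ValidatorSessionImpl is
// far more involved; this only illustrates the start_up/tear_down pair.
class LeakySession : public td::actor::Actor {
 public:
  void start_up() override {
    // Hypothetical allocation standing in for persistent session state.
    state_ = new char[1 << 20];
  }
  // No tear_down() override: since state_ is a raw pointer, nothing frees it
  // when the actor is destroyed.
  // void tear_down() override { delete[] state_; }

 private:
  char* state_{nullptr};
};

// Actors of this kind are normally spawned with create_actor, e.g.:
//   auto session = td::actor::create_actor<LeakySession>("leaky-session");
```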

Some food for thought (my attempts at building some deltas):

[attached jemalloc heap profile deltas: app-delta, app-delta-2, app-delta-3]

Skydev0h · Jan 28 '20 18:01

There is actually another thing that is gradually rising in memory usage, at the very bottom of the graph. It may be the culprit (I don't see much of a rise in ValidatorSessionDescriptionImpl, it is still 2048+2048, while the bottom one slowly rises).

[attached profile delta: app-delta-4]

The Arena near the top is slowly becoming a colosseum too.

Skydev0h · Jan 28 '20 19:01

The last profile before the OOM reaper indicates that ValidatorSessionDescriptionImpl still uses the same 4096 MB of memory in total, while the box at the bottom (rocksdb::BlockFetcher::PrepareBufferForBlockFromFile) keeps rising, slowly but steadily. Judging by how its pace matches the normal rise in memory usage, I think it may be the culprit.

[attached profile delta: app-delta-5]

Skydev0h · Jan 29 '20 00:01

Focusing on that little box, these are its changes over time: [four successive profile snapshots]. And the colosseum: [four more snapshots]. Looks like a typical memory leak, either in a third-party library or from improper usage of its results. The pace at which this allocation grows nearly corresponds to the pace of the overall memory usage increase. Due to some GC-like behaviour it is difficult to judge from the memory graph alone; I will analyze it later.

Skydev0h · Jan 29 '20 00:01

A memory profile of the validator over night. Despite the sawtooth pattern, its lower edge steadily rises over time, and that slow and steady rise has nearly the same pace as the growth of that last block I found.

[attached graph: mem-graph]

Approximately 200 MB over 14500 seconds (~4 hours), that is 50 MB per hour. At that rate it would take, for example, just 13.6 days to fill 16 GB of RAM. And that is not counting the sudden spikes that are not reclaimed later.
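For reference, the 13.6-day figure is just the observed rate extrapolated (assuming the ~50 MB/hour lower-edge growth stays constant):

$$
\frac{16\ \text{GB}}{50\ \text{MB/h}} = \frac{16384\ \text{MB}}{50\ \text{MB/h}} \approx 328\ \text{h} \approx 13.6\ \text{days}
$$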

An Excel graph with a grid and moving average is more representative: [attached graph]. BTW, to be more precise, that RSS drop occurs nearly every 236 seconds; at least that is the horizontal grid interval.

Skydev0h · Jan 29 '20 06:01

It may be possible that I have found a fix, but it still needs several days of testing. Maybe not. But it looks a little more manageable, with some more shelves. Before: [attached graph]. After: [attached graph].

Skydev0h · Jan 30 '20 21:01

@Skydev0h Glad to see someone tackling this. The workaround for now is to use a systemd service to restart the validator-engine whenever it crashes due to this leak. It's great for testing failure of the core system :)

You're talking about 13.6 days, but I've had very mixed results: sometimes it takes an hour, sometimes a week. It does not correlate with processing incoming messages; you can run a 'master node' (with its own zerostate) and just watch it eat itself.

hortonelectric · Feb 01 '20 07:02

@ton-blockchain I seem to have stabilized, to some extent, at least one possible leak factor. It requires more testing, but the graph looks more stable now (at least it keeps memory usage steady for longer). This may be a first step, but not the last.

[attached graph]

Please notice how the constant memory growth stopped after the cache (the 1 GB NewLRUCache(1 << 30), maybe?) got filled up; it now possibly leaks for some other reason (I think those sudden +1 GB and +2 GB spikes are the arena allocations for ValidatorSessionDescriptionImpl mentioned earlier). I am not even making a PR because it is a one-line change.

[attached diff screenshot]

This may theoretically degrade performance under some circumstances, but otherwise index and filter blocks are stored on the heap, are never evicted, and may consume memory without limit. It may also be reasonable to set cache_index_and_filter_blocks_with_high_priority, but so far I have not observed performance problems or an increased number of SLOW entries in the validator logs.

[attached graph]
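For illustration, the kind of one-line change described above would look roughly like this against the stock RocksDB API (a hedged sketch, not the actual patch; the surrounding option setup is an assumption, only cache_index_and_filter_blocks is the change in question):

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Sketch: charge index and filter blocks to the bounded block cache instead of
// keeping them in unbounded, never-evicted heap memory.
rocksdb::Options MakeBoundedOptions() {
  rocksdb::BlockBasedTableOptions table_options;

  // The 1 GB LRU cache mentioned above (1 << 30 bytes).
  table_options.block_cache = rocksdb::NewLRUCache(1 << 30);

  // The "one-line change": index/filter blocks now live in (and can be
  // evicted from) the block cache, so their memory usage is capped.
  table_options.cache_index_and_filter_blocks = true;

  // Optional follow-up from the comment: keep index/filter blocks in the
  // high-priority pool so hot data blocks do not evict them.
  // table_options.cache_index_and_filter_blocks_with_high_priority = true;

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```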

Skydev0h · Feb 01 '20 10:02

@hortonelectric does the masternode do validation tasks, or is it just a simple full node? I did not observe memory leaks on a simple full node.

Skydev0h · Feb 01 '20 15:02

Yes, I am talking about a validator... In my regtest setup I use a single node that does everything, hence "master".


hortonelectric · Feb 02 '20 03:02

Got the same problem, the memory leaks in waves.

jetaweb · Jun 22 '22 18:06