
snapshot validate: unbound memory

Open LesnyRumcajs opened this issue 1 year ago • 22 comments

Issue summary

The memory needed to validate a mainnet snapshot appears to be unbounded. On a 16 GB RAM machine, it causes an OOM kill after a minute or so.

According to @lemmih, the culprit is the FVM

Sigh, I think the FVM is taking up 19GiB of RAM. We'll need to address that at some point.

Command: forest-cli snapshot validate <mainnet snapshot>
Commit: b03ca5d61a18c236fcaa9bfebeae706108e6ed85

Other information and links

LesnyRumcajs avatar Aug 01 '23 13:08 LesnyRumcajs

This kills one of our killer features :/

aatifsyed avatar Aug 01 '23 15:08 aatifsyed

Could it be mitigated by limiting parallelization in validate_tipsets?

hanabi1224 avatar Aug 01 '23 15:08 hanabi1224

Could it be mitigated by limiting parallelization in validate_tipsets?

Yep, thus killing our killer feature.

lemmih avatar Aug 08 '23 07:08 lemmih

I think the WASM engine settings might be to blame. https://github.com/filecoin-project/ref-fvm/blob/f31c6d3a64278f98270e5a13fc6e8be11e5c534e/fvm/src/engine/mod.rs#L137

    // wasmtime default: OnDemand
    // We want to pre-allocate all permissible memory to support the maximum allowed recursion limit.

Things to investigate:

  • [ ] Does re-initializing the MultiEngine reset the memory usage?
  • [ ] Do different wasm_config settings affect memory usage?

lemmih avatar Aug 08 '23 07:08 lemmih
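As a sketch of the two knobs mentioned in the checklist above, using wasmtime's Config API (the actual values and wiring live in ref-fvm's engine/mod.rs; this is an illustration, not the real configuration):

```rust
// Sketch only: the wasmtime settings referenced above. The real
// configuration is in ref-fvm's fvm/src/engine/mod.rs.
use wasmtime::{Config, InstanceAllocationStrategy};

fn engine_config(instance_memory_maximum_size: u64) -> Config {
    let mut c = Config::new();
    // wasmtime default: 4 GB of reserved static memory per instance.
    // Lowering this bounds what each engine can map.
    c.static_memory_maximum_size(instance_memory_maximum_size);
    // wasmtime default: OnDemand. The ref-fvm comment quoted above refers
    // to pre-allocating all permissible memory up front instead.
    c.allocation_strategy(InstanceAllocationStrategy::OnDemand);
    c
}
```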

Isn't wasm32 limited to 4 GB?

LesnyRumcajs avatar Aug 08 '23 07:08 LesnyRumcajs

Isn't wasm32 limited to 4 GB?

I think they even lower the limit from 4GiB to 2GiB. But they have a pool of engines, one for each core, each with a 2GiB limit.

    /// Maximum size of memory used during the entire (recursive) message execution. This currently
    /// includes Wasm memories and table elements and will eventually be extended to include IPLD
    /// blocks and actor code.
    ///
    /// DEFAULT: 2GiB
    pub max_memory_bytes: u64,
    // wasmtime default: 4GB
    c.static_memory_maximum_size(instance_memory_maximum_size);

lemmih avatar Aug 08 '23 07:08 lemmih

So on my 32 cores it would require 64GB?

LesnyRumcajs avatar Aug 08 '23 07:08 LesnyRumcajs

So on my 32 cores it would require 64GB?

As far as I can tell, yes.

lemmih avatar Aug 08 '23 07:08 lemmih
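The worst-case bound implied by the exchange above is simple arithmetic: one engine per core, each allowed up to 2 GiB. A minimal sketch:

```rust
// Back-of-the-envelope bound from the discussion above: a pool of
// engines, one per core, each with a 2 GiB memory limit.
fn worst_case_engine_memory(cores: u64) -> u64 {
    const MAX_MEMORY_BYTES: u64 = 2 * 1024 * 1024 * 1024; // 2 GiB per engine
    cores * MAX_MEMORY_BYTES
}

fn main() {
    let cores = 32;
    let bytes = worst_case_engine_memory(cores);
    // prints: 32 cores -> 64 GiB worst case
    println!("{} cores -> {} GiB worst case", cores, bytes >> 30);
}
```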

| Change Description | Network | No. of Threads | Epochs Validated | Snapshot Info | RSS | VSZ |
|---|---|---|---|---|---|---|
| BaseLine | Calibnet | 1 | 60 | forest_snapshot_calibnet_2023-08-14_height_822490.forest.car.zst (1.9Gb) | 739.01 MB | 3036361.56 MB |
| BaseLine | Calibnet | 2 | 60 | forest_snapshot_calibnet_2023-08-14_height_822490.forest.car.zst (1.9Gb) | 762.50 MB | 3036432.92 MB |
| BaseLine | Calibnet | 4 | 60 | forest_snapshot_calibnet_2023-08-14_height_822490.forest.car.zst (1.9Gb) | 846.67 MB | 3036739.99 MB |
| BaseLine | Calibnet | 8 | 60 | forest_snapshot_calibnet_2023-08-14_height_822490.forest.car.zst (1.9Gb) | 866.19 MB | 3037035.18 MB |
| BaseLine | Calibnet | 1 | 1999 | forest_snapshot_calibnet_2023-08-14_height_822490.forest.car.zst (1.9Gb) | 898.11 MB | 3036352.50 MB |
| BaseLine | Calibnet | 2 | 1999 | forest_snapshot_calibnet_2023-08-14_height_822490.forest.car.zst (1.9Gb) | 878.01 MB | 3036441.77 MB |
| BaseLine | Calibnet | 4 | 1999 | forest_snapshot_calibnet_2023-08-14_height_822490.forest.car.zst (1.9Gb) | 900.31 MB | 3036771.31 MB |
| BaseLine | Calibnet | 8 | 1999 | forest_snapshot_calibnet_2023-08-14_height_822490.forest.car.zst (1.9Gb) | 934.74 MB | 3037106.39 MB |
| BaseLine | Mainnet | 1 | 60 | forest_snapshot_mainnet_2023-08-14_height_3122221.forest.car.zst (57Gb) | 4020.62 MB | 3048476.51 MB |
| BaseLine | Mainnet | 2 | 60 | forest_snapshot_mainnet_2023-08-14_height_3122221.forest.car.zst (57Gb) | 4056.17 MB | 3048855.77 MB |
| BaseLine | Mainnet | 4 | 60 | forest_snapshot_mainnet_2023-08-14_height_3122221.forest.car.zst (57Gb) | 4107.39 MB | 3048985.30 MB |
| BaseLine | Mainnet | 8 | 60 | forest_snapshot_mainnet_2023-08-14_height_3122221.forest.car.zst (57Gb) | 4088.47 MB | 3048548.39 MB |
| BaseLine | Mainnet | 1 | 120 | forest_snapshot_mainnet_2023-08-14_height_3122221.forest.car.zst (57Gb) | 4519.46 MB | 3049692.50 MB |
| BaseLine | Mainnet | 2 | 120 | forest_snapshot_mainnet_2023-08-14_height_3122221.forest.car.zst (57Gb) | 4561.81 MB | 3049611.76 MB |
| BaseLine | Mainnet | 4 | 120 | forest_snapshot_mainnet_2023-08-14_height_3122221.forest.car.zst (57Gb) | 4613.37 MB | 3049918.31 MB |
| BaseLine | Mainnet | 8 | 120 | forest_snapshot_mainnet_2023-08-14_height_3122221.forest.car.zst (57Gb) | | |
| BaseLine | Mainnet | 8 | 1500 | forest_snapshot_mainnet_2023-08-14_height_3122221.forest.car.zst (57Gb) | 14523.47 MB | |

sudo-shashank avatar Aug 14 '23 09:08 sudo-shashank

@sudo-shashank What are you measuring?

lemmih avatar Aug 14 '23 10:08 lemmih

@sudo-shashank What are you measuring?

I'm trying to measure the memory held during the validate run using the ps -o rss= -p "$pid" command in a script.

sudo-shashank avatar Aug 14 '23 11:08 sudo-shashank
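The sampling described above can be sketched as a small loop. This sketch reads VmRSS from /proc (Linux), which is the same figure ps -o rss= reports; the pid and sampling count are placeholders for whatever the actual script used:

```rust
use std::fs;

/// Resident set size in kB for `pid`, read from /proc/<pid>/status.
/// This is the same number `ps -o rss= -p <pid>` prints.
/// Returns None if the process has exited.
fn rss_kb(pid: u32) -> Option<u64> {
    let status = fs::read_to_string(format!("/proc/{pid}/status")).ok()?;
    status
        .lines()
        .find(|l| l.starts_with("VmRSS:"))?
        .split_whitespace()
        .nth(1)?
        .parse()
        .ok()
}

fn main() {
    // In the measurement script this would be forest's pid; here we
    // sample our own process as a stand-in.
    let pid = std::process::id();
    let mut peak = 0;
    for _ in 0..5 {
        if let Some(rss) = rss_kb(pid) {
            peak = peak.max(rss);
        }
    }
    println!("peak RSS: {} kB", peak);
}
```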

How many epochs are you validating and how many threads are you using?

lemmih avatar Aug 14 '23 12:08 lemmih

(As noted in this issue, memory usage depends entirely on how many threads you're using, so that is vital information you must include in your results.)

lemmih avatar Aug 14 '23 12:08 lemmih

How many epochs are you validating and how many threads are you using?

60 epochs for now, 8 threads

sudo-shashank avatar Aug 14 '23 12:08 sudo-shashank

How many epochs are you validating and how many threads are you using?

60 epochs now, single core

For calibnet, that should only take a few seconds to evaluate. You'll get better data if you benchmark for longer than a few seconds.

lemmih avatar Aug 14 '23 13:08 lemmih

How many epochs are you validating and how many threads are you using?

60 epochs now, single core

When you say a single core, do you mean a single thread? Using a single thread to reproduce a problem that only happens when you use a lot of threads isn't wise.

lemmih avatar Aug 14 '23 13:08 lemmih

How many epochs are you validating and how many threads are you using?

60 epochs now, single core

When you say a single core, do you mean a single thread? Using a single thread to reproduce a problem that only happens when you use a lot of threads isn't wise.

I checked the config: I am using 4 cores and have 16 GB of RAM available. The expected peak RSS for forest snapshot validate was 8 GiB (4 × 2 GiB), but I am seeing a peak RSS of only 4 GiB for a mainnet snapshot validation.

sudo-shashank avatar Aug 14 '23 14:08 sudo-shashank

How many epochs are you validating and how many threads are you using?

60 epochs now, single core

When you say a single core, do you mean a single thread? Using a single thread to reproduce a problem that only happens when you use a lot of threads isn't wise.

I checked the config: I am using 4 cores and have 16 GB of RAM available. The expected peak RSS for forest snapshot validate was 8 GiB (4 × 2 GiB), but I am seeing a peak RSS of only 4 GiB for a mainnet snapshot validation.

The exact amount of memory used is not important. What is important is how the memory usage scales with the number of threads.

lemmih avatar Aug 14 '23 18:08 lemmih

In my observations so far, memory usage does not scale with the number of threads; rather, it scales with the number of epochs we validate. More epochs means more memory utilisation, peaking at a maximum of ~15 GiB for 1999 epochs, for both mainnet and calibnet snapshots.

sudo-shashank avatar Aug 23 '23 05:08 sudo-shashank

Moving @sudo-shashank to different tasks.

lemmih avatar Aug 31 '23 08:08 lemmih

I have tried this various times with forest-tool snapshot validate --check-links=0 forest_mainnet.forest.car --check-stateroots=2000

I have noticed that the memory usage depends on where we are in the queue.

For example, with ~1500 stateroots left in the queue, the memory usage is steady at ~12 GB and it manages to clean up the extra memory used just fine, but then it seems to start growing again. With 1100 items left in the queue, it's about 15 GB. So the further down the rabbit hole we go, the more memory is used.

The auto-detected parallelism is 10 on my machine.

I have tried a chunked approach, where the MultiEngine is reinitialised every n items, but that does not seem to have any impact when chunking by 100. Chunks of 20 do seem to have a positive impact on the memory footprint, but they hurt performance more, because we are forced to wait until the current chunk is processed before starting the next one.

ruseinov avatar Nov 15 '23 13:11 ruseinov
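The chunked approach described above can be sketched generically. Here `Engine` is a placeholder for the FVM MultiEngine (expensive to create, holds memory until dropped), and the chunk size is the knob being tuned; this is an illustration of the trade-off, not Forest's actual validation code:

```rust
use std::thread;

// Placeholder for the FVM MultiEngine: expensive to create, and it holds
// on to memory until it is dropped.
struct Engine;
impl Engine {
    fn new() -> Self {
        Engine
    }
    fn validate(&self, tipset: u64) -> u64 {
        tipset // placeholder for per-tipset validation work
    }
}

/// Validate `tipsets` in chunks, reinitializing the engine between chunks
/// so its memory can be released. Smaller chunks cap memory growth, but
/// each chunk boundary is a synchronization point: all threads must drain
/// the current chunk before the next one starts, costing throughput.
fn validate_chunked(tipsets: &[u64], chunk_size: usize, threads: usize) -> u64 {
    let mut total = 0u64;
    for chunk in tipsets.chunks(chunk_size) {
        let engine = Engine::new(); // fresh engine per chunk
        let engine = &engine;
        let per_thread = ((chunk.len() + threads - 1) / threads).max(1);
        total += thread::scope(|s| {
            let handles: Vec<_> = chunk
                .chunks(per_thread)
                .map(|part| {
                    s.spawn(move || part.iter().map(|t| engine.validate(*t)).sum::<u64>())
                })
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
        });
        // `engine` is dropped here, so its memory can be reclaimed before
        // the next chunk begins -- the behaviour the comment above was
        // trying to achieve.
    }
    total
}

fn main() {
    let tipsets: Vec<u64> = (1..=1999).collect();
    let total = validate_chunked(&tipsets, 100, 8);
    println!("validated work units: {total}");
}
```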

I have also tried the approach that initialises an engine for each tipset just to see what that does - that slows things down almost to a halt. I'm going to do memory profiling next to see what exactly is eating up the RAM. I'm concerned that the memory does not get cleaned up properly with the chunked approach and reinitialisation.

ruseinov avatar Nov 15 '23 18:11 ruseinov