aptos-core icon indicating copy to clipboard operation
aptos-core copied to clipboard

Use DashMap as the hash map in some places.

Open grao1991 opened this issue 2 years ago • 12 comments

Description

Observed lock contention at these places. Let's start with Dashmap. We might be able to make additional optimization in a different way (e.g. cache hash, shard manually, etc.) in the future.

Test Plan

grao1991 avatar Jan 24 '23 22:01 grao1991

Forge is running suite land_blocking on 335099f380a66dc69b452e9be6c703cefcc722a8

github-actions[bot] avatar Jan 24 '23 22:01 github-actions[bot]

Forge is running suite compat on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 335099f380a66dc69b452e9be6c703cefcc722a8

github-actions[bot] avatar Jan 24 '23 22:01 github-actions[bot]

:white_check_mark: Forge suite land_blocking success on 335099f380a66dc69b452e9be6c703cefcc722a8

performance benchmark with full nodes : 5682 TPS, 6995 ms latency, 9000 ms p99 latency,(!) expired 100 out of 2426320 txns
Test Ok

github-actions[bot] avatar Jan 24 '23 23:01 github-actions[bot]

:white_check_mark: Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 335099f380a66dc69b452e9be6c703cefcc722a8

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 335099f380a66dc69b452e9be6c703cefcc722a8 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7255 TPS, 5337 ms latency, 7000 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 335099f380a66dc69b452e9be6c703cefcc722a8
compatibility::simple-validator-upgrade::single-validator-upgrade : 4770 TPS, 8328 ms latency, 11200 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 335099f380a66dc69b452e9be6c703cefcc722a8
compatibility::simple-validator-upgrade::half-validator-upgrade : 4009 TPS, 10537 ms latency, 14500 ms p99 latency,no expired txns
4. upgrading second batch to new version: 335099f380a66dc69b452e9be6c703cefcc722a8
compatibility::simple-validator-upgrade::rest-validator-upgrade : 5976 TPS, 6699 ms latency, 11900 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 335099f380a66dc69b452e9be6c703cefcc722a8 passed
Test Ok

github-actions[bot] avatar Jan 24 '23 23:01 github-actions[bot]

@grao1991 - Looking at the land-blocking TPS, I think this is actually introducing a regression. I recall @gelash also tried similar change in past and we observed regression in land-blocking Forge. Can we compare the get state value counter for storage with and without Dashmap?

Also, IIUC, the caching in cached_state_view is useless today, as the MVHashmap already caches storage read, so I would be surprised if any improvement in cached_state_view is actually making any difference.

sitalkedia avatar Jan 24 '23 23:01 sitalkedia

Some numbers in the summary to give people an impression how much this is helping (and under what circumstances) would be nice.

For the cached state view one, I've observed significant overhead here (even for cache hit) in storage only benchmark (with fake executor simulate reads). Before this change, increasing the number of threads will significantly increase the latency here, from 8 threads -> <10us to 32 threads -> ~60us. With this change we can still get ~10us for cache hit when using 32 threads.

grao1991 avatar Jan 24 '23 23:01 grao1991

@grao1991 storing the hash inside statekey would be awesome if you can get to it, I think that should really help

gelash avatar Jan 25 '23 00:01 gelash

Can we indeed make sure there is no regression? If there is a regression, there is one place where we collect with dashmap where things could have gotten slower. Otherwise, if the lock is indeed contended, unclear why it would be significantly worse.

regarding usefulness though, @sitalkedia in case some item keeps getting read, but not written to, MVHashMap does not cache anything, so the storage cache will be important. Also, isn't cached_state_view useful between blocks ( or is it getting flushed too, I don't remember)?

I think it totally depends on how many threads we have. On forge only 8 threads, so the benefits are very minimal comparing to the overhead.

grao1991 avatar Jan 25 '23 00:01 grao1991

i don’t think the regression comes from this pr, i saw 5k number yesterday from another pr

zekun000 avatar Jan 25 '23 00:01 zekun000

https://github.com/aptos-labs/aptos-core/pull/6273#issuecomment-1401226431

zekun000 avatar Jan 25 '23 00:01 zekun000

regarding usefulness though, @sitalkedia in case some item keeps getting read, but not written to, MVHashMap does not cache anything, so the storage cache will be important. Also, isn't cached_state_view useful between blocks ( or is it getting flushed too, I don't remember)?

That's interesting, that we don't cache reads in MVHashMap. I guess in that case the storage cache is helpful. It is not cached across blocks BTW.

sitalkedia avatar Jan 25 '23 01:01 sitalkedia

That's interesting, that we don't cache reads in MVHashMap. I guess in that case the storage cache is helpful. It is not cached across blocks BTW.

The CachedStateView::state_cache is not, but the tree inside does have nodes from previous blocks that's not yet in StateStore.

msmouse avatar Jan 25 '23 02:01 msmouse

@grao1991 storing the hash inside statekey would be awesome if you can get to it, I think that should really help

Let me figure out how to do this in an elegant way later.

grao1991 avatar Jan 26 '23 21:01 grao1991

Forge is running suite compat on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 0e647906dd03316df1b81dde2f022d342243b099

github-actions[bot] avatar Jan 26 '23 23:01 github-actions[bot]

Forge is running suite land_blocking on 0e647906dd03316df1b81dde2f022d342243b099

github-actions[bot] avatar Jan 26 '23 23:01 github-actions[bot]

:white_check_mark: Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 0e647906dd03316df1b81dde2f022d342243b099

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 0e647906dd03316df1b81dde2f022d342243b099 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 8198 TPS, 4705 ms latency, 6700 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 0e647906dd03316df1b81dde2f022d342243b099
compatibility::simple-validator-upgrade::single-validator-upgrade : 4698 TPS, 8786 ms latency, 12000 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 0e647906dd03316df1b81dde2f022d342243b099
compatibility::simple-validator-upgrade::half-validator-upgrade : 4576 TPS, 8664 ms latency, 11600 ms p99 latency,no expired txns
4. upgrading second batch to new version: 0e647906dd03316df1b81dde2f022d342243b099
compatibility::simple-validator-upgrade::rest-validator-upgrade : 7021 TPS, 5613 ms latency, 10300 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 0e647906dd03316df1b81dde2f022d342243b099 passed
Test Ok

github-actions[bot] avatar Jan 26 '23 23:01 github-actions[bot]

:white_check_mark: Forge suite land_blocking success on 0e647906dd03316df1b81dde2f022d342243b099

performance benchmark with full nodes : 6217 TPS, 6377 ms latency, 9900 ms p99 latency,(!) expired 520 out of 2655300 txns
Test Ok

github-actions[bot] avatar Jan 26 '23 23:01 github-actions[bot]