Study alternative containers for Storage
The shared Storage sees few concurrent writes (from a couple of threads) and heavy concurrent reads (from many threads).
We are currently planning on using Arc<RwLock<PreHashMap<BlockId, Arc<RwLock<BlockInfo>>>>> as the underlying storage container.
I recently watched this: https://www.youtube.com/watch?v=HJ-719EGIts
Basically, it could be possible to do most of the work we want lock-free, with much higher performance!
I then looked around and found crates offering this kind of structure:
- https://docs.rs/flurry/latest/flurry/
- https://github.com/xacrimon/dashmap (with comparative benchmark: https://github.com/xacrimon/conc-map-bench )
- https://github.com/jonhoo/left-right
- https://docs.rs/evmap/latest/evmap/
It looks like those structures can yield an order-of-magnitude performance improvement under the right conditions (that look like the conditions we have).
However, depending on the implementation, there could be some unexpected problems such as a temporary lack of read/write consistency.
We need to:
- study the proposed containers (and look for others)
- get a more precise idea of the read/write concurrency and throughput expectations in our architecture. Maybe wait until the first version of Storage is implemented so we can run an actual benchmark
- try changing the underlying Storage container with the various options and bench the impact on performance
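To make the last point concrete, here is a minimal sketch of what swapping the container could look like, assuming the dashmap crate and stand-in BlockId/BlockInfo types (the real ones live in massa_models):

```rust
use std::sync::{Arc, RwLock};

use dashmap::DashMap;

// Stand-in types for illustration; not the real massa_models definitions.
type BlockId = u64;
struct BlockInfo {
    payload: Vec<u8>,
}

fn main() {
    // Current plan: Arc<RwLock<PreHashMap<BlockId, Arc<RwLock<BlockInfo>>>>>.
    // Alternative: drop the outer RwLock; DashMap shards the map internally
    // and only locks the shard it touches.
    let storage: Arc<DashMap<BlockId, Arc<RwLock<BlockInfo>>>> = Arc::new(DashMap::new());

    // No explicit outer lock on either operation.
    storage.insert(1, Arc::new(RwLock::new(BlockInfo { payload: vec![0xab] })));
    if let Some(entry) = storage.get(&1) {
        // Clone the inner Arc so the shard guard is released quickly.
        let block = Arc::clone(entry.value());
        drop(entry);
        println!("{} bytes", block.read().unwrap().payload.len());
    }
}
```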
Basically, it could be possible to do most of the work we want lock-free, with much higher performance!
I am doubtful about the performance gains of these structures.
For example, see the below excerpt from https://www.chromium.org/developers/lock-and-condition-variable/
Programmers assume that locking is expensive, and that using atomic operations will be more efficient. But in reality, acquiring and releasing a lock is cheaper than a cache miss; attention to cache behaviour is usually a more fruitful way to improve performance. Furthermore, lock-free data structures are often more expensive than using locks. This is because a lock allows arbitrary changes to be made to a complex data structure; if the same changes must be made without a lock, the result is likely to take more atomic read-modify-write instructions, not fewer.
In our specific situation, we have the below structure:
Arc<RwLock<Map<BlockId, Arc<RwLock<StoredBlock>>>>>
Most of the "work" will occur while locking an individual StoredBlock, so there should be little contention on the lock around the top-level map.
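A small sketch of that access pattern (with stand-in types, not the actual Storage code): the top-level lock is held only long enough to clone the inner Arc, and the real work then happens under the per-block lock:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical stand-ins for the real massa types.
type BlockId = u64;
struct StoredBlock {
    serialized: Vec<u8>,
}

type Storage = Arc<RwLock<HashMap<BlockId, Arc<RwLock<StoredBlock>>>>>;

fn read_block(storage: &Storage, id: BlockId) -> Option<usize> {
    let block = {
        let map = storage.read().unwrap(); // brief lock on the top-level map
        Arc::clone(map.get(&id)?)          // only clone the handle
    }; // top-level lock released here
    let guard = block.read().unwrap(); // per-block lock for the actual work
    Some(guard.serialized.len())
}

fn main() {
    let storage: Storage = Arc::new(RwLock::new(HashMap::new()));
    storage.write().unwrap().insert(
        1,
        Arc::new(RwLock::new(StoredBlock { serialized: vec![0u8; 3] })),
    );
    assert_eq!(read_block(&storage, 1), Some(3));
}
```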
That lock around the top-level map could be replaced by using some sort of lock-free map, however we should consider:
- Probably losing the optimization of our current pre-hashed maps.
- That going lock-free there probably doesn't really give any relevant performance boost.
- That we also may want the synchronization that comes with a lock. For example if a block is pruned, we may not want it to be able to be read logically after that point. Lock-free structures may have looser synchronization, and/or if they do allow such "hard" synchronization, they might not be any faster than simply a lock around a map.
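To illustrate the synchronization point in the last bullet, a minimal sketch with the lock-based map (stand-in types, not the actual pruning code): once the write lock taken for pruning is released, no later reader can still find the block, which is the "hard" guarantee a loosely consistent lock-free map might not give:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type BlockId = u64;
struct StoredBlock;

fn prune(
    storage: &Arc<RwLock<HashMap<BlockId, Arc<RwLock<StoredBlock>>>>>,
    id: BlockId,
) {
    let mut map = storage.write().unwrap(); // excludes all readers
    map.remove(&id);
    // Write lock released here: any read acquired afterwards misses the block.
}

fn main() {
    let storage = Arc::new(RwLock::new(HashMap::new()));
    storage
        .write()
        .unwrap()
        .insert(7u64, Arc::new(RwLock::new(StoredBlock)));
    prune(&storage, 7);
    assert!(storage.read().unwrap().get(&7).is_none());
}
```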
I don't have a full view of the code, so I can't say whether the synchronization will be a real problem.
But according to this SO answer: https://stackoverflow.com/a/5680988 it may depend on the algorithm and the use case. If we look at this benchmark: https://github.com/xacrimon/conc-map-bench with a lot of reads and few writes, it seems we can get a huge improvement.
I tried to run the benchmark, which is pretty cool: https://github.com/ibraheemdev/conc-map-bench (this fork has a fix for flurry) with sha256 from several crates (bitcoin_hashes and RustCrypto), but they don't implement the Hasher trait.
Following this issue: https://github.com/RustCrypto/hashes/issues/49 it's not a good idea to implement this trait for cryptographic hashes. Also, following the SO answer above (and my own view), the best way is to test in conditions close to the real ones, by updating the existing benchmark at https://github.com/xacrimon/conc-map-bench so that we can specify the number of writing and reading threads and use our own hashes.
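For context, the general trick behind a pre-hashed map is a pass-through Hasher that reuses bytes of an already-hashed key instead of hashing it again. This is only an illustration of the idea; the actual prehash types in massa_models may differ:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Pass-through hasher: each write() call overwrites the state with up to the
// first 8 bytes of the chunk, which is fine when the key is already a
// uniformly distributed cryptographic hash.
#[derive(Default)]
struct PassThroughHasher(u64);

impl Hasher for PassThroughHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        let mut buf = [0u8; 8];
        let n = bytes.len().min(8);
        buf[..n].copy_from_slice(&bytes[..n]);
        self.0 = u64::from_le_bytes(buf);
    }
}

type PreHashMap<K, V> = HashMap<K, V, BuildHasherDefault<PassThroughHasher>>;

fn main() {
    // Keys stand in for already-hashed 32-byte block ids.
    let mut map: PreHashMap<[u8; 32], &str> = PreHashMap::default();
    map.insert([0x42; 32], "block info");
    assert_eq!(map.get(&[0x42; 32]), Some(&"block info"));
}
```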
Btw @damip left-right is a child project of evmap : https://github.com/jonhoo/left-right/commit/abe3769feda942d26b6f24eac99035e1c80e04e0
I have forked and run the benchmark with Massa's hash and the prehash from massa_models. Here are the results:
Scenario Heavy Read: read 98%, insert 1%, remove 1%
Scenario Exchange: read 10%, insert 40%, remove 40%, update 10%
Scenario Rapid Grow: read 5%, insert 80%, remove 5%, update 10%
I am working on a fork of bustle (the crate used in the benchmark) to split read and write operations into dedicated threads and be able to specify how many read/write threads we want.
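This is the shape of the experiment, independent of bustle's own API (not bustle code, just a sketch against a plain RwLock-backed map): dedicated reader and writer threads, so the read/write mix is set by thread counts rather than per-operation probabilities:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Instant;

fn bench(read_threads: usize, write_threads: usize, ops_per_thread: u64) {
    let map = Arc::new(RwLock::new(HashMap::<u64, u64>::new()));
    let start = Instant::now();
    let mut handles = Vec::new();

    // Writer threads: each inserts into its own key range.
    for t in 0..write_threads {
        let map = Arc::clone(&map);
        handles.push(thread::spawn(move || {
            for i in 0..ops_per_thread {
                map.write().unwrap().insert(t as u64 * ops_per_thread + i, i);
            }
        }));
    }
    // Reader threads: repeatedly look up a small set of keys.
    for _ in 0..read_threads {
        let map = Arc::clone(&map);
        handles.push(thread::spawn(move || {
            for i in 0..ops_per_thread {
                let _ = map.read().unwrap().get(&(i % 1024)).copied();
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    println!(
        "{} readers / {} writers: {:?}",
        read_threads,
        write_threads,
        start.elapsed()
    );
}

fn main() {
    bench(9, 2, 100_000);
}
```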
Links :
- Fork of the benchmark : https://github.com/AurelienFT/conc-map-bench
- Fork of bustle : https://github.com/AurelienFT/bustle
@AurelienFT Thank you for your investigations.
It is expected that those maps will show an improvement over locks, since these lock-free structures are specifically designed with that in mind.
The question is whether it will show an improvement in our specific application, which will depend on whether the reading/writing is on the critical path.
Could be worth running benchmarks after the refactoring has been completed, and after having determined that the application spends a significant amount of time in that read/write workflow (either manipulating the map, or waiting on the lock).
The currently proposed API of Storage would make such experimentation quite easy, since the use of the lock around the top-level map is hidden from the user.
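For illustration, a guess at the shape of such an API (names and signatures are assumptions, not the actual proposed Storage): the lock only appears inside the method bodies, so the backing container can be swapped without touching call sites:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type BlockId = u64;
pub struct StoredBlock {
    pub serialized: Vec<u8>,
}

#[derive(Clone, Default)]
pub struct Storage {
    // The only place the top-level lock is visible.
    blocks: Arc<RwLock<HashMap<BlockId, Arc<RwLock<StoredBlock>>>>>,
}

impl Storage {
    pub fn store_block(&self, id: BlockId, block: StoredBlock) {
        self.blocks
            .write()
            .unwrap()
            .insert(id, Arc::new(RwLock::new(block)));
    }

    pub fn retrieve_block(&self, id: &BlockId) -> Option<Arc<RwLock<StoredBlock>>> {
        self.blocks.read().unwrap().get(id).cloned()
    }
}

fn main() {
    let storage = Storage::default();
    storage.store_block(1, StoredBlock { serialized: vec![1, 2, 3] });
    let block = storage.retrieve_block(&1).unwrap();
    assert_eq!(block.read().unwrap().serialized.len(), 3);
}
```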
Could be worth running benchmarks after the refactoring has been completed, and after having determined that the application spends a significant amount of time in that read/write workflow (either manipulating the map, or waiting on the lock).
I totally agree. Testing in real conditions is always best, and having metrics around the node is pretty useful and can show us whether there are real improvements or not.
I have successfully modified the two crates to allow specifying the number of read and write threads. It's a bit "hacky", and if we want to contribute it upstream I will have to refactor it a bit, but I don't think that's our priority right now.
Here are the results:
Scenario Heavy Read: read 98%, insert 1%, remove 1%
Scenario Exchange: read 10%, insert 40%, remove 40%, update 10%
Scenario Rapid Grow: read 5%, insert 80%, remove 5%, update 10%
If you want me to rerun with different data or thread counts, that's possible; you can also run it directly with my forks of the crates:
- Fork of the benchmark : https://github.com/AurelienFT/conc-map-bench
- Fork of bustle : https://github.com/AurelienFT/bustle
To run the benchmark on the fork you can run this command :
./scripts/bench.bash --read-threads="9" --write-threads="2" --read-threads="8" --write-threads="2" --read-threads="7" --write-threads="2" --read-threads="7" --write-threads="3" --read-threads="7" --write-threads="4"
Each --read-threads flag must be paired with a --write-threads flag; each pair defines one benchmark configuration, and as shown above you can pass multiple pairs. (I will try to change this to accept a comma-separated list directly.)
To render the plots :
./scripts/plot.bash
Conclusion
We can see that when we have a lot of reads, DashMap or Flurry could be nice. When we have a lot of writes, the locks start to be more efficient (which is not our case).
Dashmap works with sharding, that's smart, not magic and scalable!! 🧔 🤟
Flurry does too:
This map usually acts as a binned (bucketed) hash table. Each key-value mapping is held in a BinEntry.
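For intuition, a conceptual sketch of the sharding idea (not DashMap's actual internals): one logical map split into independently locked sub-maps, so threads hitting keys in different shards never contend with each other:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    fn new(n_shards: usize) -> Self {
        Self {
            shards: (0..n_shards).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    // Hash the key once to pick the shard that owns it.
    fn shard_of(&self, key: &K) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.shards.len()
    }

    fn insert(&self, key: K, value: V) {
        // Only this shard's lock is taken; the other shards stay available.
        let idx = self.shard_of(&key);
        self.shards[idx].write().unwrap().insert(key, value);
    }

    fn get(&self, key: &K) -> Option<V> {
        self.shards[self.shard_of(key)].read().unwrap().get(key).cloned()
    }
}

fn main() {
    let map = ShardedMap::new(16);
    map.insert("block_a".to_string(), 1u32);
    assert_eq!(map.get(&"block_a".to_string()), Some(1));
}
```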
New benchmark with random data as values instead of 0.
Here are the results:
Scenario Heavy Read: read 98%, insert 1%, remove 1%
Scenario Exchange: read 10%, insert 40%, remove 40%, update 10%
Scenario Rapid Grow: read 5%, insert 80%, remove 5%, update 10%
@gterzian Do you have a branch with the Storage implementation that is used at least once and can be run (not necessarily on the testnet, but on the labnet for example)?
Since there is an interface and the change of storage backend will be transparent to the rest of the code, I will try to make sub-branches of this one with Storage implemented using the DashMap crate and the Flurry crate, so that we will be able to test and benchmark them.
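As a sketch of what such a sub-branch could look like (assuming the dashmap crate and hypothetical stand-in types; the real Storage API may differ), only the private field and the method bodies change:

```rust
use std::sync::{Arc, RwLock};

use dashmap::DashMap;

type BlockId = u64;
pub struct StoredBlock;

#[derive(Clone)]
pub struct Storage {
    // main branch:
    // blocks: Arc<RwLock<HashMap<BlockId, Arc<RwLock<StoredBlock>>>>>,
    // DashMap experiment branch:
    blocks: Arc<DashMap<BlockId, Arc<RwLock<StoredBlock>>>>,
}

impl Storage {
    pub fn new() -> Self {
        Self { blocks: Arc::new(DashMap::new()) }
    }

    // Same signatures as the RwLock-backed version; only the bodies differ.
    pub fn store_block(&self, id: BlockId, block: StoredBlock) {
        self.blocks.insert(id, Arc::new(RwLock::new(block)));
    }

    pub fn retrieve_block(&self, id: &BlockId) -> Option<Arc<RwLock<StoredBlock>>> {
        self.blocks.get(id).map(|entry| Arc::clone(entry.value()))
    }
}

fn main() {
    let storage = Storage::new();
    storage.store_block(1, StoredBlock);
    assert!(storage.retrieve_block(&1).is_some());
}
```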
Also, if you have some tasks related to the Storage integration that I could help with, let me know; the more real usage we have, the more meaningful the benchmarks will be!
@AurelienFT Thank you for your enthusiasm.
I would like to reiterate what I wrote above:
Could be worth running benchmarks after the refactoring has been completed, and after having determined that the application spends a significant amount of time in that read/write workflow (either manipulating the map, or waiting on the lock).
For clarity, this means that I don't think we should be making any efforts on this until the following has been reached:
- We have (largely) finished the integration of storage into the project.
- We have noticed that the application spends a significant amount of time waiting for the lock on the top-level map to become available.
Point 2 is especially important: we can always optimize things, but most of that is wasted effort if it does not materially affect the overall performance of the application.
As an example, we could have a 10x improvement on "something", but this would be meaningless if the app spends only 0.00001% of its time doing that "thing".
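A quick Amdahl-style sanity check of that point: with a speedup s on a fraction p of total runtime, the overall speedup is bounded by 1 / ((1 - p) + p / s), which stays close to 1 for tiny p:

```rust
// Upper bound on overall speedup when a fraction `p` of runtime is sped up by `s`.
fn overall_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // Even a generous 1% of total time spent on the top-level map lock,
    // sped up 10x, yields less than a 1% overall gain.
    println!("{:.4}", overall_speedup(0.01, 10.0)); // ~1.0091
}
```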
if you have some tasks to help you on the integration of the storage so that I can help you
Appreciated, the main discussion would be over at https://github.com/massalabs/massa/issues/2178.
I think a good start would be to familiarize yourself with the various issues and currently open PR. I think there will be more specific tasks next week when some initial work is finished.
because the more usage we have the more the bench will be
As per above, the goal is not to optimize this, unless we notice it becomes a problem :)
I have run a simulator with 26 nodes and 100 tx/s on main with the Storage, and on a branch where I implemented DashMap and Flurry; the results are very close, so it doesn't seem to be a bottleneck:
RwLock
DashMap
Flurry
This could be interesting to revisit after Storage is used in the whole codebase.
The performance of the current version is good. Closing