Study alternative containers for Storage
The shared Storage sees few concurrent writes (from a couple of threads) and heavy concurrent reads (from many threads).
We are currently planning on using Arc<RwLock<PreHashMap<BlockId, Arc<RwLock<BlockInfo>>>>> as the underlying storage container.
I recently watched this: https://www.youtube.com/watch?v=HJ-719EGIts
Basically, it could be possible to do most of the work we want lock-free, with much higher performance!
I then looked around and found crates offering this kind of structure:
- https://docs.rs/flurry/latest/flurry/
- https://github.com/xacrimon/dashmap (with comparative benchmark: https://github.com/xacrimon/conc-map-bench )
- https://github.com/jonhoo/left-right
- https://docs.rs/evmap/latest/evmap/
It looks like those structures can yield an order-of-magnitude performance improvement under the right conditions (that look like the conditions we have).
However, depending on the implementation, there could be some unexpected problems such as a temporary lack of read/write consistency.
We need to:
- study the proposed containers (and look for others)
- get a more precise idea of the read/write concurrency and throughput expectations in our architecture. Maybe wait until the first version of Storage is implemented so we can run an actual benchmark
- try changing the underlying Storage container with the various options and bench the impact on performance
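To make the last point concrete, here is a minimal sketch of what swapping the container could look like, assuming the dashmap crate and stand-in BlockId/BlockInfo types (the real ones live in massa_models):

```rust
use std::sync::{Arc, RwLock};

use dashmap::DashMap;

// Stand-in types for illustration; not the real massa_models definitions.
type BlockId = u64;
struct BlockInfo {
    payload: Vec<u8>,
}

fn main() {
    // Current plan: Arc<RwLock<PreHashMap<BlockId, Arc<RwLock<BlockInfo>>>>>.
    // Alternative: drop the outer RwLock; DashMap shards the map internally
    // and only locks the shard it touches.
    let storage: Arc<DashMap<BlockId, Arc<RwLock<BlockInfo>>>> = Arc::new(DashMap::new());

    // No explicit outer lock on either operation.
    storage.insert(1, Arc::new(RwLock::new(BlockInfo { payload: vec![0xab] })));
    if let Some(entry) = storage.get(&1) {
        // Clone the inner Arc so the shard guard is released quickly.
        let block = Arc::clone(entry.value());
        drop(entry);
        println!("{} bytes", block.read().unwrap().payload.len());
    }
}
```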
Basically, it could be possible to do most of the work we want lock-free, with much higher performance!
I am doubtful about the performance gains of these structures.
For example, see the below excerpt from https://www.chromium.org/developers/lock-and-condition-variable/
Programmers assume that locking is expensive, and that using atomic operations will be more efficient. But in reality, acquiring and releasing a lock is cheaper than a cache miss; attention to cache behaviour is usually a more fruitful way to improve performance. Furthermore, lock-free data structures are often more expensive than using locks. This is because a lock allows arbitrary changes to be made to a complex data structure; if the same changes must be made without a lock, the result is likely to take more atomic read-modify-write instructions, not fewer.
In our specific situation, we have the below structure:
Arc<RwLock<Map<BlockId, Arc<RwLock<StoredBlock>>>>>
Most of the "work" will occur while locking an individual StoredBlock, so there should be little contention on the lock around the top-level map.
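A small sketch of that access pattern (with stand-in types, not the actual Storage code): the top-level lock is held only long enough to clone the inner Arc, and the real work then happens under the per-block lock:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical stand-ins for the real massa types.
type BlockId = u64;
struct StoredBlock {
    serialized: Vec<u8>,
}

type Storage = Arc<RwLock<HashMap<BlockId, Arc<RwLock<StoredBlock>>>>>;

fn read_block(storage: &Storage, id: BlockId) -> Option<usize> {
    let block = {
        let map = storage.read().unwrap(); // brief lock on the top-level map
        Arc::clone(map.get(&id)?)          // only clone the handle
    }; // top-level lock released here
    let guard = block.read().unwrap(); // per-block lock for the actual work
    Some(guard.serialized.len())
}

fn main() {
    let storage: Storage = Arc::new(RwLock::new(HashMap::new()));
    storage.write().unwrap().insert(
        1,
        Arc::new(RwLock::new(StoredBlock { serialized: vec![0u8; 3] })),
    );
    assert_eq!(read_block(&storage, 1), Some(3));
}
```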
That lock around the top-level map could be replaced by using some sort of lock-free map, however we should consider:
- Probably losing the optimization of our current pre-hashed maps.
- That going lock-free there probably doesn't really give any relevant performance boost.
- That we also may want the synchronization that comes with a lock. For example if a block is pruned, we may not want it to be able to be read logically after that point. Lock-free structures may have looser synchronization, and/or if they do allow such "hard" synchronization, they might not be any faster than simply a lock around a map.
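To illustrate the synchronization point in the last bullet, a minimal sketch with the lock-based map (stand-in types, not the actual pruning code): once the write lock taken for pruning is released, no later reader can still find the block, which is the "hard" guarantee a loosely consistent lock-free map might not give:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type BlockId = u64;
struct StoredBlock;

fn prune(
    storage: &Arc<RwLock<HashMap<BlockId, Arc<RwLock<StoredBlock>>>>>,
    id: BlockId,
) {
    let mut map = storage.write().unwrap(); // excludes all readers
    map.remove(&id);
    // Write lock released here: any read acquired afterwards misses the block.
}

fn main() {
    let storage = Arc::new(RwLock::new(HashMap::new()));
    storage
        .write()
        .unwrap()
        .insert(7u64, Arc::new(RwLock::new(StoredBlock)));
    prune(&storage, 7);
    assert!(storage.read().unwrap().get(&7).is_none());
}
```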
I don't have a full view of the code, so I can't say whether the synchronization will be a real problem.
But according to this SO answer: https://stackoverflow.com/a/5680988 it may depend on the algorithm and the use case. If we look at this benchmark: https://github.com/xacrimon/conc-map-bench with a lot of reads and few writes, it seems we can get a huge improvement.
I tried to run the benchmark, which is pretty cool: https://github.com/ibraheemdev/conc-map-bench (this fork has a fix for flurry) with sha256 from several crates (bitcoin_hashes and RustCrypto), but they don't implement the Hasher trait.
Following this issue: https://github.com/RustCrypto/hashes/issues/49 it's not a good idea to implement this trait for cryptographic hashes. Also, following the SO answer above (and my own view), the best way is to test in conditions close to the real ones, by updating the existing benchmark at https://github.com/xacrimon/conc-map-bench so that we can specify the number of writing and reading threads and use our own hashes.
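For context, the general trick behind a pre-hashed map is a pass-through Hasher that reuses bytes of an already-hashed key instead of hashing it again. This is only an illustration of the idea; the actual prehash types in massa_models may differ:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Pass-through hasher: each write() call overwrites the state with up to the
// first 8 bytes of the chunk, which is fine when the key is already a
// uniformly distributed cryptographic hash.
#[derive(Default)]
struct PassThroughHasher(u64);

impl Hasher for PassThroughHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        let mut buf = [0u8; 8];
        let n = bytes.len().min(8);
        buf[..n].copy_from_slice(&bytes[..n]);
        self.0 = u64::from_le_bytes(buf);
    }
}

type PreHashMap<K, V> = HashMap<K, V, BuildHasherDefault<PassThroughHasher>>;

fn main() {
    // Keys stand in for already-hashed 32-byte block ids.
    let mut map: PreHashMap<[u8; 32], &str> = PreHashMap::default();
    map.insert([0x42; 32], "block info");
    assert_eq!(map.get(&[0x42; 32]), Some(&"block info"));
}
```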
Btw @damip left-right is a child project of evmap : https://github.com/jonhoo/left-right/commit/abe3769feda942d26b6f24eac99035e1c80e04e0
I have forked and run the benchmark with Massa's hash and the prehash from massa_models. Here are the results:
Scenario Heavy Read: read 98%, insert 1%, remove 1%
Scenario Exchange: read 10%, insert 40%, remove 40%, update 10%
Scenario Rapid Grow: read 5%, insert 80%, remove 5%, update 10%
I am working on a fork of bustle (the crate used in the benchmark) to split read and write operations into dedicated threads and be able to specify how many read/write threads we want.
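This is the shape of the experiment, independent of bustle's own API (not bustle code, just a sketch against a plain RwLock-backed map): dedicated reader and writer threads, so the read/write mix is set by thread counts rather than per-operation probabilities:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Instant;

fn bench(read_threads: usize, write_threads: usize, ops_per_thread: u64) {
    let map = Arc::new(RwLock::new(HashMap::<u64, u64>::new()));
    let start = Instant::now();
    let mut handles = Vec::new();

    // Writer threads: each inserts into its own key range.
    for t in 0..write_threads {
        let map = Arc::clone(&map);
        handles.push(thread::spawn(move || {
            for i in 0..ops_per_thread {
                map.write().unwrap().insert(t as u64 * ops_per_thread + i, i);
            }
        }));
    }
    // Reader threads: repeatedly look up a small set of keys.
    for _ in 0..read_threads {
        let map = Arc::clone(&map);
        handles.push(thread::spawn(move || {
            for i in 0..ops_per_thread {
                let _ = map.read().unwrap().get(&(i % 1024)).copied();
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    println!(
        "{} readers / {} writers: {:?}",
        read_threads,
        write_threads,
        start.elapsed()
    );
}

fn main() {
    bench(9, 2, 100_000);
}
```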
Links :
- Fork of the benchmark : https://github.com/AurelienFT/conc-map-bench
- Fork of bustle : https://github.com/AurelienFT/bustle
@AurelienFT Thank you for your investigations.
It is expected that those maps will show an improvement over locks, since these lock-free structures are specifically designed with that in mind.
The question is whether it will show an improvement in our specific application, which will depend on whether the reading/writing is on the critical path.
Could be worth running benchmarks after the refactoring has been completed, and after having determined that the application spends a significant amount of time in that read/write workflow (either manipulating the map, or waiting on the lock).
The currently proposed API of Storage would make such experimentation quite easy, since the use of the lock around the top-level map is hidden from the user.
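For illustration, a guess at the shape of such an API (names and signatures are assumptions, not the actual proposed Storage): the lock only appears inside the method bodies, so the backing container can be swapped without touching call sites:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type BlockId = u64;
pub struct StoredBlock {
    pub serialized: Vec<u8>,
}

#[derive(Clone, Default)]
pub struct Storage {
    // The only place the top-level lock is visible.
    blocks: Arc<RwLock<HashMap<BlockId, Arc<RwLock<StoredBlock>>>>>,
}

impl Storage {
    pub fn store_block(&self, id: BlockId, block: StoredBlock) {
        self.blocks
            .write()
            .unwrap()
            .insert(id, Arc::new(RwLock::new(block)));
    }

    pub fn retrieve_block(&self, id: &BlockId) -> Option<Arc<RwLock<StoredBlock>>> {
        self.blocks.read().unwrap().get(id).cloned()
    }
}

fn main() {
    let storage = Storage::default();
    storage.store_block(1, StoredBlock { serialized: vec![1, 2, 3] });
    let block = storage.retrieve_block(&1).unwrap();
    assert_eq!(block.read().unwrap().serialized.len(), 3);
}
```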
Could be worth running benchmarks after the refactoring has been completed, and after having determined that the application spends a significant amount of time in that read/write workflow (either manipulating the map, or waiting on the lock).
I totally agree. Testing in real conditions is always best, and having metrics around the node is pretty useful and can show us whether there are real improvements or not.
I have successfully modified the two crates to allow specifying the number of read and write threads. It's a bit "hacky", and if we want to contribute it upstream I will have to refactor it a bit, but I don't think that's our priority right now.
Here are the results:
Scenario Heavy Read: read 98%, insert 1%, remove 1%
Scenario Exchange: read 10%, insert 40%, remove 40%, update 10%
Scenario Rapid Grow: read 5%, insert 80%, remove 5%, update 10%
If you want me to rerun with different data or thread counts, that's possible; you can also run it directly with my forks of the crates:
- Fork of the benchmark : https://github.com/AurelienFT/conc-map-bench
- Fork of bustle : https://github.com/AurelienFT/bustle
To run the benchmark on the fork you can run this command :
./scripts/bench.bash --read-threads="9" --write-threads="2" --read-threads="8" --write-threads="2" --read-threads="7" --write-threads="2" --read-threads="7" --write-threads="3" --read-threads="7" --write-threads="4"
Each --read-threads flag must be paired with a --write-threads flag; each pair defines one benchmark configuration, and as shown above you can pass multiple pairs. (I will try to change this to accept a comma-separated list directly.)
To render the plots :
./scripts/plot.bash
Conclusion
We can see that when we have a lot of reads, DashMap or Flurry could be nice. When we have a lot of writes, the locks start to be more efficient (which is not our case).
Dashmap works with sharding, that's smart, not magic and scalable!! 🧔 🤟
Flurry does too:
This map usually acts as a binned (bucketed) hash table. Each key-value mapping is held in a BinEntry.
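For intuition, a conceptual sketch of the sharding idea (not DashMap's actual internals): one logical map split into independently locked sub-maps, so threads hitting keys in different shards never contend with each other:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    fn new(n_shards: usize) -> Self {
        Self {
            shards: (0..n_shards).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    // Hash the key once to pick the shard that owns it.
    fn shard_of(&self, key: &K) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.shards.len()
    }

    fn insert(&self, key: K, value: V) {
        // Only this shard's lock is taken; the other shards stay available.
        let idx = self.shard_of(&key);
        self.shards[idx].write().unwrap().insert(key, value);
    }

    fn get(&self, key: &K) -> Option<V> {
        self.shards[self.shard_of(key)].read().unwrap().get(key).cloned()
    }
}

fn main() {
    let map = ShardedMap::new(16);
    map.insert("block_a".to_string(), 1u32);
    assert_eq!(map.get(&"block_a".to_string()), Some(1));
}
```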
New benchmark with random data as values instead of 0.
Here are the results:
Scenario Heavy Read: read 98%, insert 1%, remove 1%
Scenario Exchange: read 10%, insert 40%, remove 40%, update 10%
Scenario Rapid Grow: read 5%, insert 80%, remove 5%, update 10%
@gterzian Do you have a branch with the Storage implementation that is used at least once and can be run (not necessarily on the testnet, but on the labnet for example)?
Since there is an interface and the change of storage backend will be transparent to the rest of the code, I will try to make sub-branches of this one with Storage implemented using the DashMap crate and the Flurry crate, so that we will be able to test and benchmark them.
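As a sketch of what such a sub-branch could look like (assuming the dashmap crate and hypothetical stand-in types; the real Storage API may differ), only the private field and the method bodies change:

```rust
use std::sync::{Arc, RwLock};

use dashmap::DashMap;

type BlockId = u64;
pub struct StoredBlock;

#[derive(Clone)]
pub struct Storage {
    // main branch:
    // blocks: Arc<RwLock<HashMap<BlockId, Arc<RwLock<StoredBlock>>>>>,
    // DashMap experiment branch:
    blocks: Arc<DashMap<BlockId, Arc<RwLock<StoredBlock>>>>,
}

impl Storage {
    pub fn new() -> Self {
        Self { blocks: Arc::new(DashMap::new()) }
    }

    // Same signatures as the RwLock-backed version; only the bodies differ.
    pub fn store_block(&self, id: BlockId, block: StoredBlock) {
        self.blocks.insert(id, Arc::new(RwLock::new(block)));
    }

    pub fn retrieve_block(&self, id: &BlockId) -> Option<Arc<RwLock<StoredBlock>>> {
        self.blocks.get(id).map(|entry| Arc::clone(entry.value()))
    }
}

fn main() {
    let storage = Storage::new();
    storage.store_block(1, StoredBlock);
    assert!(storage.retrieve_block(&1).is_some());
}
```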
Also, if you have some tasks related to the Storage integration that I could help with, let me know; the more real usage we have, the more meaningful the benchmarks will be!
@AurelienFT Thank you for your enthusiasm.
I would like to reiterate what I wrote above:
Could be worth running benchmarks after the refactoring has been completed, and after having determined that the application spends a significant amount of time in that read/write workflow (either manipulating the map, or waiting on the lock).
For clarity, this means that I don't think we should be making any efforts on this until the following has been reached:
- We have (largely) finished the integration of storage into the project.
- We have noticed that the application spends a significant amount of time waiting for the lock on the top-level map to become available.
Point 2 is especially important: we can always optimize things, but most of that is wasted effort if it does not materially affect the overall performance of the application.
As an example, we could have a 10x improvement on "something", but this would be meaningless if the app spends only 0.00001% of its time doing that "thing".
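A quick Amdahl-style sanity check of that point: with a speedup s on a fraction p of total runtime, the overall speedup is bounded by 1 / ((1 - p) + p / s), which stays close to 1 for tiny p:

```rust
// Upper bound on overall speedup when a fraction `p` of runtime is sped up by `s`.
fn overall_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // Even a generous 1% of total time spent on the top-level map lock,
    // sped up 10x, yields less than a 1% overall gain.
    println!("{:.4}", overall_speedup(0.01, 10.0)); // ~1.0091
}
```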
if you have some tasks to help you on the integration of the storage so that I can help you
Appreciated, the main discussion would be over at https://github.com/massalabs/massa/issues/2178.
I think a good start would be to familiarize yourself with the various issues and currently open PR. I think there will be more specific tasks next week when some initial work is finished.
because the more usage we have the more the bench will be
As per above, the goal is not to optimize this, unless we notice it becomes a problem :)
I have run a simulator with 26 nodes and 100 tx/s on main with the Storage, and on a branch where I implemented DashMap and Flurry; the results are very close, so it doesn't seem to be a bottleneck:
RwLock
DashMap
Flurry
This could be interesting to revisit after Storage is used in the whole codebase.
The performance of the current version is good. Closing