[DO NOT MERGE] perf from moc artifact
Test https://github.com/dfinity/motoko/pull/3955
Note Diffing the performance result against the published result from main branch. Unchanged benchmarks are omitted.
Map
| binary_size | generate 50k | max mem | batch_get 50 | batch_put 50 | batch_remove 50 | |
|---|---|---|---|---|---|---|
| hashmap | 196_227 | 2_387_017_574 | 9_102_052 | 1_293_015 | 689_295_883 | 1_224_704 |
| triemap | 201_786 | 2_286_509_386 | 9_715_900 | 891_925 | 2_111_535 | 1_187_671 |
| rbtree | 198_958 | 2_024_735_614 | 8_902_160 | 787_768 | 1_839_147 | 991_630 |
| splay | 197_868 | 2_305_505_782 | 8_702_096 | 1_258_328 | 2_175_053 | 1_259_845 |
| btree | 235_291 ($\textcolor{green}{-0.02\%}$) | 2_116_952_097 ($\textcolor{green}{-0.25\%}$) | 7_556_172 | 936_630 | 1_948_829 ($\textcolor{green}{-0.31\%}$) | 991_671 |
| zhenya_hashmap | 190_491 | 1_855_281_618 | 9_301_800 | 745_902 | 1_651_310 | 752_198 |
| btreemap_rs | 514_775 | 115_994_744 | 1_638_400 | 59_433 | 137_855 | 61_795 |
| hashmap_rs | 502_772 | 53_333_947 | 1_835_008 | 21_070 | 63_601 | 22_484 |
Priority queue
Note Same as main branch, skipping.
MoVM
Note Same as main branch, skipping.
Basic DAO
| binary_size | init | transfer_token | submit_proposal | vote_proposal | |
|---|---|---|---|---|---|
| Motoko | 291_432 ($\textcolor{green}{-0.02\%}$) | 44_505 ($\textcolor{red}{0.13\%}$) | 20_039 ($\textcolor{red}{0.58\%}$) | 14_077 ($\textcolor{green}{-0.01\%}$) | 16_739 ($\textcolor{red}{0.35\%}$) |
| Rust | 940_461 | 541_441 | 102_463 | 125_485 | 137_030 |
DIP721 NFT
Note Same as main branch, skipping.
Heartbeat
| binary_size | heartbeat | |
|---|---|---|
| Motoko | 156_821 | 5_324 ($\textcolor{green}{-40.70\%}$) |
| Rust | 35_608 | 1_127 ($\textcolor{red}{91.99\%}$) |
Timer
Note Same as main branch, skipping.
Note The flamegraph link only works after you merge. Unchanged benchmarks are omitted.
Collection libraries
Measure different collection libraries written in both Motoko and Rust.
The library names with _rs suffix are written in Rust; the rest are written in Motoko.
We use the same random number generator with fixed seed to ensure that all collections contain the same elements, and the queries are exactly the same. Below we explain the measurements of each column in the table:
- generate 50k. Insert 50k Nat32 integers into the collection. For Motoko collections, it usually triggers the GC; the rest of the column are not likely to trigger GC.
- max mem. For Motoko, it reports
rts_max_live_sizeaftergeneratecall; For Rust, it reports the Wasm's memory page * 32Kb. - batch_get 50. Find 50 elements from the collection.
- batch_put 50. Insert 50 elements to the collection.
- batch_remove 50. Remove 50 elements from the collection.
💎 Takeaways
- The platform only charges for instruction count. Data structures which make use of caching and locality have no impact on the cost.
- We have a limit on the maximal cycles per round. This means asymptotic behavior doesn't matter much. We care more about the performance up to a fixed N. In the extreme cases, you may see an
O(10000 nlogn)algorithm hitting the limit, while anO(n^2)algorithm runs just fine. - Amortized algorithms/GC may need to be more eager to avoid hitting the cycle limit on a particular round.
- Rust costs more cycles to process complicated Candid data, but it is more efficient in performing core computations.
Note
- The Candid interface of the benchmark is minimal, therefore the serialization cost is negligible in this measurement.
- Due to the instrumentation overhead and cycle limit, we cannot profile computations with large collections. Hopefully, when deterministic time slicing is ready, we can measure the performance on larger memory footprint.
hashmapuses amortized data structure. When the initial capacity is reached, it has to copy the whole array, thus the cost ofbatch_put 50is much higher than other data structures.hashmap_rsuses thefxhashcrate, which is the same asstd::collections::HashMap, but with a deterministic hasher. This ensures reproducible result.btreecomes from Byron Becker's stable BTreeMap library.zhenya_hashmapcomes from Zhenya Usenko's stable HashMap library.- The MoVM table measures the performance of an experimental implementation of Motoko interpreter. External developers can ignore this table for now.
Map
| binary_size | generate 50k | max mem | batch_get 50 | batch_put 50 | batch_remove 50 | |
|---|---|---|---|---|---|---|
| hashmap | 196_227 | 2_387_017_574 | 9_102_052 | 1_293_015 | 689_295_883 | 1_224_704 |
| triemap | 201_786 | 2_286_509_386 | 9_715_900 | 891_925 | 2_111_535 | 1_187_671 |
| rbtree | 198_958 | 2_024_735_614 | 8_902_160 | 787_768 | 1_839_147 | 991_630 |
| splay | 197_868 | 2_305_505_782 | 8_702_096 | 1_258_328 | 2_175_053 | 1_259_845 |
| btree | 235_291 | 2_116_952_097 | 7_556_172 | 936_630 | 1_948_829 | 991_671 |
| zhenya_hashmap | 190_491 | 1_855_281_618 | 9_301_800 | 745_902 | 1_651_310 | 752_198 |
| btreemap_rs | 514_775 | 115_994_744 | 1_638_400 | 59_433 | 137_855 | 61_795 |
| hashmap_rs | 502_772 | 53_333_947 | 1_835_008 | 21_070 | 63_601 | 22_484 |
Priority queue
| binary_size | heapify 50k | mem | pop_min 50 | put 50 | |
|---|---|---|---|---|---|
| heap | 181_726 | 793_253_862 | 1_400_024 | 385_321 | 822_756 |
| heap_rs | 473_458 | 5_041_433 | 819_200 | 53_243 | 22_092 |
MoVM
| binary_size | generate 10k | max mem | batch_get 50 | batch_put 50 | batch_remove 50 | |
|---|---|---|---|---|---|---|
| hashmap | 196_227 | 477_464_161 | 1_820_844 | 1_291_042 | 138_897_096 | 1_222_118 |
| hashmap_rs | 502_772 | 10_984_247 | 950_272 | 20_385 | 62_907 | 21_374 |
| imrc_hashmap_rs | 513_980 | 19_919_391 | 1_572_864 | 31_519 | 120_207 | 37_618 |
| movm_rs | 2_092_441 | 1_017_324_475 | 2_654_208 | 2_494_635 | 6_477_172 | 5_106_080 |
| movm_dynamic_rs | 2_295_206 | 496_274_407 | 2_129_920 | 1_951_981 | 2_709_572 | 1_950_006 |
Sample Dapps
Measure the performance of some typical dapps:
- Basic DAO,
with
heartbeatdisabled to make profiling easier. We have a separate benchmark to measure heartbeat performance. - DIP721 NFT
Note
- The cost difference is mainly due to the Candid serialization cost.
- Motoko statically compiles/specializes the serialization code for each method, whereas in Rust, we use
serdeto dynamically deserialize data based on data on the wire.- We could improve the performance on the Rust side by using parser combinators. But it is a challenge to maintain the ergonomics provided by
serde.- For real-world applications, we tend to send small data for each endpoint, which makes the Candid overhead in Rust tolerable.
Basic DAO
| binary_size | init | transfer_token | submit_proposal | vote_proposal | |
|---|---|---|---|---|---|
| Motoko | 291_432 | 44_505 | 20_039 | 14_077 | 16_739 |
| Rust | 940_461 | 541_441 | 102_463 | 125_485 | 137_030 |
DIP721 NFT
| binary_size | init | mint_token | transfer_token | |
|---|---|---|---|---|
| Motoko | 244_622 | 13_379 | 24_678 | 5_358 |
| Rust | 1_005_637 | 144_162 | 375_896 | 94_757 |
Heartbeat / Timer
Measure the cost of empty heartbeat and timer job.
setTimermeasures both thesetTimer(0)method and the execution of empty job.- It is not easy to reliably capture the above events in one flamegraph, as the implementation detail
of the replica can affect how we measure this. Typically, a correct flamegraph contains both
setTimerandcanister_global_timerfunction. If it's not there, we may need to adjust the script.
Heartbeat
| binary_size | heartbeat | |
|---|---|---|
| Motoko | 156_821 | 5_324 |
| Rust | 35_608 | 1_127 |
Timer
| binary_size | setTimer | cancelTimer | |
|---|---|---|---|
| Motoko | 164_347 | 19_476 | 1_907 |
| Rust | 524_361 | 55_152 | 10_417 |
@chenyan-dfinity
against the published result from main branch
it is unclear if this refers to this repo's main branch or dfinity/motoko's master branch.
If it is the former, I would have expected the second run (baseline 0.8.8) to show less improvement.
I just rerun the CI, it looks good now. It's against this repo's main branch. The main branch needs some time to update the report after the PR is merged. You probably trigger the CI too soon.