async-std
Shift in performance characteristic between 1.6.2 and master
NOTE: This is not a complaint, just an observation that might be useful for discussing / designing / improving the scheduling model.
Hi, I ran our benchmarks with 1.6.2 and with current master after #836 was merged, and I noticed a rather significant change in performance characteristics.
The following two tables show the benchmark throughput in MB/s at different numbers of concurrent tasks and on different CPU setups.
The benchmarks are run on a Ryzen Threadripper 24-core (48-thread) CPU. It's worth noting that this CPU houses (simplified) 8 CCXs (3 cores / 6 threads each) that share their cache; otherwise they communicate over an internal bus (somewhat acting as different NUMA domains).
So in the 3-core test, the three cores are specifically picked to be on the same CCX, meaning they share their cache; in the 48-core test all cores (or threads) are used, meaning no caches are shared by default.
It looks like in 1.6.2, related tasks were rather likely to be scheduled on the same executor (or an executor on the same CCX), so the 3 steps in the benchmark (deserialize -> process -> serialize) were more often executed on the same CCX, leading to higher performance in the 48-thread case; overall, however, it was less performant (in the 3-core case).
master switches this around: it seems to more often than not schedule related tasks on new cores, leading to (sometimes much) lower performance in the 48-core case but much better performance on the limited core set with the shared cache.
We observed the same with threads (to a stronger degree): threads outperformed tasks on the 3-core setup, but unless pinned they would be scheduled around so much that the loss of cache would obliterate performance.
It seems that master is very friendly to UMA systems (or generally systems where all cores share a cache) but less friendly to NUMA systems that don't have that luxury (counting the Ryzen platform here, even though it's a bit of a special case).
1.6.2 (throughput in MB/s)
| # tasks/2 | 3 cores | 48 cores |
|---|---|---|
| 1 | 181.7 | 84.9 |
| 2 | 276.1 | 337.3 |
| 4 | 291.5 | 333.3 |
| 8 | 260.9 | 325.9 |
| 16 | 254.7 | 320.2 |
| 32 | 250.3 | 302.8 |
| 64 | 243.2 | 305.8 |
master (throughput in MB/s)
| # tasks/2 | 3 cores | 48 cores |
|---|---|---|
| 1 | 273.8 | 128.6 |
| 2 | 339.1 | 292.3 |
| 4 | 342.7 | 234.2 |
| 8 | 308.3 | 209.9 |
| 16 | 265.6 | 210.9 |
| 32 | 245.9 | 201.2 |
| 64 | 229.8 | 197.4 |
cc/ @stjepang
Ok this actually seems like great news. I'm pretty confident I know what's going on and am hopeful that we can get the best of both worlds! I'll make an experiment later today and make a branch of async-std so we can compare it to 1.6.2 and master.
Briefly:
- 1.6.2 was notifying executor threads in a silly round-robin manner, while master notifies executor threads in a LIFO manner. This is better because by waking the thread that went to sleep most recently, we have the highest chance of keeping caches warm. I believe this is what contributes to better performance on UMA systems.
- 1.6.2 schedules tasks onto the local executor thread, while master schedules tasks onto the global queue (lots of contention!) because I was lazy when making multitask. I believe this is what makes performance worse on NUMA systems and is what I'll try fixing today. (Both differences are sketched below.)
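To make those two differences concrete, here is a minimal, self-contained sketch. This is not async-std's actual internals; all the names (`Sleepers`, `Worker`, `Task`) are made up for illustration:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Hypothetical task type, just for illustration.
type Task = Box<dyn FnOnce() + Send>;

// Difference 1: which sleeping worker thread do we wake when work arrives?
struct Sleepers {
    ids: Vec<usize>, // sleeping worker ids, most recently parked last
    rr_cursor: usize,
}

impl Sleepers {
    // Round-robin (roughly 1.6.2): cycle through the sleepers, so the
    // thread with the warmest caches is no likelier to be picked.
    fn notify_round_robin(&mut self) -> Option<usize> {
        if self.ids.is_empty() {
            return None;
        }
        let idx = self.rr_cursor % self.ids.len();
        self.rr_cursor += 1;
        Some(self.ids.remove(idx))
    }

    // LIFO (master): wake the most recently parked thread; its caches
    // are the most likely to still be warm.
    fn notify_lifo(&mut self) -> Option<usize> {
        self.ids.pop()
    }
}

// Difference 2: where does a scheduled task go? 1.6.2 pushed onto the
// current worker's own queue (no contention); master pushes onto one
// shared global queue that every worker pops from under a lock.
struct Worker {
    local: VecDeque<Task>,
}

fn schedule_local(worker: &mut Worker, task: Task) {
    worker.local.push_back(task); // stays near this core, cache-friendly
}

fn schedule_global(global: &Mutex<VecDeque<Task>>, task: Task) {
    global.lock().unwrap().push_back(task); // contended across all workers
}

fn main() {
    let mut s = Sleepers { ids: vec![7, 8, 9], rr_cursor: 0 };
    assert_eq!(s.notify_lifo(), Some(9)); // LIFO picks the freshest sleeper
}
```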
> Ok this actually seems like great news. I'm pretty confident I know what's going on and am hopeful that we can get the best of both worlds! I'll make an experiment later today and make a branch of async-std so we can compare it to 1.6.2 and master.
I totally agree :) Again, this wasn't meant as a complaint at all. We are mostly on the left side (UMA + few cores) in production anyway, so the speed boost here is absolutely great!
I just have the Threadripper to work/run benchmarks on and figured the change in behavior was worth sharing; if nothing else it's interesting and perhaps not obvious.
BTW, great work on the changes in master, this comes really, really close to thread performance <3.
As a side note: master is within 10-15% of what we used to get with ideally laid out threaded code :) (i.e. 3 threads on 3 cores, each working over a crossbeam::channel doing roughly the same amount of work). This is really amazing!
@Licenser Could you perhaps try this branch? https://github.com/stjepang/async-std/tree/experiment
Sorry for the late reply, the tests take a while :)
I re-ran the 1.6.2 branch since I did some CPU fiddling yesterday evening and wanted to make sure it's a fair comparison.
The general bump in throughput is due to the aforementioned CPU fiddling, but the delta between the 48-thread variants looks slightly lower.
1.6.2 (throughput in MB/s)
| # tasks/2 | 3 cores | 48 threads |
|---|---|---|
| 1 | 133.0 | 111.4 |
| 2 | 313.1 | 365.8 |
| 4 | 331.9 | 374.2 |
| 8 | 300.3 | 357.2 |
| 16 | 287.6 | 341.5 |
| 32 | 279.4 | 327.8 |
| 64 | 267.1 | 330.5 |
experiment (throughput in MB/s)
| # tasks/2 | 3 cores | 48 threads |
|---|---|---|
| 1 | 381.2 | 138.6 |
| 2 | 421.3 | 324.6 |
| 4 | 378.8 | 253.4 |
| 8 | 353.9 | 244.6 |
| 16 | 324.7 | 239.8 |
| 32 | 315.3 | 233.4 |
| 64 | 319.7 | 227.1 |
ratio 1.6.2 / candidate (48-thread columns; e.g. at 2 tasks, 365.8 / 324.6 ≈ 1.1269)
| # tasks/2 | master | experiment |
|---|---|---|
| 1 | 0.6602 | 0.8038 |
| 2 | 1.1553 | 1.1269 |
| 4 | 1.4231 | 1.4767 |
| 8 | 1.5526 | 1.4603 |
| 16 | 1.5183 | 1.4241 |
| 32 | 1.505 | 1.4045 |
| 64 | 1.5491 | 1.4553 |
perf
For completeness, I added a perf recording of the 64-task / 48-thread run:
```
1,78% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::PointerMayBeCaptured(llvm::Value const*, llvm::CaptureTracker*, unsigned int)::$_0::operator()
1,67% rustc           rustc                             [.] free
1,63% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::Linker::linkInModule
1,19% rustc           rustc                             [.] malloc
1,04% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] combineInstructionsOverFunction
0,93% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::PointerMayBeCaptured
0,71% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::removeUnreachableBlocks
0,71% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::SmallPtrSetImplBase::insert_imp_big
0,69% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] computeKnownBitsFromOperator
0,66% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::ValueHandleBase::AddToUseList
0,65% async-std/runti tremor-server                     [.] rust_alloc
0,60% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::Instruction::getMetadataImpl
0,54% rustc           libc-2.31.so                      [.] __memcmp_avx2_movbe
0,48% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::DILocation::getImpl
0,44% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::SelectionDAG::Combine
0,43% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::MetadataTracking::track
0,42% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::MetadataTracking::untrack
0,40% async-std/runti tremor-server                     [.] simd_json::stage2::<impl simd_json::Deserializer>::build_tape
0,39% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] (anonymous namespace)::CallAnalyzer::analyze
0,39% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::DataLayout::getTypeSizeInBits
0,38% async-std/runti [kernel.kallsyms]                 [k] psi_task_change
0,37% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::PMTopLevelManager::findAnalysisPass
0,36% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::CallBase::onlyReadsMemory
0,36% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::InlineFunction
0,35% async-std/runti libc-2.31.so                      [.] __memmove_avx_unaligned_erms
0,35% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::PMDataManager::removeNotPreservedAnalysis
0,34% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] eliminateDeadStores
0,34% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] (anonymous namespace)::LiveDebugValues::ExtendRanges
0,33% async-std/runti tremor-server                     [.] simd_json::value::borrowed::serialize::FastGenerator::write_json
0,31% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] computeKnownBits
0,31% rustc           libLLVM-10-rust-1.45.0-stable.so  [.] llvm::wouldInstructionBeTriviallyDead
0,30% async-std/runti [kernel.kallsyms]                 [k] delay_mwaitx
0,29% rustc           libc-2.31.so                      [.] __memmove_sse2_unaligned_erms
```
Thank you for such a detailed report! Two more questions:
- When testing with async-std v1.6.2, what is the smol version used (what version appears in `Cargo.lock` or `cargo tree`)?
- Where is this benchmark from? I'd like to run it on my own to investigate further...
No worries, I appreciate the help! smol and async-std are incredible :D
- `cargo tree` says (from the top level it's imported again):
├── async-std v1.6.2
│   └── smol v0.1.18
├── smol v0.3.3
- https://github.com/wayfair-tremor/tremor-runtime/tree/source-trait
you can compile it with `cargo build -p tremor-server --release` and then run `sh bench/scale.sh` to run the stack of tests. It uses `taskset -c 0,1,2` in there (you might need to adjust that to fit your system)
Do you know how to fix this error?
tremor-runtime source-trait $ target/release/tremor-server --no-api -c ./bench/real-workflow-throughput-json-8-8-1.yaml bench/link.yaml
tremor version: 0.9.0
rd_kafka version: 0x000002ff, 1.4.2
[2020-08-04T14:25:57Z ERROR tremor_pipeline::op::runtime::tremor] Error:
1 | use std::array;
| Module `std/array` not found or not readable error in module path:
2 | match event.application of
3 | case "app1" => let $class = "applog_app1", let $rate = 1250, let $dimension = event.application, emit event
[2020-08-04T14:25:57Z ERROR tremor_server] error: Module `std/array` not found or not readable error in module path:
error: Module `std/array` not found or not readable error in module path:
Sorry about that, you've got to tell tremor where its standard library is. This should work:
export TREMOR_PATH=./bench/../tremor-script/lib
Thank you, that worked! :) I haven't made much progress, however, because compilation is really really slow. Do you think it would be possible to make a benchmark that does roughly the same thing, but in a simpler program that is quicker to compile?
I'll give it a try tomorrow morning. A simple deserialize on one side, some script in the middle, and serialize on the other might be good enough (famous last words). I'll see how that goes; it lacks a bit of fidelity but perhaps it models it well enough.
Heya, I added a little (very simplified) test:
cargo build -p runtime-bench --release && strip target/release/runtime-bench
for i in 1 2 4 8 16 32 64 128
do
taskset -c 0,1,2 ./target/release/runtime-bench $i
./target/release/runtime-bench $i
done
Running it and doing a bit of text editing spits out some throughput measurements.
It's not a true replication of the original benchmark: it leaves out the whole return-channel shenanigans we're doing, as well as the mini interpreter we run against each event. The differences are not as big, but I think we can observe the same pattern: in 1.6.2, low core count / shared cache is slower but high core count / no shared cache is faster compared to experiment.
~I'm unsure why it says 0 for a concurrency of 128, the code shouldn't be locking up.~ (this was thanks to the wonders of non-monotonic time <3, I updated the numbers)
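For reference, a minimal sketch of the shape of such a test. This is not the actual runtime-bench code: the message, channel capacity, and round count are made up, it assumes the `async-std` and `async-channel` crates, and it assumes two tasks per pipeline instance (which would match the "# tasks/2" headers above):

```rust
use std::time::Instant;

// Stand-in payload; the real benchmark pushes JSON events through
// deserialize -> process -> serialize stages.
const MSG: &[u8] = br#"{"application":"app1","payload":"0123456789abcdef"}"#;
const ROUNDS: usize = 100_000;

fn main() {
    let concurrency: usize = std::env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(1);

    async_std::task::block_on(async move {
        let start = Instant::now();
        let mut sinks = Vec::new();
        for _ in 0..concurrency {
            // Two chained tasks per pipeline instance: a producer stage
            // feeding a consumer stage over a bounded channel.
            let (tx, rx) = async_channel::bounded::<Vec<u8>>(64);
            async_std::task::spawn(async move {
                for _ in 0..ROUNDS {
                    // stand-in for the deserialization work
                    tx.send(MSG.to_vec()).await.unwrap();
                }
                // dropping tx closes the channel and ends the consumer
            });
            sinks.push(async_std::task::spawn(async move {
                let mut bytes = 0usize;
                while let Ok(msg) = rx.recv().await {
                    bytes += msg.len(); // stand-in for the serialization work
                }
                bytes
            }));
        }
        let mut total = 0usize;
        for sink in sinks {
            total += sink.await;
        }
        let secs = start.elapsed().as_secs_f64();
        println!("{:.1} MB/s", total as f64 / secs / 1_000_000.0);
    });
}
```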
experiment
| concurrency | 3c | 48c |
|---|---|---|
| 1 | 382.3 MB/s | 244.1 MB/s |
| 2 | 486.5 MB/s | 238.6 MB/s |
| 4 | 425.3 MB/s | 199.0 MB/s |
| 8 | 312.6 MB/s | 205.3 MB/s |
| 16 | 282.6 MB/s | 201.6 MB/s |
| 32 | 452.8 MB/s | 193.4 MB/s |
| 64 | 433.4 MB/s | 186.3 MB/s |
| 128 | 398.4 MB/s | 182.9 MB/s |
1.6.2
| concurrency | 3c | 48c |
|---|---|---|
| 1 | 320.2 MB/s | 198.6 MB/s |
| 2 | 275.0 MB/s | 240.3 MB/s |
| 4 | 342.5 MB/s | 262.0 MB/s |
| 8 | 381.5 MB/s | 274.2 MB/s |
| 16 | 376.2 MB/s | 289.4 MB/s |
| 32 | 371.6 MB/s | 330.1 MB/s |
| 64 | 369.3 MB/s | 394.7 MB/s |
| 128 | 381.0 MB/s | 396.2 MB/s |
-- edit below
threads
For the sake of it, I translated the benchmark to threads as well (same repo, `thread-bench`):
| concurrency | 3c | 48c |
|---|---|---|
| 1 | 341.4 MB/s | 146.0 MB/s |
| 2 | 416.6 MB/s | 296.2 MB/s |
| 4 | 284.0 MB/s | 269.9 MB/s |
| 8 | 143.7 MB/s | 203.6 MB/s |
| 16 | 46.2 MB/s | 190.9 MB/s |
| 32 | 17.4 MB/s | 184.6 MB/s |
| 64 | 4.0 MB/s | 164.4 MB/s |
| 128 | 0.8 MB/s | 153.6 MB/s |
code is here:
- https://github.com/wayfair-tremor/tremor-runtime/tree/source-trait/thread-bench
- https://github.com/wayfair-tremor/tremor-runtime/tree/source-trait/runtime-bench
Since it ended up not needing any dependencies on tremor-specific code, I moved this out to its own repo (less cloning / .lock stuff): https://github.com/wayfair-tremor/rt-bench
Some thoughts... Now that async-executor is stable (docs), I wonder if it would be worthwhile to detect NUMA nodes and then create one executor and a thread pool per NUMA node. What do you think, have you perhaps tried that? It may also make sense for async-std itself to figure out the NUMA architecture on the system and then create a thread pool per node.
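A rough sketch of the per-node idea, purely for illustration: the topology is hard-coded (a real implementation would discover it, e.g. via hwloc), and it assumes the `async-executor`, `futures-lite`, and `core_affinity` crates:

```rust
use std::sync::Arc;

use async_executor::Executor;
use futures_lite::future;

fn main() {
    // Pretend topology: two "nodes" of three cores each (think one CCX
    // each); real code would detect this instead of hard-coding it.
    let nodes: Vec<Vec<usize>> = vec![vec![0, 1, 2], vec![3, 4, 5]];
    let mut executors = Vec::new();

    for cores in nodes {
        let ex: Arc<Executor<'static>> = Arc::new(Executor::new());
        for core in cores {
            let ex = ex.clone();
            std::thread::spawn(move || {
                // Pin this worker thread to one of the node's cores so tasks
                // scheduled on this executor stay within one cache domain.
                if let Some(id) = core_affinity::get_core_ids()
                    .and_then(|ids| ids.into_iter().find(|c| c.id == core))
                {
                    core_affinity::set_for_current(id);
                }
                // Run the executor forever on this pinned thread.
                future::block_on(ex.run(future::pending::<()>()));
            });
        }
        executors.push(ex);
    }

    // Related tasks would then be spawned onto the same per-node executor.
    let task = executors[0].spawn(async { 1 + 1 });
    assert_eq!(future::block_on(task), 2);
}
```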
Looking at the numbers again, it seems that 3 cores are almost always faster than 48 cores :) So perhaps being less efficient makes you faster at higher core counts (ironically). I don't have proof for this hypothesis, but it might be why v1.6.2 is winning at the 48c benchmarks.
I've been thinking about that too, but I'm not super sure how to best approach it.
I think to automate that, tasks would need to "understand" how they are related, i.e. if task-a and task-b communicate they probably want to share a NUMA executor; if they don't, they probably don't.
It would be super cool to be able to describe those relations, but to be fair I'm not sure if the overhead of that would slow things down (due to the extra logic etc.).
The whole thing gets a bit more complicated on systems like Ryzen, where NUMA isn't enough to explain the architecture. Even though a CCD exposes itself as a single NUMA node, the two CCXs don't share all their caches, so having a task move between CCXs inside a CCD would still cause a slowdown. I found no way to determine that other than measuring distance with crossbeam channels (see: https://github.com/Licenser/trb), so if there are NUMA-aware executors, perhaps making the grouping of cores pluggable would be nice. That way it would be possible to write discoverers to create groupings.
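The measurement idea boils down to something like this simplified sketch (not the trb code; it assumes the `crossbeam-channel` and `core_affinity` crates, and the core ids are examples): ping-pong between two pinned threads and time the round trips; core pairs within one CCX should show noticeably lower latency.

```rust
use std::time::Instant;

// Pin the current thread to a core id, if it exists on this machine.
fn pin(core: usize) {
    if let Some(id) = core_affinity::get_core_ids()
        .and_then(|ids| ids.into_iter().find(|c| c.id == core))
    {
        core_affinity::set_for_current(id);
    }
}

// Average round-trip time in nanoseconds between cores `a` and `b`.
fn distance(a: usize, b: usize, rounds: u32) -> f64 {
    let (to_b, from_a) = crossbeam_channel::bounded::<u32>(1);
    let (to_a, from_b) = crossbeam_channel::bounded::<u32>(1);
    let echo = std::thread::spawn(move || {
        pin(b);
        for _ in 0..rounds {
            let v = from_a.recv().unwrap();
            to_a.send(v).unwrap();
        }
    });
    pin(a);
    let start = Instant::now();
    for i in 0..rounds {
        to_b.send(i).unwrap();
        from_b.recv().unwrap();
    }
    echo.join().unwrap();
    start.elapsed().as_nanos() as f64 / rounds as f64
}

fn main() {
    // Example pairs: two cores on one CCX vs two far-apart threads.
    let pairs: [(usize, usize); 2] = [(0, 1), (0, 47)];
    for (a, b) in pairs {
        println!("{} <-> {}: {:.0} ns/round-trip", a, b, distance(a, b, 100_000));
    }
}
```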
I'm curious: when deploying this to production, do you use all available cores or just a small number of them (which I would expect to deliver the best performance)?
Some interesting reading:
- NUMA-aware scheduler for Go
- QuickThread is a good example of NUMA-aware scheduler for C++ [...] But it's based on user hints all way down.
- C++ vs Rust: an async Thread-per-Core story
EDIT: fixed the second link
The answer to that is tricky, so the deployment part of the answer will be a bit longer ;)
tremor.rs (where we use smol/async-std) is open source, so we often have no control over the deployment. Internally at Wayfair we currently deploy mostly on VMs with small core counts (2c4t), so they will be on the same NUMA node. (That deployment strategy isn't ideal for everyone.)
We recently moved from threads to async (some write-up on that: https://www.tremor.rs/blog/2020-08-06-to-async-or-not-to-async/ ) with the goal of including parallelism where possible, i.e. instead of having 3 threads on a system we might have 30 tasks. Naive thread implementations don't scale well; async and async-std/smol do a lot of the heavy lifting (yay, thank you! <3).
For benchmarks, we run on both Intel and AMD systems to get a feeling for how the software behaves on each, given the characteristics of the platforms are now different enough that simply optimizing for one might bite you on the other. And the differences are quite interesting, especially since, unlike NUMA, they aren't easily / automatically discoverable.
I've read the 3rd article, can't access the 2nd (thanks Google!), and the 1st one is new to me :) I'll go take a read.
cc @mfelsche