12.5% performance increase without the `Arc<Mutex<...>>` lock around the database
In my testing branch `human-readable-perf`, I replaced the current `Arc<Mutex<Connection>>` pattern with a background thread that receives commands, each carrying a oneshot channel to send the result back.
This replacement increases average insertion performance by 12.5%; see report-.zip generated by wastebin-bench.
TL;DR: from ~40k QPS to ~45k QPS, benchmarked on an i7-12700H running Arch Linux (mainline kernel) with:

```
cargo r -r -- --host http://127.0.0.1:8088 --run-time 30
```
The main cost is code readability (I have not found a less ugly design so far).
Since this is just a performance test, only insert is implemented in the -perf branch.
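
For reference, here is a minimal sketch of the command-plus-oneshot pattern described above, assuming rusqlite; `DbCommand`, the `Insert` variant, and the table layout are hypothetical illustrations, not the actual code in the branch:

```rust
use rusqlite::Connection;
use tokio::sync::oneshot;

// Hypothetical command type; the real branch may model this differently.
enum DbCommand {
    Insert {
        uid: String,
        content: String,
        // The database thread reports the result back through this.
        respond_to: oneshot::Sender<rusqlite::Result<()>>,
    },
}

fn spawn_db_thread(conn: Connection) -> std::sync::mpsc::Sender<DbCommand> {
    let (tx, rx) = std::sync::mpsc::channel::<DbCommand>();
    std::thread::spawn(move || {
        // The thread is the sole owner of the connection, so no
        // Arc<Mutex<...>> is needed.
        for cmd in rx {
            match cmd {
                DbCommand::Insert { uid, content, respond_to } => {
                    let res = conn
                        .execute(
                            "INSERT INTO entries (uid, data) VALUES (?1, ?2)",
                            (&uid, &content),
                        )
                        .map(|_| ());
                    // The requester may have gone away; ignore send errors.
                    let _ = respond_to.send(res);
                }
            }
        }
    });
    tx
}
```

An async handler then sends a `DbCommand` through the (non-blocking, unbounded) sender and awaits the oneshot receiver for the result.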
Alternatively, we could replace `self.conn` with a `Connection` kept in `thread_local!` storage (we obviously do not need multiple database instances inside one wastebin server instance). The mutex would then no longer be needed, and the code could stay clean. Performance should be similar to the flume-based solution.
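
A rough sketch of that alternative, again assuming rusqlite; note that it only fits a file-backed (or shared-cache) database, since with a plain in-memory database each thread would open its own separate database:

```rust
use std::cell::OnceCell;
use rusqlite::Connection;

thread_local! {
    // Each thread lazily opens its own connection to the same
    // database file, so no Mutex is needed.
    static CONN: OnceCell<Connection> = OnceCell::new();
}

// Hypothetical helper; the path is made up for illustration.
fn with_conn<T>(f: impl FnOnce(&Connection) -> T) -> T {
    CONN.with(|cell| {
        let conn = cell.get_or_init(|| {
            Connection::open("wastebin.db").expect("failed to open database")
        });
        f(conn)
    })
}
```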
Can you re-run the test on a single-core machine/VM/container? Just to see how tokio's background threads cope in such a situation compared to a dedicated one.
I will do so when I am available.
BTW, I found a command that can limit the process to a single CPU core:

```
numactl --physcpubind=+0 ./target/release/wastebin
```
> Can you re-run the test on a single-core machine/VM/container? Just to see how tokio's background threads cope in such a situation compared to a dedicated one.
In the single-core scenario, there is no difference between the kanal version and the spawn_blocking version: 20472 vs. 20770 QPS, the latter being the spawn_blocking version. In the multi-core scenario (14c20t), the kanal version is 11% faster: 45458 vs. 50274.89 QPS.
Thanks to @qaqland, with a generic `call()` method we can get the performance gain without a big refactoring effort.
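
A minimal sketch of what such a generic `call()` could look like, using a tokio mpsc channel of boxed closures for illustration (a kanal-based version is analogous); all names here are hypothetical:

```rust
use rusqlite::Connection;
use tokio::sync::{mpsc, oneshot};

// A job is any closure that runs against the connection on the
// dedicated database thread.
type DbJob = Box<dyn FnOnce(&Connection) + Send>;

#[derive(Clone)]
struct DbHandle {
    tx: mpsc::UnboundedSender<DbJob>,
}

impl DbHandle {
    fn new(conn: Connection) -> Self {
        let (tx, mut rx) = mpsc::unbounded_channel::<DbJob>();
        std::thread::spawn(move || {
            // Sole owner of the connection; runs jobs in order.
            while let Some(job) = rx.blocking_recv() {
                job(&conn);
            }
        });
        Self { tx }
    }

    /// Run `f` against the connection on the database thread and
    /// await its result; every query goes through this one method.
    async fn call<T, F>(&self, f: F) -> T
    where
        T: Send + 'static,
        F: FnOnce(&Connection) -> T + Send + 'static,
    {
        let (reply_tx, reply_rx) = oneshot::channel();
        self.tx
            .send(Box::new(move |conn: &Connection| {
                // Ignore the error if the caller stopped waiting.
                let _ = reply_tx.send(f(conn));
            }))
            .expect("database thread has shut down");
        reply_rx.await.expect("database thread dropped the response")
    }
}
```

The nice part is that call sites stay one-liners, e.g. `db.call(|conn| conn.execute("DELETE FROM entries", [])).await`, so the existing query code barely changes.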
Is that also in your branch? Sorry for being late; this week has been a bit scarce time-wise.
Hi! Sorry for the slow response.
Yeah, I have implemented it in https://github.com/mokurin000/wastebin/tree/human-readable-perf-kanal
I tried to cherry-pick the performance patch, but there were too many conflicts, so it is now based on the human-readable branch.
Can you try this branch and check the results? It's a similar design but uses a bog-standard `mpsc` channel. On my system I see even bigger improvements of around 45% rather than 12.5%.
Nice work!
My kanal implementation sends boxed closures, which is more expensive than sending commands but requires less work.
We could also try replacing the tokio mpsc channel with a kanal one? That is currently the fastest channel implementation.
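
For what it's worth, kanal's async side looks close to a drop-in here; a tiny sketch, assuming the `kanal::unbounded_async` interface:

```rust
#[tokio::main]
async fn main() {
    // Async sender and receiver pair from kanal.
    let (tx, rx) = kanal::unbounded_async::<u32>();

    let consumer = tokio::spawn(async move {
        // recv() yields Err once all senders are dropped.
        while let Ok(n) = rx.recv().await {
            println!("got {n}");
        }
    });

    tx.send(42).await.expect("receiver dropped");
    drop(tx); // close the channel so the consumer loop ends
    consumer.await.unwrap();
}
```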
Strange... on my i7-12700H, running Arch Linux with kernel 6.14.6:
| branch | RPS |
|---|---|
| master | 46191.89 |
| kanal | 48767.67 |
| tokio-mpsc | 50692.79 |
Is that due to a CPU difference? For benchmarking, my parameters were a 5-second warmup and a 30-second bench.
> Is that due to a CPU difference?
Perhaps. I get wildly different results on an i7-13700H (20 threads): 16209.17 (master) vs. 24695.38 (mpsc) vs. 33287.72 (kanal). And the differences become smaller for lower user counts. So yeah, I will go for kanal even though it's yet another dependency :-/
Another difference is that my implementation runs the database handler in a spawn_blocking thread that is managed by tokio, and both the server listener and the database handler futures are scheduled with futures_concurrency's `join()` rather than spawned as a tokio task.
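
The rough shape of that arrangement, assuming futures_concurrency's `Join` trait for tuples; `serve()` and the job type are placeholders rather than the actual wastebin code:

```rust
use futures_concurrency::future::Join;
use rusqlite::Connection;
use tokio::sync::mpsc;

type DbJob = Box<dyn FnOnce(&Connection) + Send>;

async fn serve() {
    // The real HTTP listener would run here.
}

async fn run(conn: Connection, mut rx: mpsc::UnboundedReceiver<DbJob>) {
    // The database handler lives on tokio's blocking pool rather
    // than a hand-rolled std::thread.
    let db = tokio::task::spawn_blocking(move || {
        while let Some(job) = rx.blocking_recv() {
            job(&conn);
        }
    });

    // Both futures are polled together; neither is detached as a
    // separate tokio task.
    (serve(), async { db.await.expect("db handler panicked") })
        .join()
        .await;
}
```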
One last thing: these huge numbers are of course only possible with an in-memory database. These changes do not do much when the disk is hammered with writes. But in any case, a lot more reads than writes is probably the norm for a pastebin.