
12.5% performance increase without the `Arc<Mutex<...>>` lock of the database

mokurin000 opened this issue 8 months ago • 6 comments

In my testing branch human-readable-perf, I replaced the current Arc<Mutex<Connection>> pattern with a background thread that receives commands, each carrying a oneshot channel to send the result back.

This replacement increases the average insertion performance by 12.5%; see report-.zip generated by wastebin-bench.

TL;DR: from ~40k QPS to ~45k QPS, benchmarked on an i7-12700H running Arch Linux (mainline kernel) with:

cargo r -r -- --host http://127.0.0.1:8088 --run-time 30

The main cost is code readability (I have not found a less ugly design so far).

As this is just a performance test, only insertion is implemented in the -perf branch.
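
A minimal sketch of the pattern described above, for illustration only; the DbCommand enum, table schema, and function names are assumptions, not the actual code in the branch (which the later comments say uses kanal rather than std mpsc):

```rust
use tokio::sync::oneshot;

// Commands handled by the dedicated database thread; each carries a oneshot
// sender so the result can be returned to the async caller.
enum DbCommand {
    Insert {
        id: String,
        text: String,
        reply: oneshot::Sender<rusqlite::Result<usize>>,
    },
}

// Spawn the background thread that owns the Connection exclusively,
// so no Mutex is required.
fn spawn_db_thread(conn: rusqlite::Connection) -> std::sync::mpsc::Sender<DbCommand> {
    let (tx, rx) = std::sync::mpsc::channel::<DbCommand>();
    std::thread::spawn(move || {
        while let Ok(cmd) = rx.recv() {
            match cmd {
                DbCommand::Insert { id, text, reply } => {
                    let res = conn.execute(
                        "INSERT INTO entries (id, text) VALUES (?1, ?2)",
                        rusqlite::params![id, text],
                    );
                    // The requester may have gone away; ignore send errors.
                    let _ = reply.send(res);
                }
            }
        }
    });
    tx
}
```

An async handler would then send a DbCommand through the returned sender and await the matching oneshot receiver for the result.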

mokurin000 avatar Apr 06 '25 15:04 mokurin000

Alternatively, we could replace self.conn with a Connection stored in thread_local! (obviously we do not need multiple database instances inside one wastebin server instance). The mutex would then no longer be needed, and the code could stay clean. Performance should be similar to the flume-based solution.
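
For illustration, a thread_local! version might look like the following sketch; the database path and table schema are placeholders, and it relies on rusqlite's Connection methods taking &self, so no RefCell or Mutex is needed:

```rust
use rusqlite::Connection;

thread_local! {
    // One Connection per thread, opened lazily on first use.
    static CONN: Connection =
        Connection::open("wastebin.db").expect("failed to open database");
}

fn insert(id: &str, text: &str) -> rusqlite::Result<usize> {
    CONN.with(|conn| {
        conn.execute(
            "INSERT INTO entries (id, text) VALUES (?1, ?2)",
            rusqlite::params![id, text],
        )
    })
}
```

Note that under tokio this would open one connection per worker (and per spawn_blocking) thread.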

mokurin000 avatar Apr 06 '25 16:04 mokurin000

Can you re-run the test on a single-core machine/VM/container? Just to see how tokio's background threads deal with such a situation compared to a dedicated one.

matze avatar Apr 06 '25 17:04 matze

Can you re-run the test on a single-core machine/VM/container? Just to see how tokio's background threads deal with such a situation compared to a dedicated one.

I will do so when I am available.

BTW, I found a command that can limit the process to a single CPU core:

numactl --physcpubind=+0 ./target/release/wastebin

mokurin000 avatar Apr 06 '25 17:04 mokurin000

Can you re-run the test on a single-core machine/VM/container? Just to see how tokio's background threads deal with such a situation compared to a dedicated one.

In the single-core scenario there is no difference between the kanal version and the spawn_blocking version: 20472 vs 20770 QPS, the latter being the spawn_blocking version. In the multi-core scenario (14c20t), the kanal version is 11% faster: 45458 QPS vs 50274.89 QPS.

Thanks to @qaqland, with a generic call() method we can get the performance gain without a big refactoring effort.
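
For reference, a generic call() that sends boxed closures could be sketched like this; the Database struct and channel setup are illustrative assumptions, not the exact code in the branch:

```rust
// A closure that runs on the database thread with exclusive Connection access.
type Job = Box<dyn FnOnce(&rusqlite::Connection) + Send>;

struct Database {
    tx: kanal::Sender<Job>,
}

impl Database {
    /// Run `f` on the dedicated database thread and await its result.
    async fn call<F, R>(&self, f: F) -> R
    where
        F: FnOnce(&rusqlite::Connection) -> R + Send + 'static,
        R: Send + 'static,
    {
        let (reply_tx, reply_rx) = tokio::sync::oneshot::channel();
        let job: Job = Box::new(move |conn| {
            // A send error only means the caller is no longer waiting.
            let _ = reply_tx.send(f(conn));
        });
        // With an unbounded channel, the synchronous send never blocks.
        self.tx.send(job).expect("database thread has terminated");
        reply_rx.await.expect("database thread dropped the reply")
    }
}
```

Callers would then write something like `db.call(|conn| conn.execute(...)).await`.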

mokurin000 avatar Apr 07 '25 17:04 mokurin000

Thanks to @qaqland, with a generic call() method we can get the performance gain without a big refactoring effort.

Is it also in your branch? Sorry for being late, time has been a bit scarce this week.

matze avatar Apr 12 '25 11:04 matze

Thanks to @qaqland, with a generic call() method we can get the performance gain without a big refactoring effort.

Is it also in your branch? Sorry for being late, time has been a bit scarce this week.

Hi! Sorry for the slow response.

Yeah, I have implemented it in https://github.com/mokurin000/wastebin/tree/human-readable-perf-kanal

I tried to cherry-pick the performance patch, but there were too many conflicts, so it is now based on the human-readable branch.

mokurin000 avatar Apr 14 '25 06:04 mokurin000

Can you try this branch and check the results? It's a similar design but uses a bog standard mpsc channel. On my system I see even better improvements of around 45% rather than 12.5%.
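
For comparison, the mpsc-based handler loop could look roughly like the following sketch, reusing the hypothetical DbCommand enum from above; this is not the code from the actual branch:

```rust
fn handle_commands(
    conn: rusqlite::Connection,
    mut rx: tokio::sync::mpsc::UnboundedReceiver<DbCommand>,
) {
    // blocking_recv() is fine here because this loop runs on a dedicated
    // blocking thread, never inside the async runtime itself.
    while let Some(cmd) = rx.blocking_recv() {
        match cmd {
            DbCommand::Insert { id, text, reply } => {
                let res = conn.execute(
                    "INSERT INTO entries (id, text) VALUES (?1, ?2)",
                    rusqlite::params![id, text],
                );
                let _ = reply.send(res);
            }
        }
    }
}
```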

matze avatar May 17 '25 20:05 matze

Can you try this branch and check the results? It's a similar design but uses a bog standard mpsc channel. On my system I see even better improvements of around 45% rather than 12.5%.

Nice work!

My kanal implementation sends boxed closures, which is more expensive than sending commands, but requires less work.

We could also try replacing the tokio mpsc channel with a kanal one? That is currently the fastest channel.

mokurin000 avatar May 18 '25 02:05 mokurin000

Strange... on my i7-12700H, running Arch Linux with kernel 6.14.6:

| branch | RPS |
| --- | --- |
| master | 46191.89 |
| kanal | 48767.67 |
| tokio-mpsc | 50692.79 |

Is that due to the CPU difference? For benchmarking, my parameters are 5 seconds of warmup and a 30-second bench.

mokurin000 avatar May 18 '25 04:05 mokurin000

Is that due to the CPU difference?

Perhaps. I get wildly different results on an i7-13700H (20 threads): 16209.17 (master) vs 24695.38 (mpsc) vs 33287.72 (kanal). And the differences become smaller for smaller user counts. So yeah, I will go for kanal even though it's yet another dependency :-/

Another difference is that my implementation runs the database handler in a spawn_blocking thread managed by tokio, and both the server listener and database handler futures are scheduled with futures_concurrency's join() rather than being spawned as a tokio task.
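
A rough sketch of that scheduling, assuming placeholder serve() and handle_commands() functions:

```rust
use futures_concurrency::prelude::*;

async fn serve() {
    // Placeholder for the HTTP listener future.
}

fn handle_commands() {
    // Placeholder for the blocking database command loop.
}

#[tokio::main]
async fn main() {
    // The database handler runs on tokio's blocking thread pool...
    let db = tokio::task::spawn_blocking(handle_commands);
    // ...and both futures are polled together via futures_concurrency's
    // join() instead of being spawned as a separate tokio task.
    let (_, db_result) = (serve(), db).join().await;
    db_result.expect("database handler panicked");
}
```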

matze avatar May 18 '25 09:05 matze

One last thing: these huge numbers are of course only possible with an in-memory database. These changes do not do much when the disk is hammered with writes. But in any case, a lot more reads than writes is probably the norm for a pastebin.

matze avatar May 18 '25 13:05 matze