
Statically linked with libsqlite3.a with LTO enabled

Open NobodyXu opened this issue 3 years ago • 13 comments

This PR lets binaries that use rusqlite link statically against a libsqlite3.a compiled with LTO, using linker-plugin-lto.

To compile these binaries (excluding basic_async.rs), just run make -j $(nproc). It will compile sqlite3.c using CFLAGS='-O2 -flto'. The generated binaries will be smaller. ~~though I haven't tested the performance yet, I will add the benchmark below as a comment~~ ~~but only provides minor performance improvements (see comments below)~~ It seems that I didn't enable LTO on the Rust side (see comments below).

To compile basic_async.rs, run cargo build --release --bin basic_async --features async-sql.

This PR might be related to #14

Signed-off-by: Jiahao XU [email protected]

NobodyXu avatar Jul 27 '21 04:07 NobodyXu

This benchmark result is outdated

Tue Jul 27 14:30:28 AEST 2021 [RUST] basic (100_000_000) inserts

real    2m46.459s
user    2m45.606s
sys     0m0.840s
Tue Jul 27 14:33:15 AEST 2021 [RUST] basic_batched (100_000_000) inserts

real    0m24.468s
user    0m23.676s
sys     0m0.790s
Tue Jul 27 14:33:39 AEST 2021 [RUST] basic_batched_wp (100_000_000) inserts

real    2m16.578s
user    2m15.077s
sys     0m1.500s
Tue Jul 27 14:35:56 AEST 2021 [RUST] basic_prep (100_000_000) inserts

real    0m57.312s
user    0m56.442s
sys     0m0.860s
Tue Jul 27 14:36:53 AEST 2021 [RUST] threaded_batched (100_000_000) inserts

real    0m24.522s
user    0m32.507s
sys     0m5.327s
Tue Jul 27 14:37:18 AEST 2021 [RUST] threaded_busy (100_000_000) inserts

real    0m1.614s
user    0m15.816s
sys     0m0.929s
Tue Jul 27 14:37:20 AEST 2021 [RUST] threaded_str_batched (100_000_000) inserts

real    2m13.930s
user    2m28.820s
sys     0m2.238s

It seems that linking against sqlite3 with LTO does provide some minor benefit.

NobodyXu avatar Jul 27 '21 04:07 NobodyXu

I looked at the assembly and found that not all sqlite3_* functions are inlined.

I think this might have something to do with the default profile.release in cargo:

lto = false
panic = 'unwind'

Since lto is disabled by default, this might explain why there isn't much improvement at all.

panic = 'unwind' also has an effect on performance.

NobodyXu avatar Jul 27 '21 05:07 NobodyXu

The latest commit enables lto, sets panic to unwind, and sets codegen-units to 1.

However, I currently don't have any time to benchmark this new commit.
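For readers who want to try the same profile settings without editing Cargo.toml, here is a minimal sketch using Cargo's environment overrides. This is an assumption about how one might reproduce the settings, not necessarily how the commit applies them; the values simply repeat what the comment above states.

```shell
# Assumption: these overrides mirror the [profile.release] keys
# lto, codegen-units and panic; the commit itself may set them in Cargo.toml.
export CARGO_PROFILE_RELEASE_LTO=true
export CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1
export CARGO_PROFILE_RELEASE_PANIC=unwind

# Then build as usual (not run here):
#   cargo build --release
```

Environment overrides take precedence over the manifest, so they are convenient for a one-off comparison against the default profile.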

NobodyXu avatar Jul 27 '21 05:07 NobodyXu

Here's the up-to-date benchmark:

Tue Jul 27 20:49:47 AEST 2021 [RUST] basic (100_000_000) inserts

real    2m49.296s
user    2m48.535s
sys     0m0.750s
Tue Jul 27 20:52:37 AEST 2021 [RUST] basic_batched (100_000_000) inserts

real    0m22.040s
user    0m21.170s
sys     0m0.870s
Tue Jul 27 20:52:59 AEST 2021 [RUST] basic_batched_wp (100_000_000) inserts

real    2m14.802s
user    2m13.190s
sys     0m1.610s
Tue Jul 27 20:55:14 AEST 2021 [RUST] basic_prep (100_000_000) inserts

real    0m52.067s
user    0m51.294s
sys     0m0.770s
Tue Jul 27 20:56:06 AEST 2021 [RUST] busy (100_000_000) inserts

real    0m6.143s
user    0m5.922s
sys     0m0.220s
Tue Jul 27 20:56:12 AEST 2021 [RUST] threaded_batched (100_000_000) inserts

real    0m21.247s
user    0m27.526s
sys     0m5.566s
Tue Jul 27 20:56:33 AEST 2021 [RUST] threaded_busy (100_000_000) inserts

real    0m1.219s
user    0m11.483s
sys     0m0.656s
Tue Jul 27 20:56:35 AEST 2021 [RUST] threaded_str_batched (100_000_000) inserts

real    2m18.352s
user    2m30.856s
sys     0m2.062s

However, after investigation, I still found many sqlite3_* symbols in the generated binary, which means the functions mostly aren't inlined.
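One way to make the symbol check above repeatable is to count the sqlite3_* names in nm output. The helper name and the binary path in the usage comment are hypothetical.

```shell
# Count symbols named sqlite3_* in `nm`-style output read from stdin.
# The symbol name is the last field of each line; an optional leading
# underscore is accepted because some platforms prefix C symbols with one.
count_sqlite3_symbols() {
    awk '$NF ~ /^_?sqlite3_/ { n++ } END { print n + 0 }'
}

# Usage (binary path is hypothetical):
#   nm target/release/basic | count_sqlite3_symbols
```

A remaining symbol shows that an out-of-line copy of the function still exists; it does not by itself prove that every call site goes through that copy, so the disassembly is still worth checking.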

NobodyXu avatar Jul 27 '21 11:07 NobodyXu

~~Strange thing is, when I look into the disassembly using objdump -d, I found that in the main function, there is no call to sqlite3_* function.~~

I think I was mistaken: the main symbol is provided by libstd, as a wrapper around the actual main written in Rust.

NobodyXu avatar Jul 27 '21 11:07 NobodyXu

After enabling linker-plugin-lto for the whole workspace, it seems that cross-language LTO finally works.
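As a rough sketch of what a cross-language LTO build involves, the flags below follow the rustc book's description of linker-plugin LTO with clang and lld. The helper name is hypothetical, and the PR may wire these flags up differently (for example through the Makefile or a workspace-level Cargo config).

```shell
# Build the RUSTFLAGS value for linker-plugin LTO, using clang as the
# linker driver and lld as the linker.
lto_rustflags() {
    printf '%s' '-Clinker-plugin-lto -Clinker=clang -Clink-arg=-fuse-ld=lld'
}

# Usage (not run here; assumes clang, lld and sqlite3.c are available):
#   clang -c -O2 -flto sqlite3.c -o sqlite3.o
#   ar crs libsqlite3.a sqlite3.o
#   RUSTFLAGS="$(lto_rustflags)" cargo build --release
```

The C side has to be compiled with clang rather than gcc here, because the linker plugin needs LLVM bitcode from both languages, and the LLVM versions used by clang and rustc need to be compatible.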

Here's the benchmark:

Wed Aug  4 11:33:50 IST 2021 [PYTHON] running basic (10_000_000) inserts

real    2m40.541s
user    2m39.687s
sys     0m0.850s
Wed Aug  4 11:36:31 IST 2021 [PYTHON] running basic_batched (10_000_000) inserts

real    0m22.865s
user    0m22.014s
sys     0m0.850s
Wed Aug  4 11:36:54 IST 2021 [PYTHON] running basic_batched_wp (10_000_000) inserts

real    2m19.619s
user    2m18.259s
sys     0m1.360s
Wed Aug  4 11:39:13 IST 2021 [PYTHON] running basic_prep (10_000_000) inserts

real    0m53.305s
user    0m52.571s
sys     0m0.730s
Wed Aug  4 11:40:07 IST 2021 [PYTHON] running busy (10_000_000) inserts

real    0m5.994s
user    0m5.622s
sys     0m0.350s
Wed Aug  4 11:40:13 IST 2021 [PYTHON] running threaded_batched (10_000_000) inserts

real    0m21.945s
user    0m29.315s
sys     0m5.388s
Wed Aug  4 11:40:35 IST 2021 [PYTHON] running threaded_str_batched (10_000_000) inserts

real    2m15.994s
user    2m30.199s
sys     0m2.240s

There isn't much improvement, so I will use flamegraph to profile the binaries.

NobodyXu avatar Aug 04 '21 06:08 NobodyXu

Here are the flamegraphs for the rust binaries.

NobodyXu avatar Aug 04 '21 06:08 NobodyXu

basic flamegraph

NobodyXu avatar Aug 04 '21 07:08 NobodyXu

basic_batched flamegraph

NobodyXu avatar Aug 04 '21 07:08 NobodyXu

basic_batched_wp flamegraph

NobodyXu avatar Aug 04 '21 07:08 NobodyXu

basic_prep flamegraph

NobodyXu avatar Aug 04 '21 07:08 NobodyXu

threaded_batched flamegraph

NobodyXu avatar Aug 04 '21 07:08 NobodyXu

threaded_str_batched flamegraph

NobodyXu avatar Aug 04 '21 07:08 NobodyXu