[lmdb #3] Add LMDB storage backend
This commit adds an LMDB-based storage backend.
Why?
- It's faster than RocksDB (see below)
- It has no dependencies (all the code is bundled, just a couple of .c files).
- No slow compaction stage during DB creation.
- Backup is just copying the DB, so restore can be instantaneous once the DB is downloaded.
- The size of the on-disk DB may actually be smaller than with RocksDB, but this needs more experiments.
How do you use it?
Adding --lmdb to the glean or glean-server command line tells it to use LMDB for any DBs created. If this option is given, Glean can still work with existing RocksDB DBs, including restoring them from remote storage (and vice versa: if --lmdb is not given, restoring LMDB DBs from remote storage is still supported).
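For example (the DB root path and any other flags here are just illustrative; --lmdb is the only new option):

glean-server --db-root /path/to/dbs --lmdb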
How does it perform?
Testing indexing of Stackage using the Haskell indexer.
** index
lmdb: 189s, 2.6GB
rocksdb: 216s, 0.9GB
** backup
lmdb: 5.7s, 0.8GB
rocksdb: 1.6s, 0.9GB
** restore
lmdb: 2.6s, 2.1GB
rocksdb: 2.15s, 0.9GB
Note that backup with LMDB just copies the DB and compresses it; most of the time is spent in compression, which can be tuned towards either faster/less compression or slower/more compression. We're currently using zstd level 8, which seems a reasonable compromise.
Also note that backup produces a squashfs image which can be mounted directly (instantaneous restore, with decompression on the fly); see #669.
C++ Indexing
I indexed all of LLVM + Clang. This is the time to write the data to the DB and do deriving (since otherwise the indexing time is dominated by the C++ indexer itself):
- lmdb: 30.8s, 754MB
- rocksdb: 52.4s, 256MB
Backup:
- lmdb: 1.8s, 213MB
- rocksdb: 0.8s, 252MB
Restore:
- lmdb: 1.14s, 213MB (using db_lmdb_restore_unpack = false, see #669)
- rocksdb: 1.14s, 252MB
Benchmarking with query-bench
* RocksDB (9 Jan)
benchmarking complex-parallel
time 4.722 s (4.188 s .. 5.182 s)
0.998 R² (0.998 R² .. 1.000 R²)
mean 5.051 s (4.859 s .. 5.403 s)
std dev 341.9 ms (16.30 ms .. 415.4 ms)
variance introduced by outliers: 20% (moderately inflated)
benchmarking simple
time 35.87 μs (35.32 μs .. 36.95 μs)
0.989 R² (0.978 R² .. 0.995 R²)
mean 41.35 μs (39.91 μs .. 42.80 μs)
std dev 5.186 μs (4.647 μs .. 6.200 μs)
variance introduced by outliers: 89% (severely inflated)
benchmarking nested
time 55.91 μs (55.55 μs .. 56.57 μs)
0.993 R² (0.981 R² .. 1.000 R²)
mean 57.59 μs (56.00 μs .. 61.65 μs)
std dev 7.433 μs (1.047 μs .. 13.81 μs)
variance introduced by outliers: 89% (severely inflated)
benchmarking page
time 21.88 ms (20.42 ms .. 23.41 ms)
0.952 R² (0.876 R² .. 0.991 R²)
mean 22.14 ms (21.08 ms .. 24.16 ms)
std dev 3.260 ms (1.508 ms .. 5.239 ms)
variance introduced by outliers: 63% (severely inflated)
benchmarking pagenested
time 43.15 ms (41.51 ms .. 45.74 ms)
0.991 R² (0.975 R² .. 1.000 R²)
mean 42.00 ms (41.28 ms .. 43.37 ms)
std dev 1.910 ms (749.0 μs .. 2.976 ms)
variance introduced by outliers: 13% (moderately inflated)
benchmarking compile
time 328.2 μs (326.4 μs .. 330.4 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 327.3 μs (326.6 μs .. 328.4 μs)
std dev 2.961 μs (2.361 μs .. 4.059 μs)
benchmarking compile2
time 95.93 μs (95.46 μs .. 96.41 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 96.06 μs (95.72 μs .. 96.69 μs)
std dev 1.519 μs (865.7 ns .. 2.440 μs)
benchmarking compile3
time 48.67 μs (48.44 μs .. 48.90 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 48.87 μs (48.71 μs .. 49.12 μs)
std dev 670.7 ns (545.1 ns .. 856.4 ns)
benchmarking array_prefix
time 306.2 ms (274.9 ms .. 337.6 ms)
0.997 R² (0.987 R² .. 1.000 R²)
mean 310.7 ms (306.6 ms .. 321.7 ms)
std dev 8.283 ms (233.8 μs .. 10.44 ms)
variance introduced by outliers: 16% (moderately inflated)
benchmarking stacked/page
time 23.02 ms (21.20 ms .. 24.57 ms)
0.977 R² (0.956 R² .. 0.992 R²)
mean 21.80 ms (20.96 ms .. 22.68 ms)
std dev 1.934 ms (1.618 ms .. 2.536 ms)
variance introduced by outliers: 38% (moderately inflated)
benchmarking stacked/pagenested
time 41.75 ms (40.69 ms .. 43.81 ms)
0.995 R² (0.988 R² .. 0.999 R²)
mean 41.74 ms (41.10 ms .. 42.64 ms)
std dev 1.536 ms (1.034 ms .. 1.913 ms)
* LMDB (9 Jan)
benchmarking complex-parallel
time 3.060 s (2.493 s .. 3.351 s)
0.996 R² (0.991 R² .. 1.000 R²)
mean 3.680 s (3.401 s .. 4.208 s)
std dev 516.3 ms (6.608 ms .. 613.0 ms)
variance introduced by outliers: 24% (moderately inflated)
benchmarking simple
time 31.78 μs (31.26 μs .. 32.59 μs)
0.994 R² (0.992 R² .. 0.996 R²)
mean 35.66 μs (34.61 μs .. 36.79 μs)
std dev 3.901 μs (3.329 μs .. 4.924 μs)
variance introduced by outliers: 86% (severely inflated)
benchmarking nested
time 49.22 μs (48.66 μs .. 50.16 μs)
0.996 R² (0.993 R² .. 0.998 R²)
mean 50.49 μs (49.64 μs .. 51.86 μs)
std dev 3.597 μs (2.639 μs .. 4.710 μs)
variance introduced by outliers: 71% (severely inflated)
benchmarking page
time 15.30 ms (14.10 ms .. 16.63 ms)
0.963 R² (0.933 R² .. 0.983 R²)
mean 15.70 ms (15.11 ms .. 16.45 ms)
std dev 1.653 ms (1.265 ms .. 2.442 ms)
variance introduced by outliers: 52% (severely inflated)
benchmarking pagenested
time 24.94 ms (22.29 ms .. 27.82 ms)
0.947 R² (0.903 R² .. 0.981 R²)
mean 22.95 ms (21.91 ms .. 24.36 ms)
std dev 2.825 ms (2.143 ms .. 3.701 ms)
variance introduced by outliers: 53% (severely inflated)
benchmarking compile
time 349.6 μs (339.6 μs .. 363.0 μs)
0.991 R² (0.987 R² .. 0.995 R²)
mean 350.7 μs (342.1 μs .. 360.2 μs)
std dev 29.73 μs (25.58 μs .. 36.64 μs)
variance introduced by outliers: 72% (severely inflated)
benchmarking compile2
time 97.12 μs (95.42 μs .. 100.3 μs)
0.989 R² (0.979 R² .. 0.998 R²)
mean 101.6 μs (98.86 μs .. 106.6 μs)
std dev 12.01 μs (7.345 μs .. 21.75 μs)
variance introduced by outliers: 86% (severely inflated)
benchmarking compile3
time 48.31 μs (48.11 μs .. 48.53 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 48.79 μs (48.34 μs .. 50.48 μs)
std dev 2.691 μs (517.0 ns .. 5.640 μs)
variance introduced by outliers: 60% (severely inflated)
benchmarking array_prefix
time 311.7 ms (296.2 ms .. 346.3 ms)
0.996 R² (0.975 R² .. 1.000 R²)
mean 308.2 ms (303.7 ms .. 319.8 ms)
std dev 8.794 ms (1.644 ms .. 11.58 ms)
variance introduced by outliers: 16% (moderately inflated)
benchmarking stacked/page
time 19.68 ms (17.41 ms .. 22.57 ms)
0.929 R² (0.884 R² .. 0.970 R²)
mean 18.42 ms (17.20 ms .. 19.85 ms)
std dev 3.087 ms (2.367 ms .. 4.283 ms)
variance introduced by outliers: 71% (severely inflated)
benchmarking stacked/pagenested
time 21.98 ms (20.72 ms .. 23.49 ms)
0.982 R² (0.965 R² .. 0.991 R²)
mean 23.61 ms (22.42 ms .. 25.52 ms)
std dev 3.453 ms (2.212 ms .. 5.199 ms)
variance introduced by outliers: 63% (severely inflated)
Glass
Running glass-democlient list on 1000 files in the Stackage DB:
rocksdb: 39s
lmdb: 31s
Caveats
The on-disk size of the DBs is about 2x larger than with RocksDB, although the backed-up size is about the same. This is mainly because LMDB doesn't do any compression itself; however, backup includes compression, and the backup can be mounted directly as a squashfs (mounting isn't implemented yet: we need to think about what happens if the server is restarted). I ran the Glass benchmark with the DB mounted as a squashfs and it took 32s vs. 31s, so hardly any difference.
There's no way to control the cache size: LMDB just maps the DB pages into memory and relies on the OS to manage memory usage. It remains to be seen how well this works for a production server with many large DBs.
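For reference, here's a minimal sketch of how an LMDB environment is opened (illustrative, not the actual code; the 64GB map size and the flags are placeholders). The only sizing parameter is the maximum map size, which bounds how large the DB can grow rather than how much memory it uses; which pages stay resident is left entirely to the OS page cache.

// Minimal sketch of opening an LMDB environment. Note there is no
// cache-size knob: mdb_env_set_mapsize only sets an upper bound on the
// DB size (an address-space reservation); the OS page cache decides
// which pages stay resident.
#include <lmdb.h>

int openEnv(const char* path, MDB_env** out) {
  MDB_env* env = nullptr;
  int rc = mdb_env_create(&env);
  if (rc != 0) return rc;
  // Maximum DB size, not a cache size; 64GB here is illustrative.
  rc = mdb_env_set_mapsize(env, 64ULL * 1024 * 1024 * 1024);
  if (rc == 0) rc = mdb_env_open(env, path, MDB_NORDAHEAD, 0644);
  if (rc != 0) { mdb_env_close(env); env = nullptr; }
  *out = env;
  return rc;
}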
Key storage
LMDB doesn't support large keys, so I've used a strategy in which only the first ~2KB of the key is indexed. Searching for a key using a prefix larger than this may involve a linear search, but I think this is reasonable: we should never be looking up a fact using a large key. We can document this as a performance gotcha, just like sorting of fields. Indeed, we could use this same strategy with RocksDB to avoid duplicating large keys in the DB, which could potentially save a lot of space.
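To make the idea concrete, here's a rough sketch of how such a lookup could work against the LMDB C API. This is illustrative rather than the actual implementation: the 2KB prefix size, the fixed-width disambiguator appended to each truncated key (to keep LMDB keys unique), and the value layout ([4-byte full-key length][full key][payload]) are all assumptions.

// Sketch of the truncated-key scheme: only the first MAX_INDEXED bytes
// of a fact key are used in the LMDB key (followed by a fixed-width
// disambiguator, e.g. the fact id, so keys stay unique); the full key is
// stored at the front of the value. An exact lookup positions a cursor at
// the truncated prefix and then scans linearly over the (rare) colliding
// entries, comparing the full key stored in each value.
#include <lmdb.h>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>

constexpr size_t MAX_INDEXED = 2048; // assumed ~2KB indexed prefix

std::optional<std::string> lookupFact(
    MDB_txn* txn, MDB_dbi dbi, const std::string& key) {
  MDB_cursor* cur = nullptr;
  if (mdb_cursor_open(txn, dbi, &cur) != 0) return std::nullopt;

  const std::string trunc = key.substr(0, MAX_INDEXED);
  MDB_val k, v;
  k.mv_size = trunc.size();
  k.mv_data = const_cast<char*>(trunc.data());

  std::optional<std::string> result;
  // Position at the first stored key >= the truncated prefix, then scan
  // while the stored key still starts with that prefix.
  int rc = mdb_cursor_get(cur, &k, &v, MDB_SET_RANGE);
  while (rc == 0 && k.mv_size >= trunc.size() &&
         std::memcmp(k.mv_data, trunc.data(), trunc.size()) == 0) {
    const char* p = static_cast<const char*>(v.mv_data);
    uint32_t fullLen;
    std::memcpy(&fullLen, p, sizeof(fullLen));
    if (fullLen == key.size() &&
        std::memcmp(p + sizeof(fullLen), key.data(), fullLen) == 0) {
      // Exact full key found: the rest of the value is the payload.
      result = std::string(p + sizeof(fullLen) + fullLen,
                           v.mv_size - sizeof(fullLen) - fullLen);
      break;
    }
    rc = mdb_cursor_get(cur, &k, &v, MDB_NEXT); // linear scan over collisions
  }
  mdb_cursor_close(cur);
  return result;
}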