
Performance degradation ~18-32GB (long running large DB, unreachable unfreed, delete)

Open — chuckg opened this issue on May 23, 2017

We're experiencing significant write performance degradation over time, driven by sustained IO/throughput inside BoltDB. But first, a little about our use case:

  • We're writing a sequential log of data; the majority of our keys are 64-byte integers that generally increase. One bucket holds variable-size payloads (a change stream from Postgres).
  • We clean up old log data once it passes its expiration. In all the stats below, we use a 48-hour retention period and clean up hourly. Deletes are done in batches of 1000 using manual transaction management (a sketch of that pattern follows below).
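
To make the cleanup pattern concrete, here's a minimal sketch of batched deletes with manual transaction management. The bucket name, cutoff key, and helper are illustrative, not our exact code:

```go
package logstore

import (
	"bytes"

	"github.com/boltdb/bolt"
)

// deleteExpired is illustrative, not our exact code: keys below the retention
// cutoff are collected in a read-only transaction, then deleted 1000 at a
// time, each batch in its own manually managed write transaction.
func deleteExpired(db *bolt.DB, bucket, cutoff []byte) error {
	var expired [][]byte
	if err := db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucket)
		if b == nil {
			return nil
		}
		c := b.Cursor()
		for k, _ := c.First(); k != nil && bytes.Compare(k, cutoff) < 0; k, _ = c.Next() {
			// k is only valid for the life of the transaction, so copy it.
			expired = append(expired, append([]byte(nil), k...))
		}
		return nil
	}); err != nil {
		return err
	}

	// Delete in batches of 1000, one write transaction per batch.
	for len(expired) > 0 {
		n := 1000
		if n > len(expired) {
			n = len(expired)
		}
		tx, err := db.Begin(true)
		if err != nil {
			return err
		}
		b := tx.Bucket(bucket)
		for _, k := range expired[:n] {
			if err := b.Delete(k); err != nil {
				_ = tx.Rollback()
				return err
			}
		}
		if err := tx.Commit(); err != nil {
			return err
		}
		expired = expired[n:]
	}
	return nil
}
```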

We currently have 3 buckets:

  • #1 Bucket(64-byte variable key) -> Bucket(64-byte integer key, generally ascending insertion order) -> N bytes (1KB-128KB; think database change log of old row/new row)
  • #2 Bucket(64-byte integer key, generally ascending insertion order) -> 64-byte (max) random value
  • #3 Bucket(64-byte integer key, generally ascending insertion order) -> 64-byte integer value

All buckets use the default 50% fill percent (sketched below).
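
For context, this is roughly what the Bucket #1 layout and fill-percent handling look like; a minimal sketch with made-up names, not our exact code:

```go
package logstore

import "github.com/boltdb/bolt"

// writeChange is an illustrative sketch of the Bucket #1 layout above: a
// top-level bucket keyed by a 64-byte variable key, holding sub-buckets keyed
// by generally ascending integer keys. Names here are made up.
func writeChange(db *bolt.DB, streamKey, seqKey, payload []byte) error {
	return db.Update(func(tx *bolt.Tx) error {
		top, err := tx.CreateBucketIfNotExists([]byte("changes")) // illustrative name
		if err != nil {
			return err
		}
		sub, err := top.CreateBucketIfNotExists(streamKey)
		if err != nil {
			return err
		}
		// FillPercent is left alone, so pages split at bolt.DefaultFillPercent
		// (0.5). For append-mostly keys a higher value, e.g.
		// sub.FillPercent = 0.9, is sometimes used instead.
		return sub.Put(seqKey, payload)
	})
}
```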

After running the server for 4-5 days, we ended up with a 32GB database that pegged the i3.xl we're running it on at ~3-4K IOPS and ~375-400 MB/s of throughput. These levels of IO are sustained once they begin; as expected, restarting the server has no effect. During that time, the log cleaner would have run around 48 times (once per hour). The cleaner removes data by age in the following way:

  • From Bucket #1, it only removes data from the sub-buckets, in ascending key order. The top-level variable keys are maintained and never touched (see the sketch after this list).
  • From Bucket #2, keys are removed in ascending order.
  • From Bucket #3, keys are removed in generally ascending order.
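
Here's a minimal sketch of the Bucket #1 part of that cleanup, again with illustrative names rather than our exact code:

```go
package logstore

import (
	"bytes"

	"github.com/boltdb/bolt"
)

// cleanSubBuckets is an illustrative sketch: the top-level variable keys are
// left alone; inside each sub-bucket, keys are removed in ascending order up
// to the retention cutoff.
func cleanSubBuckets(tx *bolt.Tx, top, cutoff []byte) error {
	parent := tx.Bucket(top)
	if parent == nil {
		return nil
	}

	// Note which sub-buckets exist first (a nil value means the key is a
	// nested bucket); the top-level keys themselves are never deleted.
	var names [][]byte
	c := parent.Cursor()
	for k, v := c.First(); k != nil; k, v = c.Next() {
		if v == nil {
			names = append(names, append([]byte(nil), k...))
		}
	}

	// Remove expired keys from each sub-bucket in ascending order,
	// re-seeking to First after every delete to keep the cursor valid.
	for _, name := range names {
		sub := parent.Bucket(name)
		sc := sub.Cursor()
		for k, _ := sc.First(); k != nil && bytes.Compare(k, cutoff) < 0; k, _ = sc.First() {
			if err := sub.Delete(k); err != nil {
				return err
			}
		}
	}
	return nil
}
```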

Here's a profile of the cleanup batches, showing each batch's duration and the number of rows removed (from the default bucket): [screenshot: screen shot 2017-05-23 at 1 26 31 pm]

I ran the bolt tool against the database at 32GB:

  • check: https://gist.github.com/chuckg/bf847607adaef03b7de19b6c0ff1d9b0#file-bolt-check-32gb -- the check had a number of "unreachable unfreed" errors in the database.
  • stats: https://gist.github.com/chuckg/bf847607adaef03b7de19b6c0ff1d9b0#file-bolt-stats-32gb

Then we compacted the database, which took it to ~17GB:

bash-4.3# bolt compact -o compact.db pglx.db
33956966400 -> 17677443072 bytes (gain=1.92x)
bash-4.3# du -hs *
17G    compact.db
32G    pglx.db

After booting the server against the 17GB database, IOPS/throughput returned to normal for around 12 hours. At that point, everything degraded back to the same level as with the 32GB database:

  • check: https://gist.github.com/chuckg/bf847607adaef03b7de19b6c0ff1d9b0#file-bolt-check-18gb-compacted-run-for-12-hours
  • stats: https://gist.github.com/chuckg/bf847607adaef03b7de19b6c0ff1d9b0#file-bolt-stats-18gb-compacted-run-for-12-hours
  • stats from DB.Stats(), cumulative plus a 10s diff (collected roughly as in the sketch below): https://gist.github.com/chuckg/bf847607adaef03b7de19b6c0ff1d9b0#file-db-stats-18gb-10-second-diff-included
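
The 10s diff in that gist was produced with a small loop along these lines (a sketch; the exact fields we log differ):

```go
package logstore

import (
	"log"
	"time"

	"github.com/boltdb/bolt"
)

// logStats is an illustrative sketch of how the cumulative stats and the 10s
// diff were collected; Stats.Sub is part of the bolt API, but the field
// selection and logging here are just an example.
func logStats(db *bolt.DB) {
	prev := db.Stats()
	for range time.Tick(10 * time.Second) {
		cur := db.Stats()
		diff := cur.Sub(&prev)
		log.Printf("free=%d pending=%d freelist-inuse=%dB writes(10s)=%d write-time(10s)=%v",
			cur.FreePageN, cur.PendingPageN, cur.FreelistInuse,
			diff.TxStats.Write, diff.TxStats.WriteTime)
		prev = cur
	}
}
```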

Finally, here are CPU and allocation profiles from pprof, taken on the ~18GB server while it was pegging IO:

  • cpu profile: https://rawgit.com/chuckg/bf847607adaef03b7de19b6c0ff1d9b0/raw/b2f56542a36799d836ea2ab710fa362454d642ca/18GB.pprof.sample.cpu.svg
  • alloc_objects: https://rawgit.com/chuckg/bf847607adaef03b7de19b6c0ff1d9b0/raw/b2f56542a36799d836ea2ab710fa362454d642ca/18GB.pprof.alloc_objects.alloc_space.svg
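
For reference, here's one way to expose and grab profiles like these; a sketch, not necessarily exactly how ours were captured:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
)

func main() {
	// Expose the profiling endpoints; then, for SVGs like the ones above:
	//   go tool pprof -svg http://localhost:6060/debug/pprof/profile > cpu.svg
	//   go tool pprof -svg -alloc_space http://localhost:6060/debug/pprof/heap > alloc.svg
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```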

The culprit appears to be runtime.memmove and runtime.memclr, which I'm guessing means Bolt is spending most of its time reshuffling our data. Any thoughts on how to address the issue?
