
Performance problems with large freelists

Open josharian opened this issue 8 years ago • 9 comments

This is a follow-up to #636.

I have a large db, now at ~70GB. It contains a mix of large (>= page size) and small values. The length of the freelist is currently 1043564. For every db update, the freelist gets copied once or twice: possibly at the beginning to merge in pages from old transactions (freelist.release) and always at the end to commit (freelist.write). That means that for every update, bolt must allocate 8MB or 16MB, and then flush 2043 pages of freelist to disk. That is slow.
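
To make that cost concrete, here is a rough back-of-the-envelope in Go, assuming bolt's 8-byte page ids and its default 4096-byte page size (the 2043-page figure above presumably also counts per-page headers and overflow):

```go
package main

import "fmt"

func main() {
	const (
		freelistLen = 1043564 // current freelist length, from above
		pgidSize    = 8       // bolt page ids are uint64
		pageSize    = 4096    // bolt's default page size
	)
	bytesPerCopy := freelistLen * pgidSize
	pagesPerFlush := (bytesPerCopy + pageSize - 1) / pageSize
	fmt.Printf("~%.1f MB per freelist copy, ~%d pages flushed per commit\n",
		float64(bytesPerCopy)/(1<<20), pagesPerFlush)
	// Prints roughly: ~8.0 MB per freelist copy, ~2039 pages flushed per commit
}
```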

For your amusement, some big numbers from pprof alloc_space:

[screenshot: pprof alloc_space output, 2016-12-23 7:21 AM]

This is blocking me right now, so I'm more than happy to put in the legwork to fix it, but I'd like input on direction. Some thoughts:

  • The obvious question is how the freelist got so large. There appear to be two causes.

    • Fragmentation. The current freelist implementation greedily allocates from the front of the list. For some usage patterns, this can cause fragmentation.
    • Large chunks from database growth. When the db grows by a large amount, that entire chunk of new pages gets tracked one by one.

    I posted a gist of contiguous page distributions from my db. (Measured at a different time than above; the numbers don't sum to 1043564, but the behavior is similar.) Column one is the number of contiguous pages. Column two is the number of runs of pages of that length in the freelist. Column three is the number of entries in the freelist accounted for by such runs (col 1 × col 2). You can see fragmentation (lots of runs of small length) and also big swaths of free space (very long runs). In terms of contributions to freelist length, very small runs and very large runs each appear to contribute (very roughly) 50%. (A sketch of how this distribution can be computed is included after the list of possible fixes below.)

    Possible fixes for this include:

    • Use a less greedy but more expensive freelist.allocate that attempts to allocate the smallest contiguous run of pages that satisfies its needs (a sketch follows this list).
    • Switch to run-length-encoding for freelists (including possibly altering how the cache works).
    • Write a tool to rewrite databases to shrink the freelist. Not sure how. I tried bolt compact. It had zero impact on freelist size and made the db marginally larger.
  • We could try to further address the symptoms.

    • Half the allocations come from db.allocate allocating a giant byte slice. But the byte slice size needed to hold the freelist grows rarely. So a special byte slice just for freelists during write transactions could be added to the db struct and re-used for this purpose. Alternatively, instead of a pool of single-page byte slices, we could keep a dynamic set of byte slice pools, caching whatever byte-slice lengths are most popular at the time.
    • Half the allocations come from freelist.release. We could introduce a reusable slice here. Or we could make freelist.release move pending transactions into a second slice instead of the main ids slice; freelist.allocate would then effectively merge f.ids and the secondary slice on the fly as it reads them. The full merge could happen at commit time, at which point the optimization in #636 can kick in.
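
As referenced above, here is a minimal Go sketch of how the contiguous-run distribution in the gist can be computed from a sorted dump of the freelist (freePageIDs is a hypothetical sorted, de-duplicated list of free page ids, not a bolt API):

```go
// runDistribution maps each contiguous-run length to the number of runs of
// that length in a sorted, de-duplicated list of free page ids. Multiplying
// key by value gives the number of freelist entries accounted for by runs of
// that length (column three in the gist).
func runDistribution(freePageIDs []uint64) map[int]int {
	dist := make(map[int]int)
	runLen := 0
	for i, id := range freePageIDs {
		if i > 0 && id == freePageIDs[i-1]+1 {
			runLen++
			continue
		}
		if runLen > 0 {
			dist[runLen]++
		}
		runLen = 1
	}
	if runLen > 0 {
		dist[runLen]++
	}
	return dist
}
```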
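
And a sketch of the first possible fix, a best-fit allocate: it scans the whole sorted id list for the smallest run that satisfies the request instead of taking the first one, trading extra CPU for less fragmentation. This is only an illustration of the policy, not bolt's actual freelist code:

```go
// bestFitAllocate is an illustrative best-fit variant of freelist allocation.
// It scans a sorted list of free page ids for the smallest contiguous run of
// at least n pages, returns that run's first id, and removes the allocated
// pages from the list. It returns 0 when no run is large enough (the caller
// would then grow the database file).
func bestFitAllocate(ids []uint64, n int) (uint64, []uint64) {
	type run struct{ start, length int } // start is an index into ids
	var runs []run
	for i := range ids {
		if i > 0 && ids[i] == ids[i-1]+1 {
			runs[len(runs)-1].length++
		} else {
			runs = append(runs, run{start: i, length: 1})
		}
	}
	best := -1
	for i, r := range runs {
		if r.length >= n && (best == -1 || r.length < runs[best].length) {
			best = i
		}
	}
	if best == -1 {
		return 0, ids
	}
	r := runs[best]
	id := ids[r.start]
	// Remove the n allocated pages; any remainder of the run stays free.
	ids = append(ids[:r.start], ids[r.start+n:]...)
	return id, ids
}
```

A smarter structure (e.g. runs indexed by length) would avoid the full scan, but even this naive version shows the best-fit idea.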

Input would be lovely.

josharian avatar Dec 23 '16 16:12 josharian

@josharian Thanks for digging in. What size keys & values are you using in the bolt database?

benbjohnson avatar Dec 23 '16 17:12 benbjohnson

Gist with full distribution.

Column 1 is the percentage of all k/v pairs with this key/value length combination. Column 2 is the number of such pairs. Column 3 is the key length. Column 4 is the value length.

So 28.46% of my k/v pairs (5788848 of them) have key length 34 and value length 1.

FWIW, last night I wrote a migration script that broke the 8192-byte values into values of length 4000, figuring those would fit on a page, and hoping it would solve the fragmentation problem. Took hours to run. :) Sadly, there was no perceptible impact on the freelist-related issues, and the db size increased by 10%. And I realized afterwards that when the freelist itself requires multiple pages (which it is guaranteed to do when growing the db significantly) and gets written out anew on every transaction, it is itself a source of freelist fragmentation. :/

josharian avatar Dec 23 '16 18:12 josharian

Oh, and those keys and values are stored in just two buckets. I can easily pull the distribution per bucket. It takes about 26 minutes to parse the file, so I can have it in about an hour. :)

josharian avatar Dec 23 '16 18:12 josharian

Huh, that was fast. Bucket 1 and bucket 2.

josharian avatar Dec 23 '16 18:12 josharian

FWIW, I just tried splitting the 8192-byte values and the 1-byte values into separate buckets, in case the bimodal size distribution was a source of problems. The migration is running now, but I can already see from the profile that it is not going to fix things.

josharian avatar Dec 23 '16 18:12 josharian

@benbjohnson and I discussed this off-list. Highlights:

  • I have found a temporary workaround by shifting a lot of my data out of bolt.
  • I am going to investigate using run length encoding for freelists.
  • I am going to investigate doing more buffer reuse.

Leaving this open for now, pending my investigations.

josharian avatar Dec 28 '16 20:12 josharian

@josharian

We hit the same issue in etcd. etcd keeps all historical keys in boltdb. Users can compact keys, suddenly freeing up a lot of boltdb pages. The performance of boltdb then starts to drop due to the issue you described.

More interestingly, if a read-only transaction happens to be open around that time, boltdb keeps putting the HUGE freelist into the freelist's pending list, which results in even more free pages once the read-only transaction is released.

We are working on reusing the freed pages from finished write txns while a read-only transaction is open. This can mitigate the second effect, but the first one is still highly undesirable for our use case.

Have you tried anything yet?

/cc @wojtek-t @heyitsanthony

xiang90 avatar Jun 02 '17 06:06 xiang90

Glad (in a way) to hear that this was not just me.

I did write some experimental RLE code, but I don't recall now where it landed. I stopped because @benbjohnson indicated it was probably not a good use of my time. My normal laptop is in the shop, but when it returns in a few days, I can dig up and share what I wrote, although you almost certainly would not want to use it directly (needs review, testing, polishing, etc.).

Also: Hi @wojtek-t! I think I last saw you over in Kubernetes world. :)

josharian avatar Jun 02 '17 13:06 josharian

Oops. Found this in my inbox. I've pushed my run-length-based implementation at https://github.com/josharian/bolt in branch runlength. Just two commits: https://github.com/josharian/bolt/commit/81f86c1633c2dd19a1a77bcec0cf818b447592e0 and then https://github.com/josharian/bolt/commit/7d446413885903d16d460dc49d72ad9244694ce9. If the commit messages do not inspire confidence...they shouldn't. :) Caveat lector!
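
For anyone following along, the core idea of a run-length-encoded freelist is to store (start, count) pairs instead of one id per page, so a ~1M-entry list dominated by long runs collapses into far fewer entries. The sketch below only illustrates that encoding; it is not the code in the branch above:

```go
// pgid mirrors bolt's page id type.
type pgid uint64

// pidRun is one run-length-encoded freelist entry: count consecutive free
// pages starting at start.
type pidRun struct {
	start pgid
	count uint64
}

// encodeRuns converts a sorted, de-duplicated list of free page ids into runs.
func encodeRuns(ids []pgid) []pidRun {
	var runs []pidRun
	for _, id := range ids {
		if n := len(runs); n > 0 && id == runs[n-1].start+pgid(runs[n-1].count) {
			runs[n-1].count++
			continue
		}
		runs = append(runs, pidRun{start: id, count: 1})
	}
	return runs
}
```

The harder part is making allocation, the pending/release bookkeeping, and the on-disk format all work in runs rather than individual ids.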

josharian avatar Jul 21 '17 19:07 josharian