
Storage engine benchmarks

Open kellabyte opened this issue 6 years ago • 11 comments

We should create some independent benchmarks for storage engines. We can get input from the various storage engine authors to make sure we are creating apples-to-apples comparisons and using each engine with the right settings, while we make sure nothing unfair happens.

We can compare Go-native and cgo-based storage engines.

Setup

[2]  Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
[16] 1600 MHz 8GB DDR3 Single Rank DIMM
[1]  Raid Controller
[8]  Intel S3520 SSD 240GB
[1]  Broadcom 5720 1Gbps Ethernet Adapter
[2]  Intel Ethernet Server Adapter X520

kellabyte@kbperf1:~$ sudo hdparm -Tt /dev/sda
/dev/sda:
 Timing cached reads:   22046 MB in  1.99 seconds = 11059.90 MB/sec
 Timing buffered disk reads: 2654 MB in  3.00 seconds = 884.56 MB/sec

kellabyte@kbperf1:~$ sudo hdparm -Tt /dev/sda
/dev/sda:
 Timing cached reads:   21698 MB in  1.99 seconds = 10882.74 MB/sec
 Timing buffered disk reads: 2818 MB in  3.00 seconds = 939.18 MB/sec

kellabyte@kbperf1:~$ sudo hdparm -Tt /dev/sda
/dev/sda:
 Timing cached reads:   21800 MB in  1.99 seconds = 10933.80 MB/sec
 Timing buffered disk reads: 2828 MB in  3.00 seconds = 942.53 MB/sec

kellabyte@kbperf2:~$ sudo hdparm -Tt /dev/sda
/dev/sda:
 Timing cached reads:   21900 MB in  1.99 seconds = 10982.74 MB/sec
 Timing buffered disk reads: 2800 MB in  3.00 seconds = 933.14 MB/sec

kellabyte@kbperf2:~$ sudo hdparm -Tt /dev/sda
/dev/sda:
 Timing cached reads:   21856 MB in  1.99 seconds = 10957.83 MB/sec
 Timing buffered disk reads: 2822 MB in  3.00 seconds = 940.41 MB/sec

kellabyte@kbperf2:~$ sudo hdparm -Tt /dev/sda
/dev/sda:
 Timing cached reads:   21952 MB in  1.99 seconds = 11011.40 MB/sec
 Timing buffered disk reads: 2810 MB in  3.00 seconds = 936.53 MB/sec

Requirements

  1. Document guarantees like we did for queues, so that performance numbers have the context of what safety and correctness guarantees are provided. Some storage engines make tradeoffs for performance, some for correctness.
  2. Benchmark sequential writes of different sizes.
  3. Benchmark random writes of different sizes.
  4. Benchmark mixed workloads of different ratios.
  5. Measure latency histograms properly, correcting for coordinated omission (see the sketch after this list).
  6. Include longer 72+ hour benchmarks to measure degradation.
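
On point 5, here is a minimal Go sketch of the idea (the pacing loop and names are illustrative, not the project's actual harness; in practice wrk2 does this for us): latency is measured from each request's intended start time, not from when the client actually got around to sending it.

```go
package main

import (
	"fmt"
	"time"
)

// doRequest stands in for one benchmark operation (hypothetical).
func doRequest() { time.Sleep(2 * time.Millisecond) }

func main() {
	const targetQPS = 1000
	interval := time.Second / targetQPS
	start := time.Now()
	var latencies []time.Duration

	for i := 0; i < 5000; i++ {
		// Each request has an intended start time fixed by the
		// schedule, independent of when earlier requests finished.
		intended := start.Add(time.Duration(i) * interval)
		if wait := time.Until(intended); wait > 0 {
			time.Sleep(wait)
		}
		doRequest()
		// Measure from the intended start: if the server stalls and
		// requests queue up, the backlog shows up as latency instead
		// of being silently omitted.
		latencies = append(latencies, time.Since(intended))
	}
	fmt.Printf("recorded %d latency samples\n", len(latencies))
}
```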

Storage engines

Architecture

I propose an architecture where each storage engine is wrapped in a standard HTTP API implemented in Go. This ensures the workload client code lives in one standard place and can utilize excellent HTTP-based reporting tools.
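
As a rough sketch of what such a wrapper could look like (the route, interface, and in-memory engine below are all hypothetical placeholders, not a settled API):

```go
package main

import (
	"errors"
	"io"
	"log"
	"net/http"
	"sync"
)

// Engine is a hypothetical minimal interface each storage engine
// adapter would implement behind the common HTTP API.
type Engine interface {
	Get(key []byte) ([]byte, error)
	Put(key, value []byte) error
}

// memEngine is a stand-in so the sketch runs; real adapters would
// wrap the engines under test (e.g. LMDB or RocksDB via cgo).
type memEngine struct {
	mu sync.RWMutex
	m  map[string][]byte
}

func (e *memEngine) Get(key []byte) ([]byte, error) {
	e.mu.RLock()
	defer e.mu.RUnlock()
	v, ok := e.m[string(key)]
	if !ok {
		return nil, errors.New("not found")
	}
	return v, nil
}

func (e *memEngine) Put(key, value []byte) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.m[string(key)] = value
	return nil
}

func main() {
	eng := &memEngine{m: make(map[string][]byte)}
	http.HandleFunc("/db/", func(w http.ResponseWriter, r *http.Request) {
		key := []byte(r.URL.Path[len("/db/"):])
		switch r.Method {
		case http.MethodGet:
			v, err := eng.Get(key)
			if err != nil {
				http.Error(w, err.Error(), http.StatusNotFound)
				return
			}
			w.Write(v)
		case http.MethodPut:
			v, err := io.ReadAll(r.Body)
			if err != nil {
				http.Error(w, err.Error(), http.StatusBadRequest)
				return
			}
			if err := eng.Put(key, v); err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
		default:
			http.Error(w, "unsupported method", http.StatusMethodNotAllowed)
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```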

The HTTP benchmark tool will be implemented using wrk and wrk2 and their Lua scripting capabilities; example invocations follow the list below.

  1. wrk has excellent max-throughput benchmarking performance.
  2. wrk2 reports latency distributions that properly account for coordinated omission.
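
For example, hypothetical invocations against the HTTP wrapper (the script name, host, port, and rate are placeholders; wrk's -t/-c/-d/-s flags and wrk2's -R rate flag are real):

kellabyte@kbperf2:~$ wrk -t8 -c256 -d60s -s write_random.lua http://kbperf1:8080/db/
kellabyte@kbperf2:~$ wrk2 -t8 -c256 -d60s -R 50000 --latency -s write_random.lua http://kbperf1:8080/db/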

I have a minimal HTTP server that can currently handle 15M ops/sec (1.5GB/sec of requests) at 3% CPU utilization on the server.

Included in the benchmarks will be two baseline results so that we can measure the HTTP overhead and the cgo overhead.

Concerns

People may argue that C libraries like LMDB and RocksDB are at a disadvantage due to the cgo overhead, but I suspect that in long-term tests this won't be as visible. A lot of infrastructure is being written in Go, and we want to benchmark how these engines behave from it.

By benchmarking baselines of HTTP and HTTP+cgo we can visualize or subtract the overhead in the results.

Proposed benchmark variations

For each benchmark variation we come up with, there will be a short and a long (72 hours?) version.

TODO: Define variations with the community.

kellabyte avatar Apr 27 '18 22:04 kellabyte

Guarantees

+1 to documenting what guarantees are provided. My favorite one is: what are the durability guarantees for writes? In particular, we should ensure an apples-to-apples comparison with regard to when data is fsynced to disk. Some engines do it before the write is acknowledged; some do it periodically.

Factors to Consider for Benchmarks

  • Worth considering a benchmark that does sequential scans of various sizes, if the engine supports it.
  • For random reads/writes, it's worth considering both a uniform distribution and a more Zipfian-like distribution (see the sketch after this list). Hot keys can reveal interesting bottlenecks in the system.
  • I've seen two styles of benchmarks: N threads going as fast as they can, versus applying N QPS of load. I tend to prefer the latter, especially for latency measurements (it's hard to account for coordinated omission otherwise).
  • Benchmark with X% of deletes/overwrites mixed in as well. Some LSM databases, and potentially others, have degraded read performance in the presence of large fractions of deleted/overwritten data.
  • Total dataset size: data << RAM, data ~ 2*RAM, and data >> RAM are all potentially interesting scenarios.
  • SSD or HDD. SSD is likely the way to go, as it probably puts more stress on the performance of the storage engine itself.
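
On the uniform vs. Zipfian point, Go's math/rand already ships a Zipf generator, so a minimal sketch of the two key-index distributions (the key-space size and skew parameters below are arbitrary) might be:

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const keySpace = 1_000_000 // hypothetical number of distinct keys
	r := rand.New(rand.NewSource(42))

	// Uniform: every key index is equally likely.
	uniformKey := r.Int63n(keySpace)

	// Zipfian: a small set of hot keys dominates. s and v control the
	// skew; imax bounds the generated values at keySpace-1.
	zipf := rand.NewZipf(r, 1.1, 1.0, keySpace-1)
	hotKey := zipf.Uint64()

	fmt.Println(uniformKey, hotKey)
}
```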

danchia avatar Apr 28 '18 05:04 danchia

I'd love to see how the SQLite B-tree that FDB is currently using stacks up against them as well.

spullara avatar Apr 28 '18 18:04 spullara

@spullara Link? Is there a Go wrapper for it currently?

kellabyte avatar May 08 '18 05:05 kellabyte

The current selection of engines looks good; all of them claim to have ACID transaction support. I assume this will be a requirement for any other tested engines as well.

@danchia Nice list.

  • Durability: at least one test configuration should require all writes to be fully durable before returning to the client.
  • It's hard to imagine a usable DB that doesn't support sequential scans. This raises the question of ordered vs. unordered stores; presumably only K/V stores with ordered mappings are being considered here. It also raises the question of how much scope to attempt in terms of unique features of each engine.
  • Some non-uniform distribution is probably a good idea. I've used Pareto 80/20 more often, but this is going to be pretty dependent on individual use cases.
  • I prefer N threads going as fast as they can, because it tells you how well the hardware is being utilized. Fixed-QPS load is also use-case specific.
  • data >> RAM is good to know but a pain to test. You'll spend weeks or months collecting this data.
  • SSD or HDD: at this point HDD could stretch the test cycle out to years. I would consider skipping HDDs entirely; they just don't have enough IOPS to be worthwhile, not when it's so cheap to get multiple orders of magnitude more IOPS with NVMe etc.

hyc avatar May 08 '18 06:05 hyc

I've posted the hardware specs of the benchmark server.

@hyc I think most benchmarks should be durable, because that's the mode in which most people use these engines.

Let's start identifying some benchmark variations. How about we begin with 4 variations and expand from there by adding to them; a sketch of how they might be parameterized follows the list.

  1. A sequential write test. How big are the keys? How big are the values? What is the fsync policy?
  2. A sequential read test.
  3. A random write test. How big are the keys? How big are the values? What is the fsync policy?
  4. A random read test.
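
To make those open questions concrete, here is one hypothetical way to parameterize the variations (every name and default below is a placeholder for discussion, not a decision):

```go
package main

import "fmt"

// FsyncPolicy is a hypothetical enumeration of durability modes.
type FsyncPolicy int

const (
	FsyncEveryWrite FsyncPolicy = iota // fully durable before ack
	FsyncEveryN                        // batch N writes per fsync
	FsyncPeriodic                      // fsync on a timer
)

// Workload is a placeholder spec for one benchmark variation.
type Workload struct {
	Name       string
	Sequential bool // sequential vs. random key order
	ReadPct    int  // 0 = pure write, 100 = pure read
	KeyBytes   int
	ValueBytes int
	Fsync      FsyncPolicy
}

func main() {
	variations := []Workload{
		{"seq-write", true, 0, 16, 1024, FsyncEveryWrite},
		{"seq-read", true, 100, 16, 1024, FsyncEveryWrite},
		{"rand-write", false, 0, 16, 1024, FsyncEveryWrite},
		{"rand-read", false, 100, 16, 1024, FsyncEveryWrite},
	}
	for _, w := range variations {
		fmt.Printf("%+v\n", w)
	}
}
```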

kellabyte avatar May 08 '18 15:05 kellabyte

I'm thinking a common key type that databases use is the UUID. Some people use optimized UUIDs that are 12 bytes, but probably the most common are 16-byte UUIDs. Some people use sortable UUIDs; some people don't.

I was thinking of using these Lua scripts to create 16-byte UUID keys that are sequential and random.

https://gist.github.com/catwell/1e022833ae849180adf58d72245ce8e0
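
The gist is Lua for wrk; purely as an illustration of the same idea in this project's language, here is a sketch of sequential vs. random 16-byte keys (the exact byte layout is an assumption, not what the gist does):

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// sequentialKey packs a counter into the high bytes so keys sort in
// insertion order, mimicking a "sortable UUID".
func sequentialKey(counter uint64) [16]byte {
	var k [16]byte
	binary.BigEndian.PutUint64(k[:8], counter)
	// Low 8 bytes left zero (could carry a node ID, etc.).
	return k
}

// randomKey fills all 16 bytes with random data, like a random
// (v4-style) UUID.
func randomKey() [16]byte {
	var k [16]byte
	rand.Read(k[:])
	return k
}

func main() {
	fmt.Printf("%x\n", sequentialKey(1))
	fmt.Printf("%x\n", randomKey())
}
```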

kellabyte avatar May 08 '18 20:05 kellabyte

Let's discuss fsync policies. I would really appreciate @hyc's and @mdcallag's input here. Different storage engines have different durability guarantees, and I want to be able to show their strengths and weaknesses.

Should we define a set of batch sizes to fsync? What do you guys suggest?
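
To frame that discussion, here is a hedged sketch of the three policies that usually come up; the function shapes are illustrative only, not how any particular engine exposes them:

```go
package main

import (
	"os"
	"time"
)

// syncEveryWrite: fully durable before acknowledging each write.
func syncEveryWrite(f *os.File, rec []byte) error {
	if _, err := f.Write(rec); err != nil {
		return err
	}
	return f.Sync() // one fsync per write
}

// syncEveryN: amortize fsync cost over batches of n writes; up to
// n-1 acknowledged writes can be lost on a crash.
func syncEveryN(f *os.File, recs [][]byte, n int) error {
	for i, rec := range recs {
		if _, err := f.Write(rec); err != nil {
			return err
		}
		if (i+1)%n == 0 {
			if err := f.Sync(); err != nil {
				return err
			}
		}
	}
	return f.Sync()
}

// syncPeriodically: fsync on a timer; the durability window is the
// tick interval.
func syncPeriodically(f *os.File, every time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			f.Sync()
		case <-stop:
			f.Sync()
			return
		}
	}
}

func main() {} // sketch only; wire these into a real write path
```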

kellabyte avatar May 10 '18 16:05 kellabyte

@danchia Yeah, I plan to have 2 different tools for benchmarks: wrk for throughput, and wrk2, which applies a fixed-QPS load to calculate the latency distribution properly. wrk2 contains Gil Tene's coordinated-omission modifications.

kellabyte avatar May 10 '18 16:05 kellabyte

A typical time series workload (at least in my experience) has lots of frequent random writes and occasional large scan reads. The records would look like

(series ID, timestamp) => (value) or (timestamp, series ID) => (value)

The first version is optimized for reads of a few series for some time range. The second is better for reads of lots/all series for some time range. The first also has more random writes. Might be good to try both.
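
A small Go sketch of the two layouts (big-endian packing is assumed so that lexicographic key order matches numeric order):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// seriesFirstKey: (series ID, timestamp). Points of one series are
// adjacent, so scanning a time range of a few series is cheap, but
// concurrent series spread writes across the keyspace.
func seriesFirstKey(series uint32, ts uint64) []byte {
	k := make([]byte, 12)
	binary.BigEndian.PutUint32(k[:4], series)
	binary.BigEndian.PutUint64(k[4:], ts)
	return k
}

// timeFirstKey: (timestamp, series ID). All series for one instant
// are adjacent, so scanning everything in a time range is cheap, and
// writes are mostly append-ordered.
func timeFirstKey(ts uint64, series uint32) []byte {
	k := make([]byte, 12)
	binary.BigEndian.PutUint64(k[:8], ts)
	binary.BigEndian.PutUint32(k[8:], series)
	return k
}

func main() {
	fmt.Printf("%x\n", seriesFirstKey(7, 1527027000))
	fmt.Printf("%x\n", timeFirstKey(1527027000, 7))
}
```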

Preetam avatar May 22 '18 21:05 Preetam

Oh, and it would be great to see how throughput and latency change with various numbers of concurrent readers and writers.

Preetam avatar May 22 '18 21:05 Preetam

Great idea. Any news?

SamuelMarks avatar Sep 07 '18 03:09 SamuelMarks