
Performance testing problem

Open enefuture opened this issue 3 years ago • 10 comments

Hi, I built a version based on dragonboat-example/helloworld. Dragonboat can only sustain about 140k writes per second when the payload is 16 bytes each.

1.389829122s 6.949 micros/op, 143902.58 op/s, 0.000000 err/rate

This is quite different from the figure in the documentation (Dragonboat can sustain writes at 1.25 million per second when payload is 16 bytes each). Can you provide your benchmark program or give me some suggestions?

rc := config.Config{
	// ClusterID and NodeID of the raft node
	NodeID:    uint64(*nodeID),
	ClusterID: exampleClusterID,
	// In this example, we assume the end-to-end round trip time (RTT) between
	// NodeHost instances (on different machines, VMs or containers) is 200
	// milliseconds; it is set in the RTTMillisecond field of the
	// config.NodeHostConfig instance below.
	// ElectionRTT is set to 10 in this example; it determines that the node
	// should start an election if there is no heartbeat from the leader for
	// 10 * RTT time intervals.
	ElectionRTT: 10,
	// HeartbeatRTT is set to 1 in this example; it determines that when the
	// node is the leader, it should broadcast heartbeat messages to its
	// followers every 1 * RTT time interval.
	HeartbeatRTT: 1,
	CheckQuorum:  true,
	// SnapshotEntries determines how often a snapshot of the replicated state
	// machine should be taken. It is set to 6000000 here, which means a
	// snapshot will be captured for every 6000000 applied proposals (writes).
	// You need to determine a suitable value based on how much space you are
	// willing to use for Raft logs, how fast you can capture a snapshot of
	// your replicated state machine, how often such a snapshot is going to be
	// used, etc.
	SnapshotEntries: 6000000,
	// CompactionOverhead determines how many Raft entries already covered by
	// the latest snapshot should be kept once that snapshot is captured and
	// saved. This is useful when some followers are just slightly behind:
	// with such extra Raft entries available, the leader can send them
	// regular entries rather than a full snapshot image.
	CompactionOverhead: 1000000,
}

enefuture avatar Jul 28 '21 11:07 enefuture

@enefuture

Please note that -

  1. You should be using the same hardware & benchmark program (which has never been released) when reproducing the claimed benchmark numbers.
  2. Dragonboat's provided example programs are not for benchmarking purposes; I believe the code & docs explain this.

For the extremely low throughput you are seeing, please check your disk's fsync performance. Note that home grade NVMe SSDs are known to be slow here; all MacBook Pro laptops tested so far also come with SSDs that have slow fsync performance.
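If you want a rough idea of your disk's raw fsync latency, something like the following standalone sketch is usually enough (standard library only; the file name and iteration count are arbitrary):

// fsync_check.go - rough sketch for measuring the fsync latency of the disk
// backing the current directory; run it next to your Raft data directory.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	f, err := os.OpenFile("fsync-test.tmp", os.O_CREATE|os.O_RDWR, 0644)
	if err != nil {
		panic(err)
	}
	defer os.Remove("fsync-test.tmp")
	defer f.Close()
	buf := make([]byte, 4096)
	const n = 1000
	start := time.Now()
	for i := 0; i < n; i++ {
		// write a small block and force it to stable storage, similar to
		// what every Raft log append has to do
		if _, err := f.WriteAt(buf, 0); err != nil {
			panic(err)
		}
		if err := f.Sync(); err != nil {
			panic(err)
		}
	}
	elapsed := time.Since(start)
	fmt.Printf("%d fsyncs in %v, %.0f fsync/s, avg %v per fsync\n",
		n, elapsed, float64(n)/elapsed.Seconds(), elapsed/time.Duration(n))
}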

lni avatar Jul 29 '21 06:07 lni

@lni I experience the same with the ondisk example. Only 6 committed sync proposals per second are possible on my new MacBook Pro. Fsync can't be the reason for such a big gap. Any hints on how to tune it or how to run the benchmarks shown in the graphs?

Thanks!

sebholstein avatar Aug 13 '21 20:08 sebholstein

@SebastianM any chance you can run the same program on a different machine, say a Linux machine with an SSD?

lni avatar Aug 14 '21 02:08 lni

@lni I ran it on two linux machines now:

  1. A small 512 MB RAM droplet on DigitalOcean with an SSD
  2. My desktop machine with an AMD Ryzen 7 3700X (16 threads) @ 3.6 GHz, a fast Samsung NVMe SSD and Arch Linux

On the DigitalOcean VM I got at best 700 sync proposals per second. After a few minutes, one node crashed with a panic:

panic: lastApplied not moving forward

goroutine 187 [running]:
main.(*DiskKV).Update(0xc0007a9ce0, 0xc0001568c0, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0)
	/Users/Sebastian.Holstein/_personal_projects/dragonboat-example/ondisk/diskkv.go:402 +0x53e
github.com/lni/dragonboat/v3/internal/rsm.(*OnDiskStateMachine).Update(0xc000156880, 0xc0001568c0, 0x1, 0x1, 0xc001adccc0, 0x3a, 0x40, 0x0, 0x0)
	/Users/Sebastian.Holstein/go/pkg/mod/github.com/lni/dragonboat/[email protected]/internal/rsm/adapter.go:295 +0x5b

On my Linux desktop machine I achieved only ~400 per second, which I didn't expect compared to a $5 VM.

Here's the modified main.go; the rest of the example code is unchanged. Am I doing anything wrong here?

// Copyright 2017,2018 Lei Ni ([email protected]).
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

/*
ondisk is an example program for dragonboat's on disk state machine.
*/
package main

import (
	"context"
	"encoding/json"
	"flag"
	"fmt"
	"log"
	"math/rand"
	"os"
	"os/signal"
	"path/filepath"
	"runtime"
	"strconv"
	"sync/atomic"
	"syscall"
	"time"

	"github.com/lni/dragonboat/v3"
	"github.com/lni/dragonboat/v3/config"
	"github.com/lni/dragonboat/v3/logger"
	"github.com/lni/goutils/syncutil"
)

type RequestType uint64

const (
	exampleClusterID uint64 = 128
)

const (
	PUT RequestType = iota
	GET
)

var (
	// initial nodes count is fixed to three, their addresses are also fixed
	addresses = []string{
		"localhost:63001",
		"localhost:63002",
		"localhost:63003",
	}
)

func printUsage() {
	fmt.Fprintf(os.Stdout, "Usage - \n")
	fmt.Fprintf(os.Stdout, "put key value\n")
	fmt.Fprintf(os.Stdout, "get key\n")
}

var rps uint64
var rpsErr uint64

func main() {
	nodeID := flag.Int("nodeid", 1, "NodeID to use")
	addr := flag.String("addr", "", "Nodehost address")
	join := flag.Bool("join", false, "Joining a new node")
	flag.Parse()
	if len(*addr) == 0 && *nodeID != 1 && *nodeID != 2 && *nodeID != 3 {
		fmt.Fprintf(os.Stderr, "node id must be 1, 2 or 3 when address is not specified\n")
		os.Exit(1)
	}
	// https://github.com/golang/go/issues/17393
	if runtime.GOOS == "darwin" {
		signal.Ignore(syscall.Signal(0xd))
	}
	initialMembers := make(map[uint64]string)
	if !*join {
		for idx, v := range addresses {
			initialMembers[uint64(idx+1)] = v
		}
	}
	var nodeAddr string
	if len(*addr) != 0 {
		nodeAddr = *addr
	} else {
		nodeAddr = initialMembers[uint64(*nodeID)]
	}
	fmt.Fprintf(os.Stdout, "node address: %s\n", nodeAddr)
	logger.GetLogger("raft").SetLevel(logger.ERROR)
	logger.GetLogger("rsm").SetLevel(logger.ERROR)
	logger.GetLogger("transport").SetLevel(logger.ERROR)
	logger.GetLogger("grpc").SetLevel(logger.ERROR)
	logger.GetLogger("dragonboat").SetLevel(logger.ERROR)
	rc := config.Config{
		NodeID:             uint64(*nodeID),
		ClusterID:          exampleClusterID,
		ElectionRTT:        10,
		HeartbeatRTT:       1,
		CheckQuorum:        true,
		SnapshotEntries:    10000,
		CompactionOverhead: 5,
	}
	datadir := filepath.Join(
		"example-data",
		"helloworld-data",
		fmt.Sprintf("node%d", *nodeID))
	nhc := config.NodeHostConfig{
		WALDir:         datadir,
		NodeHostDir:    datadir,
		RTTMillisecond: 10,
		RaftAddress:    nodeAddr,
	}
	nh, err := dragonboat.NewNodeHost(nhc)
	if err != nil {
		panic(err)
	}
	if err := nh.StartOnDiskCluster(initialMembers, *join, NewDiskKV, rc); err != nil {
		fmt.Fprintf(os.Stderr, "failed to add cluster, %v\n", err)
		os.Exit(1)
	}
	raftStopper := syncutil.NewStopper()
	consoleStopper := syncutil.NewStopper()
	consoleStopper.RunWorker(func() {
		for {
			log.Println("RPS", rps, "errors", rpsErr)
			atomic.SwapUint64(&rps, 0)
			atomic.SwapUint64(&rpsErr, 0)
			time.Sleep(time.Second)
		}
	})
	printUsage()
	raftStopper.RunWorker(func() {
		time.Sleep(time.Second * 10)
		log.Println("Start")
		cs := nh.GetNoOPSession(exampleClusterID)
		ctx := context.Background()
		for i := 0; i < 5; i++ {
			go func() {
				for {
					ctx, cancel := context.WithTimeout(ctx, time.Second*3)
					data := strconv.Itoa(int(rand.Uint64()))
					kv := &KVData{
						Key: data,
						Val: data,
					}
					b, _ := json.Marshal(kv)
					// use := so each goroutine gets its own err variable
					// instead of racing on the err declared in main
					_, err := nh.SyncPropose(ctx, cs, b)
					cancel()
					if err == nil {
						atomic.AddUint64(&rps, 1)
					} else {
						atomic.AddUint64(&rpsErr, 1)
					}
				}
			}()

		}
		select {}
	})
	raftStopper.Wait()
}

sebholstein avatar Aug 14 '21 08:08 sebholstein

@SebastianM I'd be willing to bet that your Samsung NVMe SSD is a consumer grade one. The fsync performance of such consumer grade SSDs is in general terrible. Please see some test results below.

https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/

I will look into that panic issue; it seems to come from the example code.

lni avatar Aug 16 '21 16:08 lni

Note that Pebble (the default LogDB implementation) is particularly slow on macOS (darwin). To be precise, the fsync operation is terribly slow: https://github.com/cockroachdb/pebble/issues/1028. I can confirm this as we faced the same issue ourselves.

coufalja avatar Sep 02 '21 17:09 coufalja

Hello, I built upon an existing benchmark and deployed it to an AWS r5dn.large instance. The log is configured to be in-memory, three Raft nodes are deployed, and I also disabled snapshotting and log compaction.

In my tests, reads were generally around 70% slower than etcd, and writes took more than twice as long as etcd or hashicorp.

@lni Do you see any flaws in the way the key-value store is implemented? You can find the code at: https://github.com/alexanderstephan/raft-bench/tree/master/dragonboat

I will provide some graphs with more details soon.

alexanderstephan avatar Jan 15 '23 16:01 alexanderstephan

@alexanderstephan

  1. It appears that your benchmarks are making local reads in etcd and linearizable reads in dragonboat.

Using ReadLocalNode rather than SyncRead in the dragonboat benchmark would be a fairer comparison for reads.
https://pkg.go.dev/github.com/lni/dragonboat/v4#NodeHost.ReadLocalNode

  2. As for writes, your etcd raft node is configured to make asynchronous writes. You push a value to a channel without waiting for confirmation that the proposal was committed or applied.

Your dragonboat node, on the other hand, is configured to make synchronous writes. SyncPropose will not return until the proposed entry has been committed and applied.


These two examples are operating at different levels of durability and consistency. Your etcd raft configuration does not guarantee that proposals are accepted, nor that they have been applied. Proposals may be lost, and values that do get written are not guaranteed to be available on subsequent reads, since reads are local and writes only guarantee that the proposals have been committed and will eventually be applied. Your dragonboat example, in contrast, provides both of these guarantees and is truly linearizable. As such, it's not surprising that their performance differs.

Relaxing durability and consistency is a pretty easy way to improve performance. Dragonboat has methods to help build systems with lower consistency requirements (i.e. ReadLocalNode, NotifyCommit).
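For reference, a rough sketch of the ReadIndex + ReadLocalNode flow (v3-style names and signatures assumed; readValue, its timeout and the []byte query are placeholders, so double check it against the docs linked above):

// rough sketch of a ReadIndex + ReadLocalNode based read path; readValue and
// its parameters are placeholders, not part of dragonboat.
package readsketch

import (
	"fmt"
	"time"

	"github.com/lni/dragonboat/v3"
)

func readValue(nh *dragonboat.NodeHost, clusterID uint64, key string) (interface{}, error) {
	// ReadIndex starts the read index protocol and returns immediately
	rs, err := nh.ReadIndex(clusterID, 3*time.Second)
	if err != nil {
		return nil, err
	}
	// wait for the read index result before querying the local node; the
	// query value depends on what your state machine's Lookup expects
	r := <-rs.ResultC()
	if !r.Completed() {
		return nil, fmt.Errorf("read index request did not complete")
	}
	return nh.ReadLocalNode(rs, []byte(key))
}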

You may also consider testing at different levels of concurrency. The single-client configuration in util.benchmark fails to leverage Dragonboat's pipelining and batching, which can significantly improve performance in more realistic workloads with hundreds of clients.
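Something along these lines (a sketch only; it reuses the imports from the main.go posted earlier in this thread plus sync and sync/atomic, and numClients, proposalsPerClient and the 16 byte payload are placeholders) should make the effect visible:

// runClients drives SyncPropose from many concurrent client goroutines so
// that dragonboat's internal batching and pipelining can kick in.
func runClients(nh *dragonboat.NodeHost, clusterID uint64, numClients, proposalsPerClient int) uint64 {
	var wg sync.WaitGroup
	var ok uint64
	for i := 0; i < numClients; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// one NO-OP session per client, as in the example code above
			cs := nh.GetNoOPSession(clusterID)
			for j := 0; j < proposalsPerClient; j++ {
				ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
				_, err := nh.SyncPropose(ctx, cs, []byte("16-byte-payload."))
				cancel()
				if err == nil {
					atomic.AddUint64(&ok, 1)
				}
			}
		}()
	}
	wg.Wait()
	return ok
}

Comparing aggregate throughput at, say, 1, 16, 64 and 256 clients should make the pipelining and batching effect visible.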

kevburnsjr avatar Jan 15 '23 22:01 kevburnsjr

@kevburnsjr Thanks so much, that must explain it! I have actually implemented mechanisms to await proposals and retry logic for etcd in another project, but completely forgot about that here...

Hashicorp was even faster than etcd, but as it turns out, it is also configured to be async.

I will extend the benchmarks based on your suggestions. I am really curious to see how the results change.

Btw, are you aware of any other direct performance comparison between etcd, hashicorp and dragonboat, or is this something new? I was a bit puzzled that I could not find anything in this regard.

alexanderstephan avatar Jan 15 '23 23:01 alexanderstephan

Thanks @kevburnsjr for the detailed insight above.

etcd (last checked maybe 12 months ago) comes with a reasonably efficient log storage implementation designed for a small number of raft shards. It becomes pretty much unusable when you increase the shard count to, say, hundreds or a couple of thousand, as each shard requires its own log storage file that must be fsynced for each write. You end up with hundreds of slow fsync calls when you make a proposal to each of those shards. In comparison, a small, fixed & configurable number of fsyncs (up to 16 by default) will be issued in dragonboat when writing to thousands of shards.

It also lacks facilities to efficiently handle a large number of concurrent requests from a large number of clients.

All of this is related to the pipelining and batching features in dragonboat, as explained by @kevburnsjr above. In my opinion, they are pretty essential features to have, as you can't expect most users to build their own execution pipeline or internal batching for logging.

Note that this is not meant to argue which one is better; I just want to let people know the difference.

lni avatar Jan 21 '23 15:01 lni