go-dqlite
Excessive resource usage with TLS or external connection
It seems that as soon as each cluster member begins regularly querying for the leader info, the CPU usage of go-dqlite balloons.
This was noticed in microcluster, where we have been forced to set our cluster heartbeats to fire only every 30 seconds to avoid excessive CPU usage.
The plaintext implementation appears fine, but when starting with WithExternalConn or WithTLS, the service appears to become very inefficient.
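For reference, TLS is wired up through the app package's WithTLS and SimpleTLSConfig helpers, roughly as in the sketch below. The newApp helper, the file names, and the exact option list are illustrative, not the demo's verbatim code:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"log"
	"os"

	"github.com/canonical/go-dqlite/app"
)

// newApp is an illustrative helper showing how the TLS options are typically
// passed to app.New; it is not the exact dqlite-demo code.
func newApp(dir, address, crtFile, keyFile string) (*app.App, error) {
	options := []app.Option{app.WithAddress(address)}

	// Load the shared certificate/key pair and also trust it as the CA,
	// mirroring what the demo does when --cert/--key are passed.
	cert, err := tls.LoadX509KeyPair(crtFile, keyFile)
	if err != nil {
		return nil, err
	}
	data, err := os.ReadFile(crtFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(data) {
		return nil, fmt.Errorf("invalid certificate in %s", crtFile)
	}

	// SimpleTLSConfig produces the listen and dial configs expected by WithTLS.
	options = append(options, app.WithTLS(app.SimpleTLSConfig(cert, pool)))

	return app.New(dir, options...)
}

func main() {
	// Paths and addresses here are placeholders.
	a, err := newApp("/tmp/dqlite-demo-data", "127.0.0.1:9001", "cluster.crt", "cluster.key")
	if err != nil {
		log.Fatal(err)
	}
	defer a.Close()
	// ... start the HTTP API, etc.
}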
Take the following example using WithTLS in the dqlite-demo example daemon. The only change is adding the following loop to each peer to query the dqlite leader every 3 seconds:
// Following the call to http.Serve:
go func() {
	// Initially sleep to give the daemon time to set up.
	time.Sleep(5 * time.Second)

	for {
		// Get a client to the leader.
		c, err := app.Leader(context.Background())
		if err != nil {
			panic(err)
		}

		// Fetch the leader's node info.
		c.Leader(context.Background())

		// Sleep 3s before trying again.
		time.Sleep(3 * time.Second)
	}
}()
Now launch 3 dqlite-demo peers with TLS enabled:
DNS=$(hostname)
IP=$(hostname -I | cut -f 1 -d ' ')
CN=example.com
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 \
  -nodes -keyout cluster.key -out cluster.crt -subj "/CN=$CN" \
  -addext "subjectAltName=DNS:$DNS,IP:$IP"

for i in $(seq 1 3); do
  if [ ${i} = 1 ]; then
    dqlite-demo --api 127.0.0.1:800${i} --db 127.0.0.1:900${i} --cert "cluster.crt" --key "cluster.key" &
    continue
  fi
  sleep 1
  dqlite-demo --api 127.0.0.1:800${i} --db 127.0.0.1:900${i} --join 127.0.0.1:9001 --cert "cluster.crt" --key "cluster.key" &
done
Finally, see the result:
health-check -c -p 163326 | head -20
CPU usage (in terms of 1 CPU):
  PID     Process         USR%   SYS%  TOTAL%  Duration
  163326  dqlite-demo     4.02   0.90    4.92     60.00  (light load)

Page Faults:
  PID     Process       Minor/sec  Major/sec  Total/sec
  163326  dqlite-demo       55.20       0.00      55.20

Context Switches:
  PID     Process         Voluntary  Involuntary        Total
                        Ctxt Sw/Sec  Ctxt Sw/Sec  Ctxt Sw/Sec
  163333  dqlite-demo        205.45         0.03       205.48  (quite high)
  163328  dqlite-demo        201.15         0.00       201.15  (quite high)
  163326  dqlite-demo        162.78         0.03       162.81  (quite high)
  163330  dqlite-demo        105.57         0.03       105.60  (quite high)
  163329  dqlite-demo         86.23         0.03        86.27  (moderate)
  163380  dqlite-demo         71.87         0.08        71.95  (moderate)
  163331  dqlite-demo         70.47         0.03        70.50  (moderate)
  163335  dqlite-demo         69.53         0.08        69.62  (moderate)
  163378  dqlite-demo         57.15         0.02        57.17  (moderate)
Meanwhile, top zig-zags between 0% and 10% CPU usage at each "heartbeat" interval:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
163326 root 20 0 2813608 33852 10752 S 9.0 0.1 0:08.66 dqlite-demo
For comparison, I set the same 3-second dqlite leader check running continually on every node of a 3-member LXD cluster, and oddly LXD still uses less CPU with comparable context switches, never going above 2% CPU usage. This is on top of LXD's already-present continual cluster heartbeats. It's worth noting that LXD does not use the main dqlite.App implementation.
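To illustrate what skipping app.App looks like in practice, here is a rough sketch of the same 3-second leader check done through the lower-level client package. The in-memory node store contents and the placeholder TLS config are assumptions for illustration, not LXD's actual code:

package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"log"
	"time"

	"github.com/canonical/go-dqlite/client"
)

func main() {
	// Seed an in-memory node store with the demo's dqlite addresses.
	store := client.NewInmemNodeStore()
	err := store.Set(context.Background(), []client.NodeInfo{
		{Address: "127.0.0.1:9001"},
		{Address: "127.0.0.1:9002"},
		{Address: "127.0.0.1:9003"},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder TLS config; a real one would be built from cluster.crt/cluster.key.
	cfg := &tls.Config{InsecureSkipVerify: true}
	dial := client.DialFuncWithTLS(client.DefaultDialFunc, cfg)

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)

		// Find the current leader and ask it for its node info.
		c, err := client.FindLeader(ctx, store, client.WithDialFunc(dial))
		if err != nil {
			log.Fatal(err)
		}
		leader, err := c.Leader(ctx)
		if err == nil && leader != nil {
			fmt.Println("leader:", leader.Address)
		}
		c.Close()
		cancel()

		time.Sleep(3 * time.Second)
	}
}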
CPU usage (in terms of 1 CPU):
  PID     Process   USR%   SYS%  TOTAL%  Duration
  185737  lxd       0.48   1.03    1.52     60.00  (very light load)

Page Faults:
  PID     Process   Minor/sec  Major/sec  Total/sec
  185737  lxd            5.33       0.00       5.33

Context Switches:
  PID     Process     Voluntary  Involuntary        Total
                    Ctxt Sw/Sec  Ctxt Sw/Sec  Ctxt Sw/Sec
  185755  lxd            288.58         0.02       288.60  (quite high)
  185840  lxd            217.03         0.02       217.05  (quite high)
  185762  lxd            166.83         0.17       167.00  (quite high)
  185758  lxd            152.43         0.10       152.53  (quite high)
  185757  lxd            132.81         0.02       132.83  (quite high)
  185778  lxd            132.01         0.02       132.03  (quite high)
  185837  lxd             93.98         0.03        94.02  (moderate)
  185836  lxd             65.27         0.00        65.27  (moderate)
  185834  lxd             48.82         0.02        48.83  (moderate)