go-dqlite
Excessive resource usage with TLS or external connection
It seems that as soon as each cluster member begins regularly querying for the leader info, the CPU usage of go-dqlite balloons.
This was noticed in microcluster, where we have been forced to set our cluster heartbeats to fire only every 30 seconds to avoid excessive CPU usage.
The plaintext implementation appears fine, but when starting with WithExternalConn or WithTLS, the service appears to become very inefficient.
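For reference, TLS is wired up through the app package's WithTLS and SimpleTLSConfig helpers, roughly as in the sketch below. The newApp helper, the file names, and the exact option list are illustrative, not the demo's verbatim code:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"log"
	"os"

	"github.com/canonical/go-dqlite/app"
)

// newApp is an illustrative helper showing how the TLS options are typically
// passed to app.New; it is not the exact dqlite-demo code.
func newApp(dir, address, crtFile, keyFile string) (*app.App, error) {
	options := []app.Option{app.WithAddress(address)}

	// Load the shared certificate/key pair and also trust it as the CA,
	// mirroring what the demo does when --cert/--key are passed.
	cert, err := tls.LoadX509KeyPair(crtFile, keyFile)
	if err != nil {
		return nil, err
	}
	data, err := os.ReadFile(crtFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(data) {
		return nil, fmt.Errorf("invalid certificate in %s", crtFile)
	}

	// SimpleTLSConfig produces the listen and dial configs expected by WithTLS.
	options = append(options, app.WithTLS(app.SimpleTLSConfig(cert, pool)))

	return app.New(dir, options...)
}

func main() {
	// Paths and addresses here are placeholders.
	a, err := newApp("/tmp/dqlite-demo-data", "127.0.0.1:9001", "cluster.crt", "cluster.key")
	if err != nil {
		log.Fatal(err)
	}
	defer a.Close()
	// ... start the HTTP API, etc.
}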
Take the following example using WithTLS in the dqlite-demo example daemon. The only change is adding the following loop to each peer to query the dqlite leader every 3 seconds:
// Following the call to http.Serve:
go func() {
	// Initially sleep to give the daemon time to set up.
	time.Sleep(5 * time.Second)

	for {
		// Get a client to the leader.
		c, err := app.Leader(context.Background())
		if err != nil {
			panic(err)
		}

		// Fetch the leader's node info.
		c.Leader(context.Background())

		// Sleep 3s before trying again.
		time.Sleep(3 * time.Second)
	}
}()
Now launch 3 dqlite-demo peers with TLS enabled:
DNS=$(hostname)
IP=$(hostname -I | cut -f 1 -d ' ')
CN=example.com
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 \
  -nodes -keyout cluster.key -out cluster.crt -subj "/CN=$CN" \
  -addext "subjectAltName=DNS:$DNS,IP:$IP"

for i in $(seq 1 3); do
  if [ ${i} = 1 ]; then
    dqlite-demo --api 127.0.0.1:800${i} --db 127.0.0.1:900${i} --cert "cluster.crt" --key "cluster.key" &
    continue
  fi
  sleep 1
  dqlite-demo --api 127.0.0.1:800${i} --db 127.0.0.1:900${i} --join 127.0.0.1:9001 --cert "cluster.crt" --key "cluster.key" &
done
Finally, see the result:
health-check -c -p 163326 | head -20
CPU usage (in terms of 1 CPU):
  PID     Process         USR%   SYS%  TOTAL%  Duration
  163326  dqlite-demo     4.02   0.90    4.92     60.00  (light load)

Page Faults:
  PID     Process       Minor/sec  Major/sec  Total/sec
  163326  dqlite-demo       55.20       0.00      55.20

Context Switches:
  PID     Process         Voluntary  Involuntary        Total
                        Ctxt Sw/Sec  Ctxt Sw/Sec  Ctxt Sw/Sec
  163333  dqlite-demo        205.45         0.03       205.48  (quite high)
  163328  dqlite-demo        201.15         0.00       201.15  (quite high)
  163326  dqlite-demo        162.78         0.03       162.81  (quite high)
  163330  dqlite-demo        105.57         0.03       105.60  (quite high)
  163329  dqlite-demo         86.23         0.03        86.27  (moderate)
  163380  dqlite-demo         71.87         0.08        71.95  (moderate)
  163331  dqlite-demo         70.47         0.03        70.50  (moderate)
  163335  dqlite-demo         69.53         0.08        69.62  (moderate)
  163378  dqlite-demo         57.15         0.02        57.17  (moderate)
Meanwhile, top zig-zags between 0% and 10% CPU usage at each "heartbeat" interval:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
163326 root 20 0 2813608 33852 10752 S 9.0 0.1 0:08.66 dqlite-demo
For comparison, I set the same 3-second dqlite leader check running continually on every node of a 3-member LXD cluster, and oddly LXD still uses less CPU with comparable context switches, never going above 2% CPU usage. This is on top of LXD's already-present continual cluster heartbeats. It's worth noting that LXD does not use the main dqlite.App implementation.
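To illustrate what skipping app.App looks like in practice, here is a rough sketch of the same 3-second leader check done through the lower-level client package. The in-memory node store contents and the placeholder TLS config are assumptions for illustration, not LXD's actual code:

package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"log"
	"time"

	"github.com/canonical/go-dqlite/client"
)

func main() {
	// Seed an in-memory node store with the demo's dqlite addresses.
	store := client.NewInmemNodeStore()
	err := store.Set(context.Background(), []client.NodeInfo{
		{Address: "127.0.0.1:9001"},
		{Address: "127.0.0.1:9002"},
		{Address: "127.0.0.1:9003"},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder TLS config; a real one would be built from cluster.crt/cluster.key.
	cfg := &tls.Config{InsecureSkipVerify: true}
	dial := client.DialFuncWithTLS(client.DefaultDialFunc, cfg)

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)

		// Find the current leader and ask it for its node info.
		c, err := client.FindLeader(ctx, store, client.WithDialFunc(dial))
		if err != nil {
			log.Fatal(err)
		}
		leader, err := c.Leader(ctx)
		if err == nil && leader != nil {
			fmt.Println("leader:", leader.Address)
		}
		c.Close()
		cancel()

		time.Sleep(3 * time.Second)
	}
}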
CPU usage (in terms of 1 CPU):
  PID     Process   USR%   SYS%  TOTAL%  Duration
  185737  lxd       0.48   1.03    1.52     60.00  (very light load)

Page Faults:
  PID     Process   Minor/sec  Major/sec  Total/sec
  185737  lxd            5.33       0.00       5.33

Context Switches:
  PID     Process     Voluntary  Involuntary        Total
                    Ctxt Sw/Sec  Ctxt Sw/Sec  Ctxt Sw/Sec
  185755  lxd            288.58         0.02       288.60  (quite high)
  185840  lxd            217.03         0.02       217.05  (quite high)
  185762  lxd            166.83         0.17       167.00  (quite high)
  185758  lxd            152.43         0.10       152.53  (quite high)
  185757  lxd            132.81         0.02       132.83  (quite high)
  185778  lxd            132.01         0.02       132.03  (quite high)
  185837  lxd             93.98         0.03        94.02  (moderate)
  185836  lxd             65.27         0.00        65.27  (moderate)
  185834  lxd             48.82         0.02        48.83  (moderate)