
etcd: higher disk latency and round trip time since 2765.2.0

ribbybibby opened this issue • 2 comments

We're seeing slower network peer round trip times for our bare-metal etcd nodes since updating from stable 2605.12.0 to 2765.2.0 (and now 2765.2.1). This correlates with an increase in WAL fsync and backend commit durations.

The 99th percentile for etcd_network_peer_round_trip_time_seconds has increased from an average of 100ms to 400ms.

Rolling back to 2605.12.0 restores the previous level of performance; rolling forward again reproduces the issue. So the Flatcar version is very much relevant.

Here are some graphs that illustrate the issue. Our nodes rolled into 2765.2.0 on March 4th.

Round trip time: (screenshot, 2021-03-19)

WAL fsync duration: (screenshot, 2021-03-19)

Backend commit duration: (screenshot, 2021-03-19)

Weighted IOs: (screenshot, 2021-03-19)

I ran fio on one of the nodes, on both 2605.12.0 and 2765.2.1 for comparison, using the arguments suggested here:

fio --rw=write --ioengine=sync --fdatasync=1 --directory=/var/lib/etcd/tmp --size=22m --bs=2300 --name=etcd

2605.12.0:

etcd-2605.12.0-0: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.26-19-ge7e53
Starting 1 process
etcd-2605.12.0-0: Laying out IO file (1 file / 22MiB)

etcd-2605.12.0-0: (groupid=0, jobs=1): err= 0: pid=19: Fri Mar 19 09:51:24 2021
  write: IOPS=1508, BW=3387KiB/s (3469kB/s)(22.0MiB/6650msec); 0 zone resets
    clat (usec): min=4, max=121, avg=14.26, stdev= 6.26
     lat (usec): min=4, max=122, avg=14.65, stdev= 6.48
    clat percentiles (usec):
     |  1.00th=[    7],  5.00th=[    8], 10.00th=[    9], 20.00th=[   10],
     | 30.00th=[   11], 40.00th=[   12], 50.00th=[   15], 60.00th=[   15],
     | 70.00th=[   15], 80.00th=[   19], 90.00th=[   24], 95.00th=[   26],
     | 99.00th=[   35], 99.50th=[   36], 99.90th=[   42], 99.95th=[  109],
     | 99.99th=[  114]
   bw (  KiB/s): min= 3014, max= 3688, per=99.87%, avg=3383.69, stdev=217.19, samples=13
   iops        : min= 1342, max= 1642, avg=1506.69, stdev=96.73, samples=13
  lat (usec)   : 10=25.62%, 20=58.81%, 50=15.50%, 100=0.02%, 250=0.06%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=217, max=12143, avg=645.20, stdev=404.94
    sync percentiles (usec):
     |  1.00th=[  225],  5.00th=[  233], 10.00th=[  253], 20.00th=[  265],
     | 30.00th=[  293], 40.00th=[  355], 50.00th=[  734], 60.00th=[  799],
     | 70.00th=[  848], 80.00th=[  906], 90.00th=[  979], 95.00th=[ 1045],
     | 99.00th=[ 1876], 99.50th=[ 1942], 99.90th=[ 2245], 99.95th=[ 4228],
     | 99.99th=[11469]
  cpu          : usr=1.56%, sys=9.46%, ctx=15664, majf=0, minf=15
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=3387KiB/s (3469kB/s), 3387KiB/s-3387KiB/s (3469kB/s-3469kB/s), io=22.0MiB (23.1MB), run=6650-6650msec

Disk stats (read/write):
  sda: ios=0/21265, merge=0/11251, ticks=0/5707, in_queue=47, util=98.59%

2765.2.1:

etcd-2765.2.1-0: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.26-19-ge7e53
Starting 1 process
etcd-2765.2.1-0: Laying out IO file (1 file / 22MiB)

etcd-2765.2.1-0: (groupid=0, jobs=1): err= 0: pid=104: Fri Mar 19 09:23:09 2021
  write: IOPS=1330, BW=2989KiB/s (3060kB/s)(22.0MiB/7537msec); 0 zone resets
    clat (usec): min=4, max=129, avg=16.99, stdev= 8.10
     lat (usec): min=5, max=130, avg=17.43, stdev= 8.28
    clat percentiles (usec):
     |  1.00th=[    7],  5.00th=[    8], 10.00th=[    9], 20.00th=[   10],
     | 30.00th=[   12], 40.00th=[   14], 50.00th=[   15], 60.00th=[   20],
     | 70.00th=[   23], 80.00th=[   23], 90.00th=[   26], 95.00th=[   28],
     | 99.00th=[   38], 99.50th=[   45], 99.90th=[  117], 99.95th=[  122],
     | 99.99th=[  127]
   bw (  KiB/s): min= 2479, max= 3539, per=100.00%, avg=2990.80, stdev=344.45, samples=15
   iops        : min= 1104, max= 1576, avg=1331.73, stdev=153.37, samples=15
  lat (usec)   : 10=22.56%, 20=39.68%, 50=37.45%, 100=0.18%, 250=0.12%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=216, max=10954, avg=729.63, stdev=452.31
    sync percentiles (usec):
     |  1.00th=[  221],  5.00th=[  227], 10.00th=[  258], 20.00th=[  281],
     | 30.00th=[  322], 40.00th=[  404], 50.00th=[  775], 60.00th=[  914],
     | 70.00th=[  979], 80.00th=[ 1057], 90.00th=[ 1188], 95.00th=[ 1319],
     | 99.00th=[ 2008], 99.50th=[ 2114], 99.90th=[ 2474], 99.95th=[ 4686],
     | 99.99th=[ 9896]
  cpu          : usr=2.03%, sys=9.98%, ctx=19296, majf=0, minf=13
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,10029,0,0 short=10029,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=2989KiB/s (3060kB/s), 2989KiB/s-2989KiB/s (3060kB/s-3060kB/s), io=22.0MiB (23.1MB), run=7537-7537msec

Disk stats (read/write):
  sda: ios=0/20836, merge=0/11018, ticks=0/5973, in_queue=5973, util=98.73%
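To put a number on the difference, the average fdatasync latency reported by fio above rose from ~645µs on 2605.12.0 to ~730µs on 2765.2.1 (p99 from ~1.88ms to ~2.0ms). A quick sanity check of the relative slowdown, using the averages from the two runs:

```python
# Average fdatasync latencies reported by the two fio runs above (microseconds).
old_avg = 645.20  # 2605.12.0
new_avg = 729.63  # 2765.2.1

# Relative regression in percent.
regression = (new_avg - old_avg) / old_avg * 100
print(f"avg fdatasync latency regression: {regression:.1f}%")  # → about 13.1%
```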

And here's the output of some commands that might provide insight into our setup. Please let me know if there's any other information that would be helpful.

$ etcd --version
etcd Version: 3.4.13
Git SHA: ae9734ed2
Go Version: go1.12.17
Go OS/Arch: linux/amd64
$ mount | grep sda1
/dev/sda1 on /var/lib/etcd type ext4 (rw,relatime,seclabel)
$ sudo fdisk -l /dev/sda1
Disk /dev/sda1: 23.9 GiB, 25605806080 bytes, 50011340 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
$ sudo lshw -c disk -c storage
  *-storage
       description: SATA controller
       product: 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode]
       vendor: Intel Corporation
       physical id: 1f.2
       bus info: pci@0000:00:1f.2
       version: 05
       width: 32 bits
       clock: 66MHz
       capabilities: storage msi pm ahci_1.0 bus_master cap_list
       configuration: driver=ahci latency=0
       resources: irq:25 ioport:3040(size=8) ioport:3048(size=4) ioport:3050(size=8) ioport:3058(size=4) ioport:3060(size=32) memory:fb7e0000-fb7e07ff
  *-scsi
       physical id: 1
       logical name: scsi4
       capabilities: emulated
     *-disk
          description: ATA Disk
          product: ORIGIN Inception
          physical id: 0.0.0
          bus info: scsi@4:0.0.0
          logical name: /dev/sda
          version: 3A0
          serial: 522492002150124
          size: 238GiB (256GB)
          capabilities: gpt-1.00 partitioned partitioned:gpt
          configuration: ansiversion=5 guid=9e5b4701-156e-4856-a8f1-5ab1e1f7f5f7 sectorsize=512

ribbybibby · Mar 19 '21 13:03

We see a ~15% degradation in the scimark2 benchmark as well after upgrading to 2765.2.0+.

PavloZabudskyi · Apr 13 '21 15:04

Could you check whether the block I/O scheduler affects the disk latency for you? See /sys/block/sda/queue/scheduler. The default may have changed between releases.
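For reference, a minimal sketch of how to inspect the active scheduler (standard Linux sysfs paths; the mq-deadline example below is an assumption and must be available in the kernel):

```shell
# List the available I/O schedulers for every block device;
# the active one is shown in brackets, e.g. [mq-deadline].
for f in /sys/block/*/queue/scheduler; do
  [ -e "$f" ] || continue
  printf '%s: ' "$f"
  cat "$f"
done

# To try a different scheduler for sda (takes effect immediately,
# does not persist across reboots):
#   echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
```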

jepio · Jan 28 '22 15:01