scylla-monitoring
RAID0 partition metrics don't match those of the individual NVMe disks that comprise it
Installation details
Panel Name: Disk Writes/Reads
Dashboard Name: OS Metrics
Scylla-Monitoring Version: 4.7.1
Scylla-Version: 2024.1.3-0.20240401.64115ae91a55
Kernel version on all nodes: 5.15.0-1058-gcp
Description
The throughput (bytes or OPS) of the RAID0 volume (md0 in the screenshots below) is supposed to equal the sum of the corresponding values on the physical disks comprising it.
However, it is far from that. In some cases, like in the screenshots below, the value reported for the RAID0 volume is even lower.
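For reference, here is a minimal sketch of the comparison we expect to hold, written against the standard node_exporter diskstats metric node_disk_read_bytes_total; the Prometheus address, the instance label value, and the 5-minute rate window are assumptions for illustration, not taken from this setup:

```python
import requests

PROM = "http://localhost:9090/api/v1/query"  # assumption: Prometheus address

def instant(query):
    """Run an instant PromQL query and return {device: value}."""
    resp = requests.get(PROM, params={"query": query}, timeout=10)
    resp.raise_for_status()
    return {
        r["metric"].get("device", ""): float(r["value"][1])
        for r in resp.json()["data"]["result"]
    }

# Per-device read throughput (bytes/s) over the last 5 minutes on one node.
# The instance value below is a placeholder.
node = "10.0.0.1:9100"
rates = instant(
    f'rate(node_disk_read_bytes_total{{instance="{node}",'
    f'device=~"md0|nvme0n[1-4]"}}[5m])'
)

md0 = rates.get("md0", 0.0)
members = sum(v for d, v in rates.items() if d.startswith("nvme"))
print(f"md0: {md0:.0f} B/s  sum of members: {members:.0f} B/s")
# If the exported metrics are consistent, md0 should be roughly equal to the sum.
```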
In the example below, md0 is a RAID0 volume assembled from 4 NVMe disks: nvme0n1, nvme0n2, nvme0n3, nvme0n4.
Here is a screenshot showing md0 and only nvme0n1 from all nodes (the picture is the same for all other disks):
Here you can see the values from all disks on a single node, clearly showing the problem:
I ran iostat on one of the nodes to check whether this might be a kernel issue, but no: iostat shows values that add up exactly:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util
md0 323.00 99448.00 0.00 0.00 2.12 307.89 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.68 9.60
nvme0n1 86.00 24836.00 0.00 0.00 2.50 288.79 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.21 6.80
nvme0n2 75.00 23640.00 0.00 0.00 2.33 315.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.17 4.80
nvme0n3 80.00 24832.00 0.00 0.00 2.40 310.40 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.19 6.00
nvme0n4 82.00 26140.00 0.00 0.00 2.35 318.78 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.19 6.80
sda 0.00 0.00 0.00 0.00 0.00 0.00 128.00 736.00 3.00 2.29 0.57 5.75 0.00 0.00 0.00 0.00 0.00 0.00 0.07 1.60
avg-cpu: %user %nice %system %iowait %steal %idle
2.03 0.00 1.60 0.00 0.00 96.37
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util
md0 430.00 149624.00 0.00 0.00 2.17 347.96 1.00 4.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.93 8.80
nvme0n1 92.00 33124.00 0.00 0.00 2.70 360.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.25 7.60
nvme0n2 96.00 33804.00 0.00 0.00 2.41 352.12 1.00 4.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.23 8.40
nvme0n3 92.00 33256.00 0.00 0.00 2.41 361.48 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.22 6.80
nvme0n4 100.00 33056.00 0.00 0.00 2.19 330.56 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.22 6.80
avg-cpu: %user %nice %system %iowait %steal %idle
2.37 0.00 1.31 0.00 0.00 96.32
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util
md0 290.00 98304.00 0.00 0.00 3.52 338.98 4.00 56.00 0.00 0.00 0.00 14.00 0.00 0.00 0.00 0.00 0.00 0.00 1.02 6.80
nvme0n1 88.00 29924.00 0.00 0.00 3.08 340.05 1.00 32.00 0.00 0.00 0.00 32.00 0.00 0.00 0.00 0.00 0.00 0.00 0.27 5.60
nvme0n2 85.00 27560.00 0.00 0.00 2.79 324.24 2.00 16.00 0.00 0.00 0.50 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24 6.00
nvme0n3 75.00 28412.00 0.00 0.00 2.77 378.83 1.00 8.00 0.00 0.00 0.00 8.00 0.00 0.00 0.00 0.00 0.00 0.00 0.21 5.60
nvme0n4 92.00 28792.00 0.00 0.00 2.63 312.96 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24 5.20
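Since iostat and node_exporter's diskstats collector both derive their numbers from /proc/diskstats, the kernel counters can also be sampled directly to confirm they add up. A minimal sketch (device names as in this cluster; the sampling interval is chosen arbitrarily):

```python
import time

DEVICES = {"md0", "nvme0n1", "nvme0n2", "nvme0n3", "nvme0n4"}
SECTOR = 512    # /proc/diskstats counts 512-byte sectors
INTERVAL = 10   # seconds between samples

def read_sectors():
    """Return {device: sectors_read} parsed from /proc/diskstats."""
    out = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] in DEVICES:
                out[fields[2]] = int(fields[5])  # field 6: sectors read
    return out

before = read_sectors()
time.sleep(INTERVAL)
after = read_sectors()

rates = {d: (after[d] - before[d]) * SECTOR / INTERVAL for d in before}
members = sum(v for d, v in rates.items() if d != "md0")
print(f"md0: {rates.get('md0', 0):.0f} B/s  sum of nvme members: {members:.0f} B/s")
```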
We saw similar behavior on multiple clusters.
cc @tarzanek @vreniers @mkeeneyj
@vladzcloudius if I get it right, this is a node_exporter issue, right?
Could be.
@vladzcloudius could it be: https://github.com/prometheus/node_exporter/issues/2310