
Multiple clients "All clients" section miscalculation

Open · tomerg22 opened this issue 6 months ago · 8 comments

Please acknowledge the following before creating a ticket

  • [X] I have read the GitHub issues section of REPORTING-BUGS.

Description of the bug: "All clients" section miscalculation

Environment: Ubuntu 24.04

fio version: 3.36

Reproduction command:

fio --client ipa ./fio.cfg --client ipb ./fio.cfg --output=out

fio.cfg:

[global]
ioengine=libaio
direct=1
iodepth=24
numjobs=1
group_reporting=1
new_group
bs=4k
size=100m
filename=/tmp/fio.data
[format]
bs=128k
rw=write

Output:

  write: IOPS=6015, BW=752MiB/s (788MB/s)(100MiB/133msec); 0 zone resets
  write: IOPS=5882, BW=735MiB/s (771MB/s)(100MiB/136msec); 0 zone resets
All clients: (groupid=0, jobs=2): err= 0: pid=0: Thu Jun 12 12:38:19 2025
  write: IOPS=11.8k, BW=1471Mi (1542M)(200MiB/136msec); 0 zone resets

The values in the "All clients" section are actually a sum of the 2nd client's values rather than client1 + client2.

Thanks.

tomerg22 · Jun 12 '25 12:06

Hello @tomerg22: I don't quite follow this:

The values in the "All clients" section are actually a sum of the 2nd client's values rather than client1 + client2.

Is the first write line ("write: IOPS=6015...") in your output for the ipa client and the second write line ("write: IOPS=5882...") for your ipb client? You've truncated the fio output, so my interpretation may be wrong; could you clarify further?

sitsofe · Jun 12 '25 14:06

I see that IOPS 11.8k ~= 6015 + 5882, but BW=1471Mi != 752 + 735 and BW=1542M != 788 + 771.

Can you try the current version of fio? There have been some All clients changes since fio 3.36.

I examined the output for Test 11 from t/client_server.py and the All clients read BW does match the sum of the read BW values for the two clients.

vincentkfu · Jun 12 '25 14:06

@sitsofe That's correct: the first line is the first client, ipa, and the second is ipb. That's the same order they appear in the text output and in the JSON format as well.

@vincentkfu Actually the IOPS is also 2 x the ipb result: 2 x 5,882 = 11,764 = 11.8k, which makes sense if it's being calculated like the BW. I also tested this with the latest version, 3.40, and it looks the same.
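(The same holds for the bandwidth: 2 x 735 = 1470, which rounds to the reported 1471 MiB/s, and 2 x 771 = 1542 MB/s matches exactly, while 752 + 735 = 1487 MiB/s appears nowhere in the output.)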

tomerg22 · Jun 12 '25 14:06

Can you share details from your 3.40 test?

For test 11 from the test script, the JSON output shows read BW 10419084 + 10508395 ~= 20927480.

  "client_stats" : [
    {
      "jobname" : "test",
      "groupid" : 0,
      "job_start" : 1749737182930,
      "error" : 0,
      "job options" : {
        "ioengine" : "null",
        "iodepth" : "2",
        "filesize" : "1T",
        "time_based" : "",
        "runtime" : "3s",
        "slat_percentiles" : "1",
        "clat_percentiles" : "1",
        "lat_percentiles" : "1"
      },
      "read" : {
        "io_bytes" : 32018096128,
        "io_kbytes" : 31267672,
        "bw_bytes" : 10669142328,
        "bw" : 10419084,
        "iops" : 2604771.076308,
...
    {
      "jobname" : "test",
      "groupid" : 0,
      "job_start" : 1749737182930,
      "error" : 0,
      "job options" : {
        "ioengine" : "null",
        "iodepth" : "2",
        "filesize" : "1T",
        "time_based" : "",
        "runtime" : "3s",
        "slat_percentiles" : "1",
        "clat_percentiles" : "1",
        "lat_percentiles" : "1"
      },
      "read" : {
        "io_bytes" : 32292552704,
        "io_kbytes" : 31535696,
        "bw_bytes" : 10760597368,
        "bw" : 10508395,
        "iops" : 2627098.967011,
...
    {
      "jobname" : "All clients",
      "groupid" : 0,
      "job_start" : 0,
      "error" : 0,
      "read" : {
        "io_bytes" : 64310648832,
        "io_kbytes" : 62803368,
        "bw_bytes" : 21429739697,
        "bw" : 20927480,
        "iops" : 5231870.043319,
...

vincentkfu · Jun 12 '25 14:06

@vincentkfu Yes, I've attached the file:

out.json

tomerg22 · Jun 12 '25 19:06

Does this happen if you omit group_reporting=1?

vincentkfu · Jun 12 '25 19:06

With this config it happens as well on 3.40:

[global]
ioengine=libaio
direct=1
iodepth=24
numjobs=1
bs=4k
size=100m
filename=/tmp/fio.data
[format]
bs=128k
rw=write

tomerg22 · Jun 13 '25 08:06

OK, I've been able to reproduce this. I won't have more time to debug it today, and I will be on vacation next week, so any fix will be a ways away.

vincent@fio-dev:~/fio-dev/1918/fio-canonical$ ./fio --client localhost test.fio --client localhost,8766 test.fio
hostname=fio-dev, be=0, 64-bit, os=Linux, arch=x86-64, fio=fio-3.40-6-g81344, flags=1
hostname=fio-dev, be=0, 64-bit, os=Linux, arch=x86-64, fio=fio-3.40-6-g81344, flags=1
<fio-dev> format: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=24
<fio-dev> format: (g=0): rw=write, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=24
<fio-dev> Starting 1 process
<fio-dev> Starting 1 process
<fio-dev> 
<fio-dev> 
format: (groupid=0, jobs=1): err= 0: pid=3509: Fri Jun 13 08:55:33 2025
  write: IOPS=22.2k, BW=2778MiB/s (2913MB/s)(100MiB/36msec); 0 zone resets
    slat (usec): min=17, max=6814, avg=43.23, stdev=289.28
    clat (nsec): min=922, max=7750.8k, avg=996210.06, stdev=1399151.84
     lat (usec): min=22, max=7792, avg=1039.44, stdev=1429.72
    clat percentiles (usec):
     |  1.00th=[  155],  5.00th=[  449], 10.00th=[  506], 20.00th=[  529],
     | 30.00th=[  537], 40.00th=[  545], 50.00th=[  570], 60.00th=[  652],
     | 70.00th=[  848], 80.00th=[  947], 90.00th=[ 1090], 95.00th=[ 5211],
     | 99.00th=[ 7701], 99.50th=[ 7701], 99.90th=[ 7767], 99.95th=[ 7767],
     | 99.99th=[ 7767]
  lat (nsec)   : 1000=0.12%
  lat (usec)   : 50=0.25%, 100=0.25%, 250=0.88%, 500=7.25%, 750=55.38%
  lat (usec)   : 1000=24.50%
  lat (msec)   : 2=5.62%, 10=5.75%
  cpu          : usr=17.14%, sys=54.29%, ctx=491, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.2%, 4=0.5%, 8=1.0%, 16=98.1%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,800,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=24

Run status group 0 (all jobs):
  WRITE: bw=2778MiB/s (2913MB/s), 2778MiB/s-2778MiB/s (2913MB/s-2913MB/s), io=100MiB (105MB), run=36-36msec
format: (groupid=0, jobs=1): err= 0: pid=3510: Fri Jun 13 08:55:33 2025
  write: IOPS=24.2k, BW=3030MiB/s (3178MB/s)(100MiB/33msec); 0 zone resets
    slat (usec): min=18, max=5314, avg=39.46, stdev=218.99
    clat (nsec): min=931, max=9335.0k, avg=909332.22, stdev=1136621.42
     lat (usec): min=22, max=9370, avg=948.79, stdev=1177.81
    clat percentiles (usec):
     |  1.00th=[  155],  5.00th=[  494], 10.00th=[  510], 20.00th=[  519],
     | 30.00th=[  523], 40.00th=[  529], 50.00th=[  545], 60.00th=[  578],
     | 70.00th=[  857], 80.00th=[  947], 90.00th=[ 1090], 95.00th=[ 4015],
     | 99.00th=[ 6128], 99.50th=[ 6128], 99.90th=[ 9372], 99.95th=[ 9372],
     | 99.99th=[ 9372]
  lat (nsec)   : 1000=0.12%
  lat (usec)   : 50=0.25%, 100=0.25%, 250=0.88%, 500=4.38%, 750=58.50%
  lat (usec)   : 1000=24.25%
  lat (msec)   : 2=6.00%, 10=5.38%
  cpu          : usr=0.00%, sys=75.00%, ctx=353, majf=0, minf=11
  IO depths    : 1=0.1%, 2=0.2%, 4=0.5%, 8=1.0%, 16=98.1%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,800,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=24

Run status group 0 (all jobs):
  WRITE: bw=3030MiB/s (3178MB/s), 3030MiB/s-3030MiB/s (3178MB/s-3178MB/s), io=100MiB (105MB), run=33-33msec
All clients: (groupid=0, jobs=2): err= 0: pid=0: Fri Jun 13 08:55:33 2025
  write: IOPS=44.4k, BW=5556Mi (5825M)(200MiB/36msec); 0 zone resets
    slat (usec): min=17, max=6814, avg=41.35, stdev=256.48
    clat (nsec): min=922, max=9335.0k, avg=952771.14, stdev=1275005.52
     lat (usec): min=22, max=9370, avg=994.12, stdev=1310.21
    clat percentiles (usec):
     |   1th=[  155],   5th=[  482],  10th=[  506],  20th=[  523],
     |  30th=[  529],  40th=[  537],  50th=[  553],  60th=[  627],
     |  70th=[  848],  80th=[  947],  90th=[ 1090],  95th=[ 4047],
     |  99th=[ 7635], 100th=[ 7701], 100th=[ 9372], 100th=[ 9372],
     | 100th=[ 9372]
  lat (nsec)   : 1000=0.12%
  lat (usec)   : 50=0.25%, 100=0.25%, 250=0.88%, 500=5.81%, 750=56.94%
  lat (usec)   : 1000=24.38%
  lat (msec)   : 2=5.81%, 10=5.56%
  cpu          : usr=8.96%, sys=64.18%, ctx=844, majf=0, minf=22
  IO depths    : 1=0.1%, 2=0.2%, 4=0.5%, 8=1.0%, 16=98.1%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=99.9%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1600,0,0 short=0,0,0,0 dropped=0,0,0,0

vincentkfu · Jun 13 '25 12:06

This seems not to be a bug.

What is happening is that for the "All clients" report, fio takes the max runtime of the two jobs (36 msec) to calculate the IOPS. The two jobs each submit 800 IOs, so 1600 IOs / 36 msec is simply double the IOPS of the slower job.
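In the run above, 1600 IOs / 36 msec ≈ 44.4k IOPS, exactly double the slower job's 22.2k; the faster job's 33 msec runtime is discarded by the max. The bandwidth shows the same pattern: 5556 MiB/s = 2 x 2778 MiB/s.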

When two thread_stats objects are combined, runtimes are treated this way:

if (dst->runtime[l] < src->runtime[l])
        dst->runtime[l] = src->runtime[l];

The counts of page faults and context switches in the cpu line are added as expected.
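
To make the arithmetic concrete, here is a minimal standalone sketch (illustrative only, not fio's actual reporting code) using the IO counts and runtimes from the reproduction above:

#include <stdio.h>

int main(void)
{
        /* per-client totals from the reproduction above */
        unsigned long long ios_a = 800, rt_a = 36;  /* IOs, runtime in msec */
        unsigned long long ios_b = 800, rt_b = 33;

        /* IO counts are summed when stats are merged... */
        unsigned long long ios = ios_a + ios_b;
        /* ...but the runtime keeps only the max, as in the snippet above */
        unsigned long long rt = rt_a > rt_b ? rt_a : rt_b;

        /* 1600 IOs / 36 msec = 44.4 IOs/msec, i.e. 44.4k IOPS:
         * double the slower job's 22.2k */
        printf("merged IOPS = %.1fk\n", (double)ios / (double)rt);
        return 0;
}

Summing the per-client rates instead would give 22.2k + 24.2k ≈ 46.4k; with the max-runtime approach the merged figure is pinned to the slowest client.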

vincentkfu avatar Jun 23 '25 20:06 vincentkfu