fio icon indicating copy to clipboard operation
fio copied to clipboard

FIO throw out error "clock setaffinity failed: Invalid argument" before it starts to run on Solaris 11.4

Open lic34 opened this issue 4 years ago • 21 comments

FIO profile

[global] ioengine=solarisaio iodepth=16 direct=1 bs_unaligned=0 time_based=1 rwmixwrite=50 rwmixread=50 do_verify=1 bsrange=4k-4k refill_buffers=0 thread runtime=2808 fill_device=1 numjobs=1 readwrite=randrw [public_lg_src_remote_23] filename=/dev/dsk/emcpower4c size=86% [public_lg_src_remote_24] filename=/dev/dsk/emcpower0c size=86%

Error:

Assertion failed: io_u->flags & IO_U_F_FLIGHT, file io_u.c, line 1374, function io_completed Abort (core dumped)

However, when it performs IO to one LUN, it works properly.

lic34 avatar Jul 09 '19 12:07 lic34

@lic34 I'm afraid we would need at least the full backtrace to be able to take this one further. If it's somewhat deterministic then we might be able to turn debug=io on too for some more clues...

sitsofe avatar Jul 09 '19 12:07 sitsofe

Get a backtrace of it (I don't have a solaris system so the core itself wouldn't be too useful). Something like

gdb --core=path/to/core path/to/fio/binary
thread apply all bt
quit

And copy pasting the backtrace here would do (note those commands are a rough guess).

sitsofe avatar Jul 09 '19 12:07 sitsofe

The backtrace is as following:(Moreover, I'm using FIO version with 2.0.14)

(gdb) bt #0 0xfdf5db85 in __nanosleep () from /lib/libc.so.1 #1 0xfdf48943 in nanosleep () from /lib/libc.so.1 #2 0xfdf48ce3 in usleep () from /lib/libc.so.1 #3 0x0808edf3 in ?? () #4 0x0808f1ab in fio_backend () #5 0x08094ac6 in main ()

lic34 avatar Jul 09 '19 13:07 lic34

@lic34:

Moreover, I'm using FIO version with 2.0.14

Ah that is important! I'm afraid we (upstream fio) don't support any version of fio other than the latest (at this stage your version of fio is over six years old). I would strongly suggest seeing if the problem happens in newer version of fio (see https://github.com/axboe/fio/releases for which release we're up to now).

sitsofe avatar Jul 09 '19 13:07 sitsofe

I got FIO with version 2.0.14 from this page [https://www.opencsw.org/packages/fio/] for my Solaris 11.4 host, but you recommend me to use the latest version. So, what's I'm concerned is that, does the FIO with the latest version support Solaris 11.4?

lic34 avatar Jul 09 '19 13:07 lic34

@lic34 Theoretically the latest one if you apply the fixes mentioned in #714 ...

sitsofe avatar Jul 09 '19 15:07 sitsofe

@lic34, I have not had issues building recent checkouts from master on Illumos, and while that does not guarantee anything, I expect that it should build on Solaris as well. If it does not, perhaps it is worth logging an issue with details from failure and I or someone else may be able to help.

szaydel avatar Jul 10 '19 03:07 szaydel

@sitsofe @szaydel thank you for your help~!!

I tried to use the latest version(3.14) on my Solaris host, and I hit the same issue as #714,

I changed the code as @sitsofe's suggestion:

changing

#define os_ctime_r(x, y, z) ctime_r((x), (y), (z))

to

#define os_ctime_r(x, y, z) ctime_r((x), (y))

It installs successfully.

However, when I run the FIO process, it reported an error as

clock setaffinity failed: Invalid argument

lic34 avatar Jul 10 '19 05:07 lic34

@lic34 Is there any more context to go along with that? I'm going to guess it's something to do with https://github.com/axboe/fio/blob/a0318e176ace1e53b0049965f63b0c60b55ff8cd/gettime.c#L588 so by extension https://github.com/axboe/fio/blob/de5ed0e4d398bc9d4576f9b2b82d7686989c27e1/os/os-solaris.h#L108 but we'd have to know which argument was bad and what the correct thing looks like...

sitsofe avatar Jul 10 '19 06:07 sitsofe

It seems the FIO keep running after the error throwed out, however there is only one process(job) was running, even there are more than two Luns in the profile.

root@ncvm9084105:/opt/csw/bin/fio-log# fio --output /opt/csw/bin/fio-log/test.log /opt/csw/bin/fio-log/test.fio clock setaffinity failed: Invalid argument Jobs: 1 (f=1): [X(1),m(1)][7.4%][r=16.8MiB/s,w=16.7MiB/s][r=4305,w=4280 IOPS][eta 43m:22s]

lic34 avatar Jul 10 '19 06:07 lic34

@lic34 Hmm... looking through https://docs.oracle.com/cd/E88353_01/html/E37841/pset-bind-2.html I wonder if fio is trying to bind to a cpu that your system has disallowed (I'm guessing that call is returning EINVAL). Could you simplify your job file further (i.e. remove as much as possible) so we have the bare minimum that still makes the current failing job issue happen?

(I'm likely busy for the rest of the day but hopefully others will chip in if they can)

sitsofe avatar Jul 10 '19 06:07 sitsofe

(Mentioning @alanc as a long shot)

sitsofe avatar Jul 10 '19 06:07 sitsofe

I simplified the profile likes:

_ [global] ioengine=solarisaio iodepth=16 direct=0 bs_unaligned=0 time_based=1 rwmixwrite=50 rwmixread=50 do_verify=1 bsrange=4k-4k refill_buffers=0 thread runtime=2808 fill_device=1 numjobs=1 readwrite=randrw [public_lg_src_remote_20] filename=/dev/dsk/emcpower1c size=86% _

However, it seems it has the same issue:

clock setaffinity failed: Invalid argument Jobs: 1 (f=1): [m(1)][0.2%][r=12.9MiB/s,w=12.9MiB/s][r=3293,w=3296 IOPS][eta 46m:43s]

lic34 avatar Jul 10 '19 06:07 lic34

@lic34 if you keep both jobs and remove options like direct, fill_device, refill_buffers, bs_unaligned, thread, numjobs (basically as many options as possible) etc. what's the smallest job that results in the premature death of one of the threads?

sitsofe avatar Jul 10 '19 08:07 sitsofe

I removed all options as following:

[global] ioengine=solarisaio [public_lg_src_remote_20] filename=/dev/dsk/emcpower1c size=86% [public_lg_src_remote_21] filename=/dev/dsk/emcpower2c size=86% [public_lg_src_remote_22] filename=/dev/dsk/emcpower3c size=86% [public_lg_src_remote_23] size=86% filename=/dev/dsk/emcpower4c size=86% [public_lg_src_remote_24] filename=/dev/dsk/emcpower0c size=86%

But the error is still here:

# fio --output /opt/csw/bin/fio-log/id_1562061394_912506_4191_25.log /opt/csw/bin/fio-log/id_1562061394_107510_4191_25.fio clock setaffinity failed: Invalid argument Jobs: 5 (f=5): [R(5)][3.0%][r=50.9MiB/s][r=13.0k IOPS][eta 01h:12m:51s]

lic34 avatar Jul 10 '19 08:07 lic34

I recall seeing something to this end. I will see if it happens on illumos with multiple cores, where I think I observed it also, and if I have some spare time I will see if I can get Solaris going also.

szaydel avatar Jul 10 '19 19:07 szaydel

Thanks @szaydel for your support! Moreover, I find another issue when I was using the FIO on my Solaris 11.4, please refer to #798 for more details.

lic34 avatar Jul 11 '19 13:07 lic34

Segfault is not something I expected to see, but maybe it is somehow related to this issue. I will try to mess with it this weekend.

szaydel avatar Jul 12 '19 03:07 szaydel

What happens if you change to say posixaio, and still use threads?

szaydel avatar Jul 12 '19 03:07 szaydel

Sorry for my late reply.

I change the IO engine from solarisaio to posixaio and add thread, it seems the fio is working property after throwing an error as "clock setaffinity failed: Invalid argument".

The output when the process begins run:

root@ncvm9084105:/opt/csw/bin/fio-log# fio --output debug.log debug.fio clock setaffinity failed: Invalid argument Jobs: 5 (f=5): [m(5)][24.4%][r=6782KiB/s,w=6898KiB/s][r=1695,w=1724 IOPS][eta 35m:22s]

FIO parameters:

root@ncvm9084105:/opt/csw/bin/fio-log# cat debug.fio [global] ioengine=posixaio iodepth=16 direct=0 thread bs_unaligned=0 time_based=1 rwmixwrite=50 rwmixread=50 do_verify=1 bsrange=4k-4k refill_buffers=0 runtime=2808 fill_device=1 numjobs=1 readwrite=randrw [public_lg_src_remote_20] filename=/dev/dsk/emcpower1c size=86% [public_lg_src_remote_21] filename=/dev/dsk/emcpower2c size=86% [public_lg_src_remote_22] filename=/dev/dsk/emcpower3c size=86% [public_lg_src_remote_23] filename=/dev/dsk/emcpower4c size=86% [public_lg_src_remote_24] filename=/dev/dsk/emcpower0c size=86%

lic34 avatar Jul 17 '19 01:07 lic34

I apologize, myself. I have been really busy over these last few days. I am not reproducing this issue on illumos, which means, I think the problem is specific to Solaris. If I have a little bit of time, I will try with Solaris 11.4. I think the clock affinity error and the crash you were seeing are two separate things, and it would be useful to figure out what is really causing that crash.

szaydel avatar Jul 20 '19 21:07 szaydel