
Segmentation Fault (core dumped) On Solaris 11.4

Open lic34 opened this issue 4 years ago • 10 comments

I tried to run parallel I/O against 5 LUNs. The fio profile looks like this:

[global]
ioengine=solarisaio
thread
iodepth=16
direct=0
bs_unaligned=0
time_based=1
rwmixwrite=50
rwmixread=50
do_verify=1
bsrange=4k-4k
refill_buffers=0
runtime=2808
fill_device=1
numjobs=1
readwrite=randrw

[public_lg_src_remote_20]
filename=/dev/dsk/emcpower1c
size=86%

[public_lg_src_remote_21]
filename=/dev/dsk/emcpower2c
size=86%

[public_lg_src_remote_22]
filename=/dev/dsk/emcpower3c
size=86%

[public_lg_src_remote_23]
filename=/dev/dsk/emcpower4c
size=86%

[public_lg_src_remote_24]
filename=/dev/dsk/emcpower0c
size=86%

It starts with 5 jobs, but one minute later only 4 were running, and the job count eventually dropped to 1. Even worse, fio ended with a core dump:

root@ncvm9084105:/opt/csw/bin/fio-log# fio --output ./test.log ./test.fio
clock setaffinity failed: Invalid argument
Jobs: 4 (f=4): [m(2),X(1),m(2)][1.5%][r=868KiB/s,w=820KiB/s][r=217,w=205 IOPS][eta 46m:36s]

root@ncvm9084105:/opt/csw/bin/fio-log# fio --output ./test.log ./test.fio
clock setaffinity failed: Invalid argument
Segmentation Fault (core dumped)(1)][1.5%][r=1670KiB/s,w=1734KiB/s][r=417,w=433 IOPS][eta 47m:17s]

However, when I remove the parameter "thread" from the FIO profile, it works normally:

root@ncvm9084105:/opt/csw/bin/fio-log# cat test.fio
[global]
ioengine=solarisaio
iodepth=16
direct=0
bs_unaligned=0
time_based=1
rwmixwrite=50
rwmixread=50
do_verify=1
bsrange=4k-4k
refill_buffers=0
runtime=2808
fill_device=1
numjobs=1
readwrite=randrw

[public_lg_src_remote_20]
filename=/dev/dsk/emcpower1c
size=86%

[public_lg_src_remote_21]
filename=/dev/dsk/emcpower2c
size=86%

[public_lg_src_remote_22]
filename=/dev/dsk/emcpower3c
size=86%

[public_lg_src_remote_23]
filename=/dev/dsk/emcpower4c
size=86%

[public_lg_src_remote_24]
filename=/dev/dsk/emcpower0c
size=86%

root@ncvm9084105:/opt/csw/bin/fio-log# fio --output ./test.log ./test.fio
clock setaffinity failed: Invalid argument
Jobs: 5 (f=5): [m(5)][14.8%][r=2361KiB/s,w=2361KiB/s][r=590,w=590 IOPS][eta 39m:56s]

lic34 Jul 11 '19 13:07

@lic34 we're going to need to see a backtrace that includes all the threads (thread apply all bt) from the point of the crash to even start looking at this one...

sitsofe Jul 12 '19 13:07

@lic34, are you by chance running this as root or with sudo, or as a normal user? To be a bit clearer: at least in my tests, when I don't run as root, the pset_create() call here does not succeed: https://github.com/axboe/fio/blob/de5ed0e4d398bc9d4576f9b2b82d7686989c27e1/os/os-solaris.h#L151

A simple test case:

# No root here
$ ./pset-create-test
result: -1
pset_create: Not owner

# With sudo
$ sudo ./pset-create-test
result: 0
pset_create: Error 0

My very simple test driver for this:

#include <stdio.h>
#include <sys/pset.h>
#include <errno.h>
#include <string.h>

int main(void) {
    psetid_t newpset = 0;
    int res = pset_create(&newpset);

    printf("result: %d\n", res);
    perror("pset_create");
}

I am wondering if we are failing to do some things because we do not have a particular level of access.

szaydel Jul 14 '19 23:07

I am not necessarily suggesting that it is all a permissions thing, but I wanted to find out whether you have been doing this with elevated permissions and, if not, whether you could try that as a test.

szaydel Jul 14 '19 23:07

@lic34, are you experiencing this failure with solarisaio very consistently, or is it intermittent? I think I am reproducing this problem, but not consistently. I have to re-run the test several times before I trigger it, but I suspect the problem is the same one you are having. I am fairly sure a lot of this aio code on illumos and Solaris is nearly the same and has likely been stable for a long time, so I am guessing that something like a higher CPU count could be why it is more consistent for you, if it is indeed consistent.

szaydel Jul 21 '19 14:07

@szaydel, thanks for your support! If there is anything I can help with, please feel free to let me know.

lic34 Jul 24 '19 07:07

@lic34 just to clarify, @szaydel was asking you the following (I've reworded things based on my interpretation):

  • Is the failure you see with solarisaio intermittent or constant?

sitsofe Jul 24 '19 15:07

@sitsofe, thanks - that's exactly what I meant. :)

szaydel Jul 25 '19 03:07

Sorry for my late reply. The core dump did not occur every time, but the "job number decreased to 1 in a short time after fio starts" problem happened on every run.

lic34 Jul 29 '19 12:07

Thanks @lic34. I did not observe this decrease, but at least I am reproducing the periodic crashes. I am going to see if I can do something about it when I find some free time.

szaydel Jul 31 '19 03:07

@lic34 is this one still happening with the latest fio releases? If so, do you think you could post a backtrace of the crashes? Thanks!

sitsofe Jan 16 '21 11:01