
When doing data validation with read_iolog (based on an earlier write_iolog run) I get a segfault

Open dori-vlz opened this issue 6 months ago • 3 comments

Summary

FIO crashes with a segmentation fault when running data validation using verify_only=1 and read_iolog on a file system

Reproduction Steps

FIO version: fio-3.36
Platform: Linux Ubuntu 24.04, Kernel 6.8.1029

1st job: format a 1G file with pattern 0xAA:

[global]
create_serialize=0
numjobs=1
iodepth=16
size=1g
group_reporting=1
file_service_type=random
directory=/mnt/volumez/attvol0
filename_format=fiodata.$jobnum
verify_dump=1
ioengine=libaio
exitall_on_error=1
end_fsync=1
stonewall=1
fallocate=none
lat_percentiles=1
max_latency=45s
direct=1
do_verify=1
verify=pattern
verify_pattern=0xAA
[seq_0_100_128k]
bs=128k
rw=write

2nd job: random writes to the 1G file with pattern 0xBB (no overwrites), logging the writes with write_iolog:

[global]
create_serialize=0
numjobs=1
iodepth=16
size=1g
group_reporting=1
file_service_type=random
directory=/mnt/volumez/attvol0
filename_format=fiodata.$jobnum
verify_dump=1
ioengine=libaio
exitall_on_error=1
end_fsync=1
stonewall=1
fallocate=none
lat_percentiles=1
max_latency=45s
direct=1
do_verify=1
verify=pattern
verify_pattern=0xBB
norandommap=0
write_iolog=rand_write_blocks.log
[rand_0_100_128k]
bs=128k
rw=randwrite

While the 2nd run is in progress I kill fio, so not all of the 1 GB is modified with the new pattern 0xBB.
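
The exact way the writer is interrupted is not described here, and it can matter: a signal fio can catch gives it a chance to shut down cleanly, while SIGKILL cannot be caught and any buffered iolog data may be lost. A hedged sketch of two common ways to interrupt the run (timings illustrative):

sleep 5 && pkill -INT fio    # catchable signal: fio gets a chance to shut down cleanly
sleep 5 && pkill -KILL fio   # SIGKILL cannot be caught; buffered iolog data may be lost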

3rd job: data validation based on the write log, verifying that all written blocks carry pattern 0xBB:

[global]
create_serialize=0
numjobs=1
iodepth=16
size=1g
group_reporting=1
file_service_type=random
directory=/mnt/volumez/attvol0
filename_format=fiodata.$jobnum
verify_dump=1
ioengine=libaio
exitall_on_error=1
end_fsync=1
stonewall=1
fallocate=none
lat_percentiles=1
max_latency=45s
direct=1
do_verify=1
verify=pattern
verify_pattern=0xBB
norandommap=0
verify_only=1
read_iolog=rand_write_blocks.log
[valid_0_100_128k]
bs=128k
rw=read
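
For reference, the three jobs above are run as separate fio invocations, one after the other; a sketch assuming the first two job files are saved under illustrative names (the third name is taken from the output below):

fio format_seq.fio       # 1st job: lay down the 0xAA pattern
fio rand_write.fio       # 2nd job: interrupted partway through
fio random_validate      # 3rd job: the validation run that segfaults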

Error:

root@i-03e0e3c7af179cea6:~# fio random_validate
seq_0_100_128k: (g=0): rw=read, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=libaio, iodepth=16
fio-3.36
Starting 1 process
fio: pid=227701, got signal=11


Run status group 0 (all jobs):
free(): double free detected in tcache 2
Aborted (core dumped)

From dmesg I see:

[Wed Jun 11 08:17:10 2025] fio[227701]: segfault at 195 ip 00005e3781b96d63 sp 00007ffcf83f1c90 error 4 in fio[5e3781b85000+8d000] likely on CPU 3 (core 1, socket 0)
[Wed Jun 11 08:17:10 2025] Code: e8 32 fc fe ff 66 90 f3 0f 1e fa 55 48 89 e5 41 56 41 55 41 54 53 48 89 f3 48 83 ec 10 64 48 8b 04 25 28 00 00 00 48 89 45 d8 <8b> 86 74 01 00 00 a8 02 0f 85 7f 01 00 00 a8 01 0f 85 24 05 00 00
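
Since the run ends with "Aborted (core dumped)", a backtrace can often be recovered from the core file as well; a sketch assuming the same fio binary (ideally built with debug symbols) is still at hand:

coredumpctl gdb fio              # on systemd systems that capture core dumps
gdb ./fio /path/to/core -ex bt   # or point gdb at a plain core file directly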

Expected Results: Validation should complete without crashing.

BTW, I also tried this on a raw block device and got different results: there, validation seems to always pass even if I give an invalid pattern, but I guess this is a different issue.

Regards, Doron Tal

dori-vlz avatar Jun 11 '25 08:06 dori-vlz

Hello @dori-vlz:

One big question: how are you killing the fio that is writing the iolog file? If you somehow have incomplete iolog records, then the fio trying to read the iolog is going to get tripped up. Supporting such a scenario would need a far more complex logging system that could somehow indicate which records were fully complete.
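
A quick way to check whether the log ends mid-record, assuming a plain-text iolog (the exact record layout depends on the iolog version):

tail -n 3 rand_write_blocks.log           # inspect the last few entries
tail -c 1 rand_write_blocks.log | od -c   # no trailing newline hints at a cut-off record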

To narrow the problem down I think we're going to need additional work and more information:

  • Does this segfault happen 100% of the time?
  • Can you rebuild fio from the latest source (note the build requirements for the libaio ioengine) and still make the problem happen?
  • Can you minimise the job file and command line options (it's important to know them all) such that you have the smallest amount that still reproduces the issue? Don't stop at the first option that turns out to be required; put it back and then try removing the next option, and so on.
  • Can you make the problem happen with the remaining options set at their smallest possible values? Make size as small as possible, make iodepth as small as possible such that the problem still happens, etc. Don't stop at just those options - reduce as many of the remaining options as you can.
  • Can you make the problem happen with thread set? If so, can you perform the run that segfaults with thread set under gdb so you can get a backtrace and post it here? (A sketch follows this list.)
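
A sketch of capturing such a backtrace, assuming thread=1 has been added to the job file and fio was built with debugging information (the job file name is taken from the reporter's output):

gdb --args fio random_validate
(gdb) run
... wait for the SIGSEGV to be reported ...
(gdb) thread apply all bt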

At the bare minimum I think we need to see a sane backtrace to find out just where in the code the problem happens or the ability to reproduce the problem ourselves so it can be debugged locally.

sitsofe avatar Jun 11 '25 14:06 sitsofe

Thanks for your reply and sorry for the delayed response. I will address some of your items and will work on testing the others.

  1. This issue is occurring 100% of the time
  2. I tried the latest build and still got it
  3. I will try to reduce the options and retry

In addition, I consulted an AI and it seems the issue is with the log file type. It said that this capability requires the classic iolog write job format, while my runs create a file in the event-based v3 iolog format.
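
The iolog version can be checked directly from the first line of the log; a sketch assuming the log is the plain-text format fio writes by default:

head -n 1 rand_write_blocks.log   # prints a header along the lines of "fio version <N> iolog"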

dori-vlz avatar Jun 16 '25 08:06 dori-vlz

@dori-vlz: Would you be able to add the answer to this question:

How are you killing the fio writing the iolog file?

As the issue is happening 100% of the time, could you also attach rand_write_blocks.log as a file to this issue?

there validation seems to always pass even if i give invalid pattern but this i guess a different issue

Yes, that sounds separate, so it should be filed as a separate issue. As with this issue, please cut the problem jobs down to the bare minimum that still demonstrates the problem before filing.

sitsofe avatar Jun 16 '25 09:06 sitsofe

Closing due to lack of reply from reporter. If this issue is still happening with the latest fio (see https://github.com/axboe/fio/releases to find out which version that is) please reopen. Thanks!

sitsofe avatar Jul 21 '25 20:07 sitsofe