fio reports failure when 112 HDDs each run fio with numjobs=32, iodepth=32
Please acknowledge the following before creating a ticket
- [Y ] I have read the GitHub issues section of REPORTING-BUGS.
Description of the bug: There are 112 hard drives in one system. We run the following fio command against each of the 112 HDDs:
fio --name=${job_name} --filename=/dev/"$DEV" --ioengine=libaio --direct=1 --thread=1 --numjobs=32 --iodepth=32 --rw=write --bs=128k --runtime=3600 --time_based=1 --size=100% --group_reporting --log_avg_msec=1000 --bwavgtime=1000
fio then reports: fio: job startup hung? exiting.
It succeeds when we change "--numjobs=32" to "--numjobs=16".
How many concurrent I/O jobs does fio support?
Environment: Broadcom SAS Expander:11.06.06.03
HBA: 9600-24i FW: 8.13.2.0 Driver:8.13.1.0.0
CPU Type: Intel(R) Xeon(R) 6519P-C Memory Type: Samsung M321R8GA0EB2-CCPWC, 64GB * 32
HDD Type: Seagate ST32000NM004K-3U FW:SE02 *112
OS Debian 10.11 fio version: 3.40
Reproduction steps
Run the following fio command against each of the 112 HDDs:
fio --name=${job_name} --filename=/dev/"$DEV" --ioengine=libaio --direct=1 --thread=1 --numjobs=32 --iodepth=32 --rw=write --bs=128k --runtime=3600 --time_based=1 --size=100% --group_reporting --log_avg_msec=1000 --bwavgtime=1000
Hello @Hank-Zhao201209:
I can tell you that the version of fio you are using is not 4.40 because at the time of writing version 3.40 is the latest (see https://github.com/axboe/fio/releases ) - is that the version you meant?
Can you reduce the job file/command line options to the smallest set that still reproduces the issue? Remove each option in turn and see if the problem continues to happen; if it still happens, leave that option out. If an option turns out to be required, don't stop at that option: put it back, then try to remove the next option, and so on.
It would also be helpful to know the exact numjobs value between 18 and 32 at which the problem starts happening.
Hi sitsofe, yes, it's 3.40. OK, we can try your suggestion. Thank you! Does the tool have a designed limit on the maximum number of I/O tasks?
@Hank-Zhao201209:
Does the tool have a designed limit on the maximum number of I/O tasks?
No, not in that area by design. Maximum job limits should only be reached due to resource exhaustion. Just to check, what environment are you running this in? Knowing that you're using a SAS expander isn't what we mean; we need things like OS distribution, kernel version, whether you're running on bare metal or in a container, etc.
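For example, a few generic limits worth glancing at when a large number of simultaneous jobs fails to start (just a sketch, nothing here is specific to your setup):

```bash
ulimit -u                          # max user processes/threads
ulimit -n                          # max open file descriptors
cat /proc/sys/kernel/threads-max   # system-wide thread limit
cat /proc/sys/kernel/pid_max       # maximum pid value
cat /proc/sys/fs/aio-max-nr        # total async I/O contexts (matters for libaio)
free -h                            # available memory
```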
Debian 10.13, kernel 5.15.152.ve.8-amd64 (it's the customer's own kernel). The script runs on bare metal, not in a container or VM.
@Hank-Zhao201209: For what it's worth, I've run
fio --name=job --filename=/dev/nullb0 --ioengine=libaio --direct=1 --thread=1 --numjobs=32 --iodepth=32 --rw=write --bs=128k --runtime=3600 --time_based=1 --size=100% --group_reporting --log_avg_msec=1000 --bwavgtime=1000
on a machine with 16G of RAM without issue. Are you saying that you're launching 112 fio invocations in a for loop and all of them are running simultaneously?
Yes, this is the script that calls the fio script:
CASE=$1
TIME=$2
for dev in $(lsscsi | grep WUH | awk '{print $6}' | awk -F/ '{print $3}')
do
nohup sh ./$CASE $dev $TIME &
done
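For what it's worth, a slightly hardened sketch of the same launcher (quoting added and per-device stdout/stderr captured to log files; the lsscsi/grep pipeline is taken from the script above):

```bash
#!/bin/sh
CASE=$1    # per-device fio script
TIME=$2    # runtime passed through to that script

# One background run per matching disk, each with its own log file.
for dev in $(lsscsi | grep WUH | awk '{print $6}' | awk -F/ '{print $3}'); do
    nohup sh "./$CASE" "$dev" "$TIME" > "nohup_${dev}.log" 2>&1 &
done
wait    # keep the parent alive until all runs finish
```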
@sitsofe: After testing, it starts to fail when numjobs is set to 28.
nohup128k16jobsOk.txt nohup128k18jobsOk.txt nohup128k20jobsOk.txt nohup128k24jobsOk.txt nohup128k26jobsOk.txt nohup128k27jobsOk.txt nohup128k28jobsFail.txt nohup128k32jobsFail.txt
@Hank-Zhao201209: I've run
modprobe null_blk nr_devices=112
for i in {0..111}; do
fio --minimal --name=job --allow_file_create=0 --filename=/dev/nullb$i --ioengine=libaio --direct=1 --thread=1 --numjobs=32 --iodepth=32 --rw=write --bs=128k --runtime=20s --time_based=1 --size=100% --group_reporting --log_avg_msec=1000 --bwavgtime=1000 &
done
wait
on a machine with 30GBytes of RAM. I had to tune aio-max-nr to be higher (I did
echo "131072" > /proc/sys/fs/aio-max-nr
) to avoid hitting errors like:
fio: pid=27309, err=11/file:engines/libaio.c:407, func=io_queue_init, error=Resource temporarily unavailable
but all jobs ran to completion. Is it possible for you to record the stdout/stderr of running the above on your machine and then search the logged output for any errors? If you do find an error, would you be able to include just the relevant part here in this ticket (and possibly attach the full log as a file attachment)?
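For context, a back-of-the-envelope check of why aio-max-nr needed raising (my assumption here is that each libaio job wants roughly iodepth aio events, so the common default of 65536 is too small for this workload):

```bash
echo $((112 * 32 * 32))                  # 114688 aio events wanted in total
sysctl fs.aio-max-nr                     # current limit (commonly 65536)
echo 131072 > /proc/sys/fs/aio-max-nr    # the value I used above (run as root)
```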
Hi @sitsofe, yes, I get the same result as you! The only difference is that the message refers to "libaio.c:443" in fio version 3.40. But this error message is not the same as the issue we are hitting now.
@Hank-Zhao201209:
It was worth a try. OK back to my earlier request:
Can you reduce the job file/command line options to the smallest set that still reproduces the issue? Remove each option in turn and see if the problem continues to happen; if it still happens, leave that option out. If an option turns out to be required, don't stop at that option: put it back, then try to remove the next option, and so on.
I appreciate some options may just be required (e.g. filename, time_based and runtime), but cut as much as possible to reduce the number of places to search in. It is also worth reducing values in options like iodepth (to 1) and bs (to 4k) so we can rule those out as having an impact. Finally, after you have done all the above, can you try switching the ioengine to the default psync and see if the problem still happens?
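As a concrete (purely illustrative) starting point, a reduced command along these lines keeps the options you need and minimizes the rest, with ${job_name} and $DEV as in your wrapper script:

```bash
# iodepth and the libaio engine dropped (default engine is psync), bs reduced to 4k,
# logging/reporting options removed; adjust runtime as needed.
fio --name="${job_name}" --filename=/dev/"$DEV" --direct=1 \
    --numjobs=32 --rw=write --bs=4k --runtime=200 --time_based=1
```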
@sitsofe Small blocks (like 4k) and a lower iodepth can succeed, but the test case needs a 128k block size with iodepth=32. Can you help check the root cause with these parameters?
@Hank-Zhao201209: The problem is I can't reproduce the problem you are seeing locally (likely because I don't have access to the resources and environment you do). Due to this, I'm thinking about trying to work out where your problem lies via code inspection but doing this is time intensive and I have very little time to dedicate to this (perhaps half an hour tops). The more parameters there are the more places that need to be searched and when the search space is too large I will run out of my time budget to investigate further...
This leaves us with two options:
- You find a way to make the issue happen on synthetic devices (like the nullblk one I tried earlier) so that I can reproduce the problem locally.
- Even though it makes your jobs less realistic, for debugging purposes you identify every option that doesn't impact the reproducibility of the issue and minimize the values of the rest. For example, is log_avg_msec needed to reproduce the problem? Remove it and see if the problem still happens. If it does, put it back and then try to remove another option, and so on. If it doesn't have an impact, leave it removed and then move on to the next option.
Again I ask: does the problem happen when you are using the psync ioengine? Does the problem happen with thread=0? It's interesting that smaller block sizes are successful, as this hints that perhaps you're running out of some memory-related resource...
Just to be clear, even if we identify the root cause I'm not committing to fixing the issue (because my time is limited), but at least we can document it, and that allows someone else to take it further.
@sitsofe: The psync ioengine fails too. It still fails with thread=0, and fio does not show up as running in ps -aux.
This is the smallest parameter set that still reproduces the issue:
fio --name=${job_name} --filename=/dev/$DEV --ioengine=libaio --direct=1 --thread=1 --numjobs=28 --iodepth=32 --rw=write --bs=128k --runtime=200
We also found that if we change direct from 1 to 0, fio can run with all 112 HDDs.
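For reference, a quick A/B of the direct flag with that reduced command line could look like this (sketch only; single device shown, ${job_name} and $DEV as in the wrapper script):

```bash
# Compare buffered I/O (direct=0) vs O_DIRECT (direct=1) with otherwise identical options.
for direct in 0 1; do
    echo "=== direct=$direct ==="
    fio --name="${job_name}" --filename=/dev/"$DEV" --ioengine=libaio \
        --direct="$direct" --thread=1 --numjobs=28 --iodepth=32 \
        --rw=write --bs=128k --runtime=200
done
```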
@Hank-Zhao201209:
You said you can also change the ioengine to psync? If so, does the problem also happen when ioengine is omitted? Additionally, you said it happens with thread=0, so can that be omitted too? Just to check: are there any relevant messages in dmesg when this issue happens?
@Hank-Zhao201209: There's something strange in the logs when things start going wrong:
stat: No such file or directory
stat: No such file or directory
stat: No such file or directory
fio: job startup hung? exiting.
fio: 27 jobs failed to start
fio: job startup hung? exiting.
fio: 15 jobs failed to start
Are you using client/server mode?
Hi @sitsofe: There is no abnormal dmesg output when the command fails. Yes, "thread" and "ioengine" are not key parameters; they can be omitted. The "stat: No such file or directory" message is printed by the script. I'm not sure why direct=0 can run while direct=1 cannot.
hi @sitsofe :
We found the failure comes from backend.c -> run_threads() -> fio_sem_down_timeout(): the 10s timeout is not enough for this many fio tasks. We changed it from 10000 to 50000, and the failing 32-job case passes with this change. I want to check with you: is the value 10000 from some spec?
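For anyone who wants to repeat the same experiment, a rough sketch of the steps (not a proper fix; the exact line may differ between versions, so locate it with grep and edit by hand):

```bash
git clone https://github.com/axboe/fio.git && cd fio
git checkout fio-3.40
grep -n fio_sem_down_timeout backend.c   # find the 10000 ms startup wait in run_threads()
# change that timeout value there (e.g. 10000 -> 50000), then rebuild:
./configure && make -j"$(nproc)"
```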
Another workaround also passes: adding --debug=all, --debug=file or --debug=mem makes the run pass, while --debug=process still fails. I think the debug mode changes the timing.
@Hank-Zhao201209:
I want to check with you: is the value 10000 from some spec?
I don't think the value is from any spec, it's just an arbitrary choice. Digging through the git history, it looks like the value was introduced with 656b1393d43f9f22738404582ea14dec956aea83 ("Add code to detect a task that exited prior to up'ing the startup mutex") and at that point the value was set to 10 ~~milliseconds~~ seconds.
adding --debug=all, --debug=file or --debug=mem makes the run pass, while --debug=process still fails.
That sounds really strange because --debug=all turns on --debug=process! Are you sure --debug=all also works?
One more query, if you use the same device for all jobs (e.g. --filename=/dev/sdd) can you still reproduce the problem? ~~Also how powerful is your CPU - is it something embedded?~~ (You've already mentioned it's a Xeon(R) 6519P-C which should be pretty fast)
That sounds really strange because --debug=all turns on --debug=process! Are you sure --debug=all also works?
Yes. At first we used the "process" level and it still failed, so we retried with the "all" level and it passed; then we tried the file and mem levels to verify, and they passed too. It seems to be a timing issue: the extra debug logic takes more time to evaluate the print conditions, or there is simply too much printing.
One more query, if you use the same device for all jobs (e.g. --filename=/dev/sdd) can you still reproduce the problem? ~~Also how powerful is your CPU - is it something embedded?~~ (You've already mentioned it's a Xeon(R) 6519P-C which should be pretty fast)
We will try that later, because the system is currently being used to test another issue.
Hi @sitsofe: We also used another method to avoid this issue (a rough script sketch follows the list below):
- run the FIO tasks for the first 37 HDDs
- sleep 50s
- run the FIO tasks for the next 38 HDDs
- sleep 50s
- run the FIO tasks for the last 37 HDDs
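A rough sketch of that staggered launch (assuming the same lsscsi device list and per-device $CASE script as in the wrapper above; the original split was 37/38/37, here it simply pauses after every 38 devices):

```bash
# Start the per-device fio runs in batches, sleeping 50s between batches.
devs=$(lsscsi | grep WUH | awk '{print $6}' | awk -F/ '{print $3}')
count=0
for dev in $devs; do
    nohup sh "./$CASE" "$dev" "$TIME" > "nohup_${dev}.log" 2>&1 &
    count=$((count + 1))
    if [ $((count % 38)) -eq 0 ]; then
        sleep 50
    fi
done
wait
```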
@Hank-Zhao201209:
Re staggering jobs/waiting longer/using --debug=all avoiding the issue: this seems to suggest the problem is some sort of contention point that resolves itself with time... I'm surprised I can't reproduce the issue myself though. Does using --debug=process,mutex also avoid the issue?
@sitsofe: I will try testing with --debug=process,mutex tomorrow; the system is running a long fio test right now. Maybe the OS does not have enough resources to start all the newly created fio tasks within the 10s timeout when so many fio tasks are created at once. That is why I modified the timeout value used for that check, as described above.
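If it helps, a crude way to watch how quickly the fio workers actually come up during that startup window (sketch only, run alongside the launcher):

```bash
# Print a per-second count of running fio processes.
while sleep 1; do
    printf '%s  %s fio processes\n' "$(date +%T)" "$(pgrep -cx fio)"
done
```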
@sitsofe :
With those two debug options (--debug=process,mutex) added, fio runs successfully when all 112 run together.
@Hank-Zhao201209:
With those two debug options (--debug=process,mutex) added, fio runs successfully when all 112 run together.
OK, that's a shame. Based on your supporting code snippet above (FYI: a text copy is preferable to a screenshot), are you only running one fio invocation at any one time but still seeing a failure with something other than the first one?
Were you still able to reproduce the problem when using the same device for all jobs with the reduced command line wrapped in your looping code
fio --name=${job_name} --filename=/dev/$SAMEDEV --direct=1 --thread=1 --numjobs=32 --rw=write --bs=128k --runtime=200
as you mentioned above?
(FYI: be careful with timings where all your jobs are working with the same part of a file at the same time. From https://github.com/axboe/fio/discussions/1903#discussioncomment-13243560 :
That "single file, multiple jobs" infers that you will be running 16 jobs over the same 1g region which means they can interfere with each other and because you're doing writes that means certain writes may end up being thrown away. For example, [numjob1] writes the first 4k but then [numjob2] immediately writes the same 4k before the job1's 4k gets all the way to non-volatile storage. Due to something in your storage stack doing an optimisation, [numjob1]'s 4k may be thrown away and success returned because it's already been replaced! To sidestep this issue you may want to arrange for different jobs to write to different regions of the same file by using
offset_increment.
)
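To illustrate the offset_increment idea with made-up values (none of these numbers come from this thread): each job writes its own non-overlapping region of the same device:

```bash
# Job N starts at offset N * 1g, so the 16 jobs never write over each other.
fio --name=regions --filename=/dev/sdd --direct=1 --rw=write --bs=4k \
    --numjobs=16 --size=1g --offset_increment=1g
```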
@sitsofe :
Were you still able to reproduce the problem when using the same device for all jobs with the reduced command line wrapped in your looping code
Yes, it reproduces 100% of the time.
(FYI: be careful with timings where all your jobs are working with the same part of a file at the same time. From https://github.com/axboe/fio/discussions/1903#discussioncomment-13243560 :)
Our failure is that the fio task does not run successfully; we don't care that [numjob1]'s 4k may be thrown away, because that case still returns success.