
Performance drop when device identify takes more than 100ms

Open · parkch0708 opened this issue 2 years ago · 6 comments

Description of the bug: When the storage identify takes more than 100 ms, there is a performance drop, especially in random read IOPS with a multi-thread job, where the drop exceeds 20%.

Environment: CentOS 8, Dell server with PCIe 4.0 NVMe

fio version: Latest code

Reproduction steps: The performance drop occurs when I use --rw=randrw --rwmixread=100. If I use --rw=randread instead, fio does not perform the identify, so there is no problem.
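For reference, a minimal reproduction sketch built from the parameters above; the device path, block size, queue depth, thread count, and io_uring engine are assumptions, so adjust them to match the affected setup:

```sh
# Hypothetical reproduction: randrw with a 100% read mix goes through the
# identify path, while plain randread does not (per the report above).
fio --name=repro --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring \
    --rw=randrw --rwmixread=100 --bs=4k --iodepth=32 --numjobs=8 \
    --runtime=60 --time_based --group_reporting

# Baseline for comparison (no identify, no drop reported):
fio --name=baseline --filename=/dev/nvme0n1 --direct=1 --ioengine=io_uring \
    --rw=randread --bs=4k --iodepth=32 --numjobs=8 \
    --runtime=60 --time_based --group_reporting
```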

When I increase the wait time for the started threads (backend.c line 2513, 100 ms -> more than 300 ms), the issue disappears, but Mr. Axboe wants to fix it a different way.

parkch0708 · Apr 04 '23 15:04

You're missing the most important bit: WHY does it cause a performance drop when identify takes too long? The identify only happens on open, correct? And as far as I can tell, it's cached once it's done.

axboe · Apr 04 '23 15:04

> You're missing the most important bit: WHY does it cause a performance drop when identify takes too long? The identify only happens on open, correct? And as far as I can tell, it's cached once it's done.

I also expected fio to work like that. However, once the performance drop begins, performance does not recover until the job finishes (increasing the runtime makes no difference). Also, when the drop occurs, the fio process's CPU core usage increases (CPU core idle reaches 0% in severe cases).
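To observe this symptom while a job runs, a minimal sketch is to watch per-thread CPU usage of fio and of any io_uring worker kernel threads; worker thread naming varies by kernel version:

```sh
# Per-thread CPU usage of the running fio process.
top -H -p "$(pgrep -x fio | head -n 1)"

# Look for io_uring worker kernel threads burning CPU alongside fio
# (named e.g. iou-wrk-* or io_wq* depending on kernel version).
ps -eLo pid,comm,pcpu --sort=-pcpu | grep -Ei 'fio|iou|io_wq' | head
```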

parkch0708 · Apr 04 '23 15:04

Is there someone opening the device while the test is running? If it's not a cached identify, then it's quite possible it'll quiesce the OS level queue and hence cause io-wq activity from io_uring. I suspect this is what is causing the issue.

So I do think that it's likely that there's an issue here, but I also think it's important to fully understand this issue rather than try and work around it with random delays. What happens when someone runs your same test on a device that takes 0.5s to do identify?

axboe · Apr 04 '23 15:04
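One way to check whether an identify (or any other admin command) actually reaches the device mid-run is to record the kernel's NVMe trace events while the job is active; a sketch, assuming the nvme:nvme_setup_cmd trace event is available on the running kernel (verify with perf list first):

```sh
# Confirm the NVMe trace events exist on this kernel.
perf list 'nvme:*'

# Record NVMe command submissions system-wide for 30s while fio runs,
# then inspect the output for admin-queue commands such as identify.
perf record -a -e nvme:nvme_setup_cmd -- sleep 30
perf script | grep -i identify
```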

> Is there someone opening the device while the test is running? If it's not a cached identify, then it's quite possible it'll quiesce the OS level queue and hence cause io-wq activity from io_uring. I suspect this is what is causing the issue.
>
> So I do think that it's likely that there's an issue here, but I also think it's important to fully understand this issue rather than try and work around it with random delays. What happens when someone runs your same test on a device that takes 0.5s to do identify?

In my case, only fio uses the device. I ran the test against the raw block device.

And I agree that increasing the wait time is not the real fix. If the identify finishes later than the increased wait, the same problem will occur again.

parkch0708 · Apr 04 '23 15:04

> And I agree that increasing the wait time is not the real fix. If the identify finishes later than the increased wait, the same problem will occur again.

Exactly. So please dig into this some more until you fully understand the issue.

axboe · Apr 04 '23 15:04

> > And I agree that increasing the wait time is not the real fix. If the identify finishes later than the increased wait, the same problem will occur again.
>
> Exactly. So please dig into this some more until you fully understand the issue.

Thank you for your comment. I will try.

parkch0708 · Apr 04 '23 15:04