Performance drop when device identify takes more than 100ms
Description of the bug: When the storage device identify takes more than 100 ms, there is a performance drop, especially in random read IOPS with a multi-threaded job, where the drop is over 20%.
Environment: CentOS 8, Dell server with a PCIe 4.0 NVMe drive
fio version: Latest code
Reproduction steps: When I use these parameters, the performance drop occurs: --rw=randrw --rwmixread=100. If I use the '--rw=randread' parameter instead, fio doesn't issue the identify, so there is no problem. A full command line along these lines is sketched below.
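For reference, a fuller invocation along these lines should reproduce the setup described. The device path, ioengine, block size, queue depth, and job count are assumptions for illustration, not values from this report:

```sh
# Hypothetical repro; /dev/nvme0n1, io_uring, bs, iodepth and
# numjobs are assumed values, not taken from the original report.
fio --name=identify-repro --filename=/dev/nvme0n1 --direct=1 \
    --ioengine=io_uring --rw=randrw --rwmixread=100 \
    --bs=4k --iodepth=32 --numjobs=8 --group_reporting \
    --time_based --runtime=60

# Control case: plain random read, which skips the identify path
# and shows no drop.
fio --name=identify-control --filename=/dev/nvme0n1 --direct=1 \
    --ioengine=io_uring --rw=randread \
    --bs=4k --iodepth=32 --numjobs=8 --group_reporting \
    --time_based --runtime=60
```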
When I increase the wait time for the started threads (backend.c line 2513, 100 ms -> more than 300 ms), the issue goes away, but Mr. Axboe wants to fix it in another way.
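As a side note, one way to check how long identify actually takes on a given controller is to time it directly with nvme-cli. Whether this exercises exactly the same path fio's open triggers is an assumption, and /dev/nvme0 is a placeholder:

```sh
# Time an NVMe Identify Controller admin command (nvme-cli).
# /dev/nvme0 is a placeholder for the controller under test.
time nvme id-ctrl /dev/nvme0 > /dev/null
```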
You're missing the most important bit: WHY does it cause a performance drop when identify takes too long? The identify only happens on open, correct? And as far as I can tell, it's cached once it's done.
I also expected fio to work like that. However, once the performance drop begins, performance does not recover until the job finishes (increasing the runtime makes no difference). Also, when the drop occurs, the fio process's CPU usage climbs (0% CPU core idle in severe cases).
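For what it's worth, the CPU symptom can be watched per thread while the job runs; the pgrep pattern below assumes the process is simply named fio:

```sh
# Show per-thread CPU usage of the running fio process(es).
top -H -p "$(pgrep -d, -x fio)"
```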
Is there someone opening the device while the test is running? If it's not a cached identify, then it's quite possible it'll quiesce the OS level queue and hence cause io-wq activity from io_uring. I suspect this is what is causing the issue.
So I do think that it's likely that there's an issue here, but I also think it's important to fully understand this issue rather than try and work around it with random delays. What happens when someone runs your same test on a device that takes 0.5s to do identify?
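If io-wq offload is the suspect, a quick check (assuming a 5.12+ kernel, where io_uring worker threads are named iou-wrk-<pid>) is to look for those workers appearing while the drop is in effect:

```sh
# io_uring punts requests it can't complete inline to io-wq
# worker kthreads; on recent kernels they show up as iou-wrk-*
# threads of the submitting process. A burst of them during the
# run points at io-wq activity.
ps -eLo pid,tid,comm | grep iou-wrk
```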
In my case, only fio uses the device. I ran the test against a raw block device.
And I agree that increasing the wait time is not the core solution. If the identify finishes later than the increased wait, the same problem will occur again.
Exactly. So please dig into this some more until you fully understand the issue.
Thank you for your comment. I will try.