
`fio` stops when a single path fails in a multipath LUN when using rate_iops

Open it800 opened this issue 9 months ago • 1 comments

Please acknowledge the following before creating a ticket

  • [X] I have read the GitHub issues section of REPORTING-BUGS.

Description of the bug: During testing with fio, we encountered and repeatedly confirmed a scenario that we consider to be a defect. The test was performed on a storage system connected over iSCSI. The LUN was available via 8 paths, and testing was done using the multipath device /dev/dm-0.

The test case involved disabling one physical Ethernet port on the storage system to observe its impact on I/O. The port was disabled destructively by physically turning it off on the switch. The result we observed: I/O from fio stopped for 5 to 15 seconds. After that, I/O resumed, with a sharp spike in IOPS that exceeded the configured rate_iops.

Analysis showed that fio paused for as long as the failed path was still marked active and the I/O operations sent to it had not yet failed. In other words, fio continued generating load only while operations were either completing successfully or failing quickly. The destructive port shutdown caused multipathd to delay marking the path as failed, which happened only after the iSCSI-level timeouts expired. During this entire period, fio was stalled, waiting for the hanging I/O operations to complete.

Please note: the issue does not occur if rate_iops is not used. fio continues to generate test load as long as at least one path is still working.

When using Fibre Channel instead of iSCSI, this issue either does not occur at all or the fio pause is very short. We believe this is due to protocol differences: when FC ports are turned off, all nodes in the zone are notified, so the host driver quickly marks the path as failed. As a result, the delay is usually not noticeable.

Environment: Several Linux distros: Ubuntu, CentOS 8

fio version: 3.36, 3.39

Reproduction steps: Configure the rate_iops parameter, start fio against the multipath device, then break one (or more) paths.
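The report does not include the job file, so the following is only a sketch of what a reproduction command might look like. The device `/dev/dm-0` and the use of `rate_iops` come from the description above; everything else (the helper name `build_fio_command`, the job name, `randread`, the 4k block size, the `libaio` engine, the queue depth, the runtime, and the rate of 1000) is an assumption chosen for illustration. A read-only workload is used so the sketch does not overwrite data on the LUN.

```python
import shlex


def build_fio_command(device="/dev/dm-0", rate_iops=1000, runtime_s=300):
    """Build a fio command line for a rate-limited run on a multipath device.

    Only the device path and the use of rate_iops come from the report;
    all other values are illustrative assumptions.
    """
    return [
        "fio",
        "--name=multipath-rate-test",   # hypothetical job name
        f"--filename={device}",         # multipath device from the report
        "--rw=randread",                # assumption: non-destructive reads
        "--bs=4k",                      # assumption
        "--ioengine=libaio",            # assumption
        "--iodepth=16",                 # assumption
        "--direct=1",                   # bypass the page cache
        f"--rate_iops={rate_iops}",     # the option that triggers the issue
        "--time_based",
        f"--runtime={runtime_s}",       # long enough to fail a path mid-run
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_fio_command()))
```

With the job running, one path would then be broken as described (for example by disabling the port on the switch). Dropping the `--rate_iops` argument gives the comparison case in which, per the report, the stall does not occur.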

it800, Apr 04 '25 15:04

@it800: Hmm, are you basically saying the problem isn't that I/Os are stalled (at all) but rather that:

  1. I/Os are stalled for too long
  2. When I/O returns it's at an unreasonably high rate?

Exactly what happens with multipathing will depend on your configuration and all the stacked timeouts. However, from fio's perspective, we could try to see what happens when rate_iops is used on a device mapper target while that device is suspended. It could be that such a disruption causes the target rate to go out of control once I/Os are allowed to continue...
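To make the hypothesis concrete, here is a toy model of how a rate limiter could produce a burst after a stall. This is not fio's code and makes no claim about how fio implements rate_iops. It assumes a limiter that paces against elapsed wall-clock time (so it tries to make up any deficit after a stall) and compares it with one that simply issues a fixed number of I/Os per second. The function `simulate` and all of its parameters are hypothetical.

```python
def simulate(rate_iops, stall_start, stall_end, duration, catch_up, max_iops):
    """Return the I/Os completed in each one-second bucket.

    Toy model only. During [stall_start, stall_end) nothing completes,
    as if every in-flight I/O were hung on a dead path.
    """
    completed = 0
    per_second = []
    for second in range(duration):
        if stall_start <= second < stall_end:
            per_second.append(0)
            continue
        if catch_up:
            # Pace against wall-clock time: aim for the total that should
            # have been done by the end of this second, limited by what
            # the device can actually deliver.
            target_total = rate_iops * (second + 1)
            issued = min(target_total - completed, max_iops)
        else:
            # Forget the stalled period and issue the configured rate.
            issued = rate_iops
        completed += issued
        per_second.append(issued)
    return per_second


if __name__ == "__main__":
    # 1000 IOPS target, 5-second stall, device capable of 4000 IOPS.
    print(simulate(1000, 3, 8, 12, catch_up=True, max_iops=4000))
    print(simulate(1000, 3, 8, 12, catch_up=False, max_iops=4000))
```

In the catch-up variant the two seconds after the stall run at 4000 and 3000 IOPS before settling back to 1000, which resembles the spike above the configured rate_iops described in the report. Whether fio actually behaves this way on a suspended device mapper target is exactly what would need to be tested.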

sitsofe, Jul 21 '25 16:07