blktests
Expected behavior of "block/011 (disable PCI device while doing I/O)"
I'm not sure what the pass/fail criteria are, or should be, for this test. Since this test disables the PCI device, which of these are we expecting?
- PCI device is periodically re-enabled, transaction resubmitted, and everyone is happy.
- PCI device is disabled, transaction times out and device is removed
- Nothing happens
What I am observing is that the nvme driver correctly identifies transactions as timing out, and removes the device:
nvme nvme0: I/O 14 QID 0 timeout, disable controller
nvme nvme0: Identify Controller failed (-4)
nvme nvme0: Removing after probe failure status: -5
However, once the device is removed, the test fails:
+fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags & IO_U_F_FLIGHT) == 0' failed.
+fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags & IO_U_F_FLIGHT) == 0' failed.
+fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags & IO_U_F_FLIGHT) == 0' failed.
So, what are we testing for here?
Hm, this looks like a fio bug. What version of fio are you running? (fio --version)
[root@g-prime blktests]# fio --version
fio-3.3
I hit the same fio error today, along with a BUG_ON in the msi code. I'll take a look.
FYI, block/011 constantly causes "Kernel panic - not syncing: 00: And NMI occurred" on an old HP server (ProLiant BL460c G6 blade):
I think that on some broken hardware, there is no way to pass this test at all 8).
Is that with an ancient kernel? It looks like FFS is trying to notify OS of an error, and the OS doesn't know what's going on.
I've now found some time to work on this GitHub issue and block/011 :) My expectation is that the test case should pass. If it fails, it is either a kernel bug or a test case bug. I observe block/011 failures myself.
To address the failures, I recently made three fixes for block/011. One of them is a fix in fio which avoids the fio assertion failure that @mrnuke reported. The other two are fixes in the test case, 0bb9167 and f8f3321. With these fixes, the test case now runs more stably with QEMU NVMe devices.
I think the test case still has three issues. The first issue is a lockdep WARN with NVMe devices. It is a known issue in the NVMe driver. The second issue is the long runtime. With QEMU NVMe devices, the test case can take up to 4 hours. This is too long, so the test needs some kind of time limit. The third issue, which I noticed recently, is write failures to the test target device after running this test case (in most cases, block/012 then fails). This does not always happen, and needs more investigation to fix. I'll leave this GitHub issue open to track them.
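For the third issue, one way to spot a device left in a bad state after the test is to read its sysfs state file. Below is a minimal Python sketch of such a check, not something taken from blktests. The function names, the set of "healthy" state strings, and the example path are all assumptions for illustration; the exact file and values depend on the device type and kernel.

```python
from pathlib import Path

# Assumed healthy state strings (e.g. "running" for SCSI, "live" for NVMe).
# These are illustrative and may not cover every device type.
HEALTHY_STATES = frozenset({"running", "live"})


def device_state(state_file: Path) -> str:
    """Return the stripped contents of a sysfs-style state file."""
    return state_file.read_text().strip()


def is_healthy(state_file: Path, healthy=HEALTHY_STATES) -> bool:
    """True if the file is readable and holds one of the healthy states."""
    try:
        return device_state(state_file) in healthy
    except OSError:
        # An unreadable or missing file suggests the device is gone.
        return False
```

It could be called after the test with something like `is_healthy(Path("/sys/block/nvme0n1/device/state"))`, where the path is only a guess at where the state file lives on a given system.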
I have applied two patches to blktests. e8c061c caps the runtime of block/011 at 20 minutes by default, which avoids the long runtime issue. 1e6721b restores the test target device to the online or live state, so that the following test cases are not affected. With these changes, I now feel confident that block/011 behaves well. If I see any failure, I can suspect a kernel-side bug. For example, the remaining issue of the lockdep WARN with NVMe devices should be fixed on the kernel side.
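To illustrate the idea of a runtime cap, here is a rough Python sketch of a disable/enable loop that stops either when the I/O finishes or when a time limit is reached. This is not the actual blktests code. The function name, the callbacks, and the 0.2-second interval are hypothetical; only the 20-minute default comes from the description above.

```python
import time
from typing import Callable

DEFAULT_TIMEOUT_S = 20 * 60  # 20-minute default cap described above


def toggle_while_io_runs(
    io_running: Callable[[], bool],
    disable: Callable[[], None],
    enable: Callable[[], None],
    timeout_s: float = DEFAULT_TIMEOUT_S,
    interval_s: float = 0.2,  # hypothetical pause between toggles
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Toggle the device until I/O finishes or the time cap is hit.

    Returns True if I/O finished on its own, False if the cap was reached.
    Every disable() is followed by enable(), so the device is never left
    disabled when the loop exits.
    """
    deadline = clock() + timeout_s
    while io_running():
        if clock() >= deadline:
            return False
        disable()
        sleep(interval_s)
        enable()
        sleep(interval_s)
    return True
```

Passing the clock and sleep functions as parameters is just a design choice that makes the loop easy to exercise without waiting for real time to pass.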
It took a long time, but I'm happy to close this issue :)