fix: asyncio not saving interrupt state

Open · pkit opened this pull request 2 weeks ago • 6 comments

During prepare_save we must unconditionally trigger an interrupt to ensure the guest gets notified after restore. The guest may have suppressed notifications, but after snapshot/restore it needs to be woken up regardless.
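
A minimal, self-contained Rust sketch of that idea (BlockDevice, IrqLine, and signal_used_queue are illustrative stand-ins, not the actual Firecracker types):

// Hypothetical stand-in for the device's interrupt line.
struct IrqLine;

impl IrqLine {
    /// Pretend to assert the used-queue interrupt towards the guest.
    fn signal_used_queue(&self) {
        println!("interrupt asserted");
    }
}

// Hypothetical stand-in for the virtio block device.
struct BlockDevice {
    irq: IrqLine,
}

impl BlockDevice {
    /// Called while pausing the VM before taking a snapshot.
    fn prepare_save(&mut self) {
        // ... drain in-flight async I/O so the snapshot is consistent ...

        // Unconditionally notify the guest: even if it had suppressed
        // notifications before the snapshot, it must be woken up after
        // restore, so the suppression state is deliberately ignored here.
        self.irq.signal_used_queue();
    }
}

fn main() {
    let mut dev = BlockDevice { irq: IrqLine };
    dev.prepare_save();
}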

Fixes #5554

Changes

Fixes a bug where the guest would hang indefinitely waiting for an interrupt after resume.

Reason

See above.

License Acceptance

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check CONTRIBUTING.md.

PR Checklist

  • [x] I have read and understand CONTRIBUTING.md.
  • [x] I have run tools/devtool checkbuild --all to verify that the PR passes build checks on all supported architectures.
  • [x] I have run tools/devtool checkstyle to verify that the PR passes the automated style checks.
  • [x] I have described what is done in these changes, why they are needed, and how they are solving the problem in a clear and encompassing way.
  • [ ] I have updated any relevant documentation (both in code and in the docs) in the PR.
  • [ ] I have mentioned all user-facing changes in CHANGELOG.md.
  • [ ] If a specific issue led to this PR, this PR closes the issue.
  • [ ] When making API changes, I have followed the Runbook for Firecracker API changes.
  • [ ] I have tested all new and changed functionalities in unit tests and/or integration tests.
  • [ ] I have linked an issue to every new TODO.

  • [ ] This functionality cannot be added in rust-vmm.

pkit · Dec 06 '25 23:12

I will add some tests soon.

pkit · Dec 06 '25 23:12

@dobrac I have improved your tests to iterate until the pending-ops condition is reproduced. It now reproduces quite reliably in under 10 iterations for me locally.

pkit · Dec 08 '25 04:12

Codecov Report

:x: Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 83.23%. Comparing base (d130c7d) to head (97d23a9).
:warning: Report is 7 commits behind head on main.

Files with missing lines | Patch % | Lines
...vmm/src/devices/virtio/block/virtio/io/async_io.rs | 33.33% | 2 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5568      +/-   ##
==========================================
- Coverage   83.24%   83.23%   -0.01%     
==========================================
  Files         277      277              
  Lines       29263    29268       +5     
==========================================
+ Hits        24359    24362       +3     
- Misses       4904     4906       +2     
Flag Coverage Δ
5.10-m5n.metal 83.58% <77.77%> (+<0.01%) :arrow_up:
5.10-m6a.metal 82.91% <77.77%> (-0.01%) :arrow_down:
5.10-m6g.metal 80.19% <77.77%> (-0.01%) :arrow_down:
5.10-m6i.metal 83.57% <77.77%> (-0.01%) :arrow_down:
5.10-m7a.metal-48xl 82.90% <77.77%> (-0.01%) :arrow_down:
5.10-m7g.metal 80.19% <77.77%> (-0.01%) :arrow_down:
5.10-m7i.metal-24xl 83.55% <77.77%> (-0.01%) :arrow_down:
5.10-m7i.metal-48xl 83.54% <77.77%> (-0.02%) :arrow_down:
5.10-m8g.metal-24xl 80.18% <77.77%> (-0.01%) :arrow_down:
5.10-m8g.metal-48xl 80.19% <77.77%> (-0.01%) :arrow_down:
6.1-m5n.metal 83.60% <77.77%> (-0.01%) :arrow_down:
6.1-m6a.metal 82.94% <77.77%> (-0.02%) :arrow_down:
6.1-m6g.metal 80.19% <77.77%> (-0.01%) :arrow_down:
6.1-m6i.metal 83.60% <77.77%> (-0.01%) :arrow_down:
6.1-m7a.metal-48xl 82.93% <77.77%> (-0.01%) :arrow_down:
6.1-m7g.metal 80.18% <77.77%> (-0.01%) :arrow_down:
6.1-m7i.metal-24xl 83.61% <77.77%> (-0.01%) :arrow_down:
6.1-m7i.metal-48xl 83.61% <77.77%> (-0.02%) :arrow_down:
6.1-m8g.metal-24xl 80.18% <77.77%> (-0.01%) :arrow_down:
6.1-m8g.metal-48xl 80.19% <77.77%> (-0.01%) :arrow_down:

Flags with carried forward coverage won't be shown.

codecov[bot] · Dec 08 '25 14:12

Codecov's idea of "coverage" seems incorrect here: flagging a debug print is not the best use of coverage checks, so I'm ignoring it.

pkit · Dec 08 '25 20:12

@bchalios @kalyazin I'm not sure how to kick off the Codecov pass again; other than that, this one should be ready.

pkit · Dec 09 '25 17:12

> @bchalios @kalyazin I'm not sure how to kick off the Codecov pass again; other than that, this one should be ready.

Codecov updates automatically every time the PR tests run. Nothing more to do on your side; we just need to let the tests run.

bchalios · Dec 10 '25 08:12

@bchalios Here's the thread dump:

firecracker (pid=155):
[<0>] ep_poll+0x46a/0x4a0
[<0>] do_epoll_wait+0x58/0xd0
[<0>] do_compat_epoll_pwait.part.0+0x12/0x90
[<0>] __x64_sys_epoll_pwait+0x8c/0x150
[<0>] x64_sys_call+0x1814/0x2330
[<0>] do_syscall_64+0x81/0xc90
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

fc_api (pid=156):
[<0>] ep_poll+0x46a/0x4a0
[<0>] do_epoll_wait+0x58/0xd0
[<0>] do_compat_epoll_pwait.part.0+0x12/0x90
[<0>] __x64_sys_epoll_pwait+0x8c/0x150
[<0>] x64_sys_call+0x1814/0x2330
[<0>] do_syscall_64+0x81/0xc90
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

fc_vcpu 0 (pid=161):
[<0>] kvm_vcpu_block+0x4a/0xc0 [kvm]
[<0>] kvm_vcpu_halt+0xfc/0x4b0 [kvm]
[<0>] vcpu_run+0x202/0x280 [kvm]
[<0>] kvm_arch_vcpu_ioctl_run+0x351/0x510 [kvm]
[<0>] kvm_vcpu_ioctl+0x128/0x910 [kvm]
[<0>] __x64_sys_ioctl+0xa0/0x100
[<0>] x64_sys_call+0x1151/0x2330
[<0>] do_syscall_64+0x81/0xc90
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

fc_vcpu 1 (pid=162):
[<0>] kvm_vcpu_block+0x4a/0xc0 [kvm]
[<0>] kvm_vcpu_halt+0xfc/0x4b0 [kvm]
[<0>] vcpu_run+0x202/0x280 [kvm]
[<0>] kvm_arch_vcpu_ioctl_run+0x351/0x510 [kvm]
[<0>] kvm_vcpu_ioctl+0x128/0x910 [kvm]
[<0>] __x64_sys_ioctl+0xa0/0x100
[<0>] x64_sys_call+0x1151/0x2330
[<0>] do_syscall_64+0x81/0xc90
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

kvm-nx-lpage-re (pid=163):
[<0>] vhost_task_fn+0xea/0x110
[<0>] ret_from_fork+0x131/0x150
[<0>] ret_from_fork_asm+0x1a/0x30

As you can see, both vCPUs are halted waiting for an interrupt, so the interrupt is not really "unsolicited" (although technically it is). I tried various other approaches, but they all essentially boil down to "send the interrupt on save or on restore"; I'm not sure which one is better.

pkit · Dec 10 '25 21:12

Hey all, I think I have found the exact root cause of the issue and opened a PR that solves it: #5582. @pkit, you had correctly identified the issue, i.e. some state regarding interrupts was not being serialized properly, so resumed VMs were freezing. Your solution indeed worked as far as I tested it, but it was hiding the underlying issue in the following way:

This check:

if (force_signal && used_any) || queue.prepare_kick() {
...
}

short-circuits when force_signal && used_any is true, which means that queue.prepare_kick() is never called and the internal queue state is not modified. The interrupt we send was not actually being delivered to the guest in the snapshot-resume case, because part of the problem is that we are not saving the pending interrupt. The resumed microVMs still worked, though, because the emulation logic eventually calls the missed prepare_kick(), which causes the interrupt to be sent.
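
A tiny self-contained illustration of that short-circuit (the Queue below is a hypothetical stand-in, not the Firecracker type): when force_signal && used_any is true, prepare_kick() never runs, so the queue's internal state is never updated.

struct Queue {
    kick_prepared: bool,
}

impl Queue {
    /// Stand-in for the real prepare_kick(): updates internal state and
    /// reports whether the guest should be notified.
    fn prepare_kick(&mut self) -> bool {
        self.kick_prepared = true;
        true
    }
}

fn main() {
    let mut queue = Queue { kick_prepared: false };
    let force_signal = true;
    let used_any = true;

    // Same shape as the check quoted above: because || short-circuits,
    // prepare_kick() is skipped whenever force_signal && used_any holds.
    if (force_signal && used_any) || queue.prepare_kick() {
        // the interrupt would be sent here
    }

    // The queue's internal state was never updated on this path.
    assert!(!queue.kick_prepared);
    println!("prepare_kick() was skipped: kick_prepared = {}", queue.kick_prepared);
}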

I've addressed all of these (I hope) by fixing the ordering of the steps we take when creating a snapshot. Could you please try out my fix and see whether it solves the problem for you?
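
As a rough sketch of that ordering idea (hypothetical types again; not necessarily what #5582 actually changes): let the queue update and record its pending-notification state first, and only then serialize it, so the pending interrupt survives the snapshot/restore cycle.

#[derive(Debug)]
struct SavedQueueState {
    interrupt_pending: bool,
}

struct Queue {
    interrupt_pending: bool,
}

impl Queue {
    /// Stand-in for prepare_kick(): records that the guest still needs to
    /// be notified and reports whether a kick is due.
    fn prepare_kick(&mut self) -> bool {
        self.interrupt_pending = true;
        true
    }

    /// Serialize the queue, including the pending-interrupt flag, so a
    /// restored VM knows it still owes the guest a notification.
    fn save(&self) -> SavedQueueState {
        SavedQueueState { interrupt_pending: self.interrupt_pending }
    }
}

fn main() {
    let mut queue = Queue { interrupt_pending: false };

    // 1. Update the queue state first ...
    let needs_kick = queue.prepare_kick();

    // 2. ... then serialize it, so the pending interrupt is part of the
    //    snapshot and can be delivered after resume.
    let saved = queue.save();

    println!("needs_kick = {needs_kick}, saved = {saved:?}");
}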

bchalios · Dec 15 '25 11:12

@bchalios I think your explanation is on point:

> The interrupt we send was not actually being delivered to the guest in the snapshot-resume case, because part of the problem is that we are not saving the pending interrupt.

But I had failed to identify where it wasn't being saved, and if it's an ordering issue, that must be it! I've been very busy these days (a lot of meetings), but I will test your solution later today. :)

pkit · Dec 15 '25 17:12

@bchalios I think your fix works well! The pending-ops condition reproduces less frequently now, but it can still be hit, and the restore passes:

================================================================================
Attempt 13/30 - Testing for non-zero async I/O drain
================================================================================
2025-12-16T00:58:24.910953439 [530719cc-7040-48ab-bdd1-a0f2da42f35c:main:DEBUG:src/vmm/src/devices/virtio/block/virtio/io/async_io.rs:223] drain_and_flush draining: pending_ops=32 discard_cqes=false
Caught 32 pending ops, testing restore...
SUCCESS: VM restored with 32 pending async ops
PASSED

I will close this one in favor of #5582.

pkit · Dec 16 '25 01:12