[Bug] Ongoing `read` Syscalls on VSocks Don't Get Interrupted After a Snapshot Resume
Describe the bug
Firecracker’s VSock connection reset is not working as expected after resuming from a snapshot. Specifically, the ongoing read syscall in the guest VM does not get interrupted, and the socat instance continues running instead of exiting due to the VSock connection reset.
To Reproduce
- Start Firecracker
- Create a VM with a VSock
- Start a VSock-over-UDS listener on the host with `socat`
- In the guest VM, connect to the listener on the host through VSock with `socat`
- Pause the VM and create a snapshot
- Stop the listener on the host
- Resume the VM
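
The pause, snapshot, and resume steps above could be driven through the Firecracker API socket roughly as follows. This is a minimal sketch, not the reproducer's actual scripts: the socket path, snapshot paths, guest CID, and UDS path are placeholder assumptions, and it assumes the VM is resumed in the same Firecracker process with `PATCH /vm` (a restore into a fresh process would go through `PUT /snapshot/load` instead).

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.socket_path)
        self.sock = sock


def vsock_request(uds_path="/tmp/vsock.sock", guest_cid=3):
    # Placeholder values; the reproducer may use different ones.
    return ("PUT", "/vsock", {"guest_cid": guest_cid, "uds_path": uds_path})


def snapshot_cycle_requests(snapshot_path="/tmp/vm.snap", mem_path="/tmp/vm.mem"):
    """Requests for: pause the VM, create a full snapshot, resume the VM.

    The host-side listener would be stopped between the second and third
    request to match the reproduction steps.
    """
    return [
        ("PATCH", "/vm", {"state": "Paused"}),
        (
            "PUT",
            "/snapshot/create",
            {
                "snapshot_type": "Full",
                "snapshot_path": snapshot_path,
                "mem_file_path": mem_path,
            },
        ),
        ("PATCH", "/vm", {"state": "Resumed"}),
    ]


def send(api_socket, method, path, body):
    """Send one JSON request to the Firecracker API socket."""
    conn = UnixHTTPConnection(api_socket)
    try:
        conn.request(
            method,
            path,
            body=json.dumps(body),
            headers={"Content-Type": "application/json"},
        )
        response = conn.getresponse()
        return response.status, response.read().decode()
    finally:
        conn.close()
```

Each tuple from `snapshot_cycle_requests()` can be passed to `send()` along with the API socket path given to Firecracker via `--api-sock`.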
The `socat` instance continues running and the ongoing `read` syscall does not get interrupted; the connection reset has no effect. New `read` and `write` syscalls, however, fail as expected (they can be triggered by e.g. pressing Enter in `socat`), causing an EOF and thus causing `socat` to exit as expected.
For the full reproduction steps, please see loopholelabs/firecracker-vsock-snapshot-reset-bug-reproducer. This includes the helper scripts and assets (kernel, rootfs) to reproduce the bug.
Expected behaviour
The Firecracker VSock docs state:
> Firecracker handles sending the `reset` event to the vsock driver, thus the customers are no longer responsible for closing active connections.
From our reading, this should mean that the `socat` instance running inside the guest VM, which has an ongoing `read` syscall, exits due to the VSock connection being reset and the `read` syscall being interrupted.
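
To make the expectation concrete, here is a minimal guest-side sketch of a client that sits in a blocking `recv` and reports how the call ended. The helper names and the port argument are hypothetical and not taken from the reproducer; we would expect either outcome below (EOF or `ECONNRESET`) shortly after the resume, whereas what we observe is that the call simply keeps blocking.

```python
import socket


def wait_for_disconnect(sock, bufsize=4096):
    """Block in recv() until the peer closes or the connection is reset.

    Returns "eof" on an orderly shutdown and "reset" on ECONNRESET.
    """
    while True:
        try:
            data = sock.recv(bufsize)
        except ConnectionResetError:
            return "reset"
        if not data:
            return "eof"
        # Any payload is ignored in this sketch; keep reading.


def connect_to_host(port):
    """Dial the host over vsock (Linux only; the port is a placeholder)."""
    sock = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    sock.connect((socket.VMADDR_CID_HOST, port))
    return sock
```

In the guest, `wait_for_disconnect(connect_to_host(port))` stands in for the `socat` instance with its ongoing `read` syscall.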
Environment
- Firecracker version: Firecracker v1.7.0
- Host and guest kernel versions: Host `6.10.4-200.fc40.x86_64`, guest `6.1.89`
- Rootfs used: Buildroot 2024.02.5. See the loopholelabs/firecracker-vsock-snapshot-reset-bug-reproducer for more details and the specific rootfs being used - this is also reproducible with Ubuntu and Alpine rootfses.
- Architecture: `x86_64`
- Any other relevant software versions: The host is Fedora 40 on an Intel i7-1280P.
Additional context
How has this bug affected you/what are you trying to achieve: This bug affects the Drafter Agent System, which expects Firecracker to kill the active connection before it re-dials the host after a resume. Without the connection being killed by Firecracker, this does not work.
Do you have any idea of what the solution might be: Not a solution, but a workaround we've been trying to use is manually closing the connection ourselves before snapshotting. This causes a race condition on resume, however, because the dial loop in the guest will sometimes be killed by Firecracker resetting the (new) connections after a resume. We've also been investigating a kernel return probe on `kprobe/virtio_vsock_reset_sock` to hold off on re-dialing after a resume until Firecracker has reset the connections (for the case where we close them ourselves before suspending), but preferably we would simply re-dial after Firecracker resets the active connection/interrupts the `read` syscall.
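
For illustration, the re-dial behaviour we are after has roughly the following shape. This is a hedged sketch rather than the actual Drafter agent code: `connect` is a hypothetical callable that dials the host over VSock, and the attempt count and delays are arbitrary. It only shows the retry structure and does not solve the race described above, where a reset arriving late can still kill a freshly dialed connection.

```python
import time


def redial(connect, attempts=5, delay=0.1, sleep=time.sleep):
    """Call connect() until it succeeds, backing off between failures.

    connect is a hypothetical zero-argument callable that returns a
    connected socket or raises OSError (e.g. ConnectionRefusedError).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except OSError as err:
            last_error = err
            if attempt < attempts - 1:
                # Exponential backoff: delay, 2*delay, 4*delay, ...
                sleep(delay * (2 ** attempt))
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

The intent is to call `redial` only once the previous connection's `read` has returned with an error or EOF, which is exactly the signal that is missing today.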
Checks
- [x] Have you searched the Firecracker Issues database for similar problems?
- [x] Have you read the existing relevant Firecracker documentation?
- [x] Are you certain the bug being reported is a Firecracker issue?