[Bug] Ongoing `read` Syscalls on VSocks Don't Get Interrupted After a Snapshot Resume

Open pojntfx opened this issue 6 months ago • 0 comments

Describe the bug

Firecracker’s VSock connection reset is not working as expected after resuming from a snapshot. Specifically, the ongoing read syscall in the guest VM does not get interrupted, and the socat instance continues running instead of exiting due to the VSock connection reset.

To Reproduce

1. Start Firecracker
2. Create a VM with a VSock
3. Start a VSock-over-UDS listener on the host with socat
4. In the guest VM, connect to the listener on the host through VSock with socat
5. Pause the VM and create a snapshot
6. Stop the listener on the host
7. Resume the VM

The socat instance keeps running: the ongoing read syscall is not interrupted, so the connection reset has no visible effect. New read and write syscalls do fail as expected (they can be triggered by, for example, pressing Enter in socat); the failure surfaces as an EOF, and socat then exits as expected.
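The pause, snapshot and resume steps above can be sketched against the Firecracker API socket as follows. This is a minimal illustration, not the reproducer's actual scripts: the helper names (`UnixHTTPConnection`, `snapshot_cycle_requests`, `run_requests`) and all file paths are hypothetical, and it assumes the VM is resumed in the same Firecracker process. If the reproducer restores into a fresh process instead, the last request would differ.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a Unix domain socket, the transport of the Firecracker API."""

    def __init__(self, uds_path, timeout=10):
        super().__init__("localhost", timeout=timeout)
        self.uds_path = uds_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.settimeout(self.timeout)
        self.sock.connect(self.uds_path)


def snapshot_cycle_requests(snapshot_path, mem_file_path):
    """Return (method, path, body) tuples for pause -> snapshot -> resume."""
    return [
        ("PATCH", "/vm", {"state": "Paused"}),
        ("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": snapshot_path,
            "mem_file_path": mem_file_path,
        }),
        ("PATCH", "/vm", {"state": "Resumed"}),
    ]


def run_requests(api_socket_path, requests):
    """Send each request in order and fail loudly on a non-2xx status."""
    for method, path, body in requests:
        conn = UnixHTTPConnection(api_socket_path)
        try:
            conn.request(method, path, body=json.dumps(body),
                         headers={"Content-Type": "application/json"})
            response = conn.getresponse()
            response.read()
            if response.status >= 300:
                raise RuntimeError(f"{method} {path} failed: {response.status}")
        finally:
            conn.close()
```

To follow the order of the steps, a caller would send the first two requests, stop the host listener, and only then send the final `Resumed` request, for example by slicing the list returned by `snapshot_cycle_requests`.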

For the full reproduction steps, please see loopholelabs/firecracker-vsock-snapshot-reset-bug-reproducer. This includes the helper scripts and assets (kernel, rootfs) to reproduce the bug.

Expected behaviour

The Firecracker VSock docs state:

Firecracker handles sending the reset event to the vsock driver, thus the customers are no longer responsible for closing active connections.

From our reading, this should mean that the socat instance running inside the guest VM, which has an ongoing read syscall, exits due to the VSock connection being reset & the read syscall being interrupted.
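To make the expectation concrete, here is a rough sketch of what a guest-side reader would observe if the blocked read were interrupted. It is an illustration only: `wait_for_disconnect` is a hypothetical helper, not part of socat or Firecracker, and whether a reset would surface as an EOF or as `ECONNRESET` is an assumption we have not verified.

```python
import socket


def wait_for_disconnect(sock, bufsize=4096):
    """Block in recv() until the connection ends; report how the read ended.

    Received data is discarded; only the way the read terminates matters here.
    """
    while True:
        try:
            data = sock.recv(bufsize)
        except ConnectionResetError:
            return "reset"   # the read failed with ECONNRESET
        except OSError as err:
            return f"error: {err.errno}"
        if not data:
            return "eof"     # the read returned zero bytes


def dial_host_vsock(port):
    """Connect from the guest to the host on the given vsock port (Linux only)."""
    sock = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    try:
        sock.connect((socket.VMADDR_CID_HOST, port))
    except OSError:
        sock.close()
        raise
    return sock
```

With the behaviour described in this issue, a call such as `wait_for_disconnect(dial_host_vsock(port))` would stay blocked after the resume instead of returning.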

Environment

Firecracker version: Firecracker v1.7.0
Host and guest kernel versions: Host 6.10.4-200.fc40.x86_64, guest 6.1.89
Rootfs used: Buildroot 2024.02.5. See the loopholelabs/firecracker-vsock-snapshot-reset-bug-reproducer for more details and the specific rootfs being used - this is also reproducible with Ubuntu and Alpine rootfses.
Architecture: x86_64
Any other relevant software versions: The host is Fedora 40 on an Intel i7-1280P.

Additional context

How has this bug affected you/what are you trying to achieve: This bug affects the Drafter Agent System, which expects Firecracker to kill the active connection before it re-dials the host after a resume. Without the connection being killed by Firecracker, this does not work.

Do you have any idea of what the solution might be: Not a solution, but a workaround we've been trying is to close the connection ourselves before snapshotting. This causes a race condition on resume, because the dial loop in the guest will sometimes be killed when Firecracker resets the (new) connections after a resume. We've also been investigating a kernel return probe on kprobe/virtio_vsock_reset_sock, so that when we close the connections ourselves before suspending, we can hold off on re-dialing after a resume until Firecracker has reset the connections. Preferably, though, we would simply re-dial after Firecracker resets the active connection/interrupts the read syscall.
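For context, the kind of guest-side dial loop we mean looks roughly like the sketch below. It is not the Drafter implementation: `redial` and its parameters are hypothetical, and the `dial` callable stands in for whatever opens the VSock connection to the host.

```python
import time


def redial(dial, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call dial() until it succeeds, backing off exponentially between tries.

    A connection that is reset right after the resume shows up here as an
    OSError, which triggers another attempt instead of killing the loop.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return dial()
        except OSError as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise ConnectionError(f"giving up after {attempts} attempts") from last_error
```

Retrying only hides the race described above; it does not tell the guest when Firecracker has finished resetting the connections, which is why we would prefer the blocked read to be interrupted.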

Checks

[x] Have you searched the Firecracker Issues database for similar problems?
[x] Have you read the existing relevant Firecracker documentation?
[x] Are you certain the bug being reported is a Firecracker issue?

Aug 19 '24 21:08 pojntfx
