gvisor
Unable to checkpoint containers with `-leave-running`
Description
Checkpointing fails when using the `-leave-running` flag, with the error:
destroying container: stopping container: gofer is still running
Also, checkpointing without that flag succeeds, but the process exits with:
loading container: file does not exist
The resulting image works, though.
Steps to reproduce
# Start container
mkdir bundle && cd bundle
mkdir rootfs
docker export $(docker create ubuntu) | sudo tar -xf - -C rootfs --same-owner --same-permissions
sudo runsc spec -- sleep 60
sudo runsc run hello
# Attempt to checkpoint container
sudo runsc checkpoint -leave-running -image-path ./checkpoint hello
runsc version
runsc --version
runsc version 5eaa66a2ed33
spec: 1.1.0-rc.1
Debug logs
runsc.log.20240125-192522.932201.boot.txt
runsc.log.20240125-192522.930835.gofer.txt
runsc.log.20240125-192522.905068.run.txt
Thanks, I can repro this.
runsc.log.*.checkpoint.txt:
I0126 05:14:41.437771 530689 main.go:199] **************** gVisor ****************
D0126 05:14:41.437836 530689 state_file.go:78] Load container, rootDir: "/var/run/runsc", id: {SandboxID: ContainerID:hello}, opts: {Exact:false SkipCheck:false TryLock:false RootContainer:false}
D0126 05:14:41.439123 530689 container.go:675] Signal container, cid: hello, signal: signal 0 (0)
D0126 05:14:41.439171 530689 sandbox.go:1211] Signal sandbox "hello"
D0126 05:14:41.439183 530689 sandbox.go:613] Connecting to sandbox "hello"
D0126 05:14:41.439486 530689 urpc.go:568] urpc: successfully marshalled 85 bytes.
D0126 05:14:41.440215 530689 urpc.go:611] urpc: unmarshal success.
D0126 05:14:41.440417 530689 container.go:722] Checkpoint container, cid: hello
D0126 05:14:41.440433 530689 sandbox.go:1255] Checkpoint sandbox "hello", options {Compression:flate-best-speed}
D0126 05:14:41.440470 530689 sandbox.go:613] Connecting to sandbox "hello"
D0126 05:14:41.440622 530689 urpc.go:568] urpc: successfully marshalled 105 bytes.
D0126 05:14:41.472078 530689 urpc.go:611] urpc: unmarshal success.
W0126 05:14:41.472303 530689 specutils.go:124] noNewPrivileges ignored. PR_SET_NO_NEW_PRIVS is assumed to always be set.
D0126 05:14:41.472458 530689 specutils.go:86] Spec:
{
...
}
D0126 05:14:41.472476 530689 container.go:792] Destroy container, cid: hello
D0126 05:14:41.472555 530689 container.go:1089] Destroying container, cid: hello
D0126 05:14:41.472569 530689 sandbox.go:1437] Destroying root container by destroying sandbox, cid: hello
D0126 05:14:41.472577 530689 sandbox.go:1186] Destroying sandbox "hello"
D0126 05:14:41.472599 530689 sandbox.go:1195] Killing sandbox "hello"
D0126 05:14:41.572995 530689 container.go:1103] Killing gofer for container, cid: hello, PID: 530577
W0126 05:14:46.573477 530689 container.go:816] stopping container: gofer is still running
W0126 05:14:46.573707 530689 util.go:64] FATAL ERROR: destroying container: stopping container: gofer is still running
I think I understand the issue. It is due to the following deadlock:
- `sudo runsc run hello` is run in attached mode, so the parent `runsc run` process waits for the sandbox process to exit.
- That same function has also deferred a `Container.Destroy()` call, so when the sandbox process exits (for whatever reason), the `runsc run` process can clean up after the container.
- `sudo runsc checkpoint -leave-running` also tries to call `Container.Destroy()`. The `Destroy()` function takes a filesystem lock and attempts to destroy the sandbox.
- `runsc checkpoint` first kills the sandbox process, which awakens the `runsc run` process. It also attempts to call `Container.Destroy()`, but it blocks on the filesystem lock.
- The `runsc checkpoint` process now starts waiting for the gofer process to disappear from the process table (while holding the filesystem lock). While the gofer process has exited (as the logs show), it has become defunct and is waiting for its parent (the `runsc run` process) to wait on it. But the `runsc run` process is waiting for the filesystem lock.
To unblock yourself immediately, running `sudo runsc run --detach hello` should help. Let me come up with a fix for this.
Nice! I'll take a look at `runsc run --detach hello`.
This issue is fixed, and checkpointing works with the `-leave-running` flag.