Unable to checkpoint containers with `-leave-running`

Open luiscape opened this issue 1 year ago • 3 comments

Description

Checkpointing fails when using the -leave-running flag, with the error:

destroying container: stopping container: gofer is still running

Also, checkpointing without that flag succeeds, but the process exits with:

loading container: file does not exist

The resulting image works, though.

Steps to reproduce

# Start container
mkdir bundle && cd bundle
mkdir rootfs
docker export $(docker create ubuntu) | sudo tar -xf - -C rootfs --same-owner --same-permissions
sudo runsc spec -- sleep 60
sudo runsc run hello

# Attempt to checkpoint container
sudo runsc checkpoint -leave-running -image-path ./checkpoint hello

runsc version

runsc --version
runsc version 5eaa66a2ed33
spec: 1.1.0-rc.1

Debug logs

runsc.log.20240125-192522.932201.boot.txt runsc.log.20240125-192522.930835.gofer.txt runsc.log.20240125-192522.905068.run.txt

Jan 25 '24 19:01 luiscape

Thanks, I can repro this.

runsc.log.*.checkpoint.txt:

I0126 05:14:41.437771  530689 main.go:199] **************** gVisor ****************
D0126 05:14:41.437836  530689 state_file.go:78] Load container, rootDir: "/var/run/runsc", id: {SandboxID: ContainerID:hello}, opts: {Exact:false SkipCheck:false TryLock:false RootContainer:false}
D0126 05:14:41.439123  530689 container.go:675] Signal container, cid: hello, signal: signal 0 (0)
D0126 05:14:41.439171  530689 sandbox.go:1211] Signal sandbox "hello"
D0126 05:14:41.439183  530689 sandbox.go:613] Connecting to sandbox "hello"
D0126 05:14:41.439486  530689 urpc.go:568] urpc: successfully marshalled 85 bytes.
D0126 05:14:41.440215  530689 urpc.go:611] urpc: unmarshal success.
D0126 05:14:41.440417  530689 container.go:722] Checkpoint container, cid: hello
D0126 05:14:41.440433  530689 sandbox.go:1255] Checkpoint sandbox "hello", options {Compression:flate-best-speed}
D0126 05:14:41.440470  530689 sandbox.go:613] Connecting to sandbox "hello"
D0126 05:14:41.440622  530689 urpc.go:568] urpc: successfully marshalled 105 bytes.
D0126 05:14:41.472078  530689 urpc.go:611] urpc: unmarshal success.
W0126 05:14:41.472303  530689 specutils.go:124] noNewPrivileges ignored. PR_SET_NO_NEW_PRIVS is assumed to always be set.
D0126 05:14:41.472458  530689 specutils.go:86] Spec:
{
...
}
D0126 05:14:41.472476  530689 container.go:792] Destroy container, cid: hello
D0126 05:14:41.472555  530689 container.go:1089] Destroying container, cid: hello
D0126 05:14:41.472569  530689 sandbox.go:1437] Destroying root container by destroying sandbox, cid: hello
D0126 05:14:41.472577  530689 sandbox.go:1186] Destroying sandbox "hello"
D0126 05:14:41.472599  530689 sandbox.go:1195] Killing sandbox "hello"
D0126 05:14:41.572995  530689 container.go:1103] Killing gofer for container, cid: hello, PID: 530577
W0126 05:14:46.573477  530689 container.go:816] stopping container: gofer is still running
W0126 05:14:46.573707  530689 util.go:64] FATAL ERROR: destroying container: stopping container: gofer is still running

Jan 26 '24 05:01 ayushr2

I think I understand the issue. It is due to the following deadlock:

  • sudo runsc run hello is run in attached mode, so the parent runsc run process waits for the sandbox process to exit.
  • That same function has also deferred a Container.Destroy() call, so that when the sandbox process exits (for whatever reason), the runsc run process can clean up after the container.
  • sudo runsc checkpoint -leave-running also calls Container.Destroy(). The Destroy() function takes a filesystem lock and attempts to destroy the sandbox.
  • runsc checkpoint first kills the sandbox process. This wakes the runsc run process, which also calls Container.Destroy() but blocks on the filesystem lock.
  • The runsc checkpoint process then waits for the gofer process to disappear from the process table while still holding the filesystem lock. The gofer process has exited (as the logs show), but it stays defunct until its parent (the runsc run process) waits on it. The runsc run process, however, is itself waiting for the filesystem lock, so the gofer is not reaped before runsc checkpoint gives up.

To immediately unblock yourself, running sudo runsc run --detach hello should help. Let me come up with a fix for this.

Jan 26 '24 20:01 ayushr2

Nice! I'll take a look at runsc run --detach hello.

Jan 29 '24 13:01 luiscape

This issue is fixed, and checkpointing works with the -leave-running flag.

Apr 18 '24 19:04 nybidari