youki
Implement container restore functionality
ref : continues from https://github.com/containers/youki/issues/142
Currently youki supports checkpointing (under the command name `checkpointt`), but the restore part has not been implemented yet. We should do that.
Note: Before anyone starts on this, make sure the following commands work as expected on your system:
# run a loop which keeps printing numbers, with runc runtime ; in background
sudo podman run --runtime runc -dt fedora bash -c "v=0;while true;do sleep 1; echo \"\$v\"; let \"v++\";done;"
# get the name/id of the launched container
sudo podman ps
# this will attach current console to the container. DO NOT do ctrl+c to exit, instead use `a` key
# keep running for some time, let the number increase
sudo podman attach <container-id/name> --detach-keys=a
# enter a, and detach again
# checkpoint and shut-down container
sudo podman container checkpoint <container-id/name>
# wait for a bit
# restore the container
sudo podman container restore <container-id/name>
# attach again immediately
sudo podman attach <container-id/name> --detach-keys=a
# the output should now print numbers greater, by a considerable margin, than those seen in the previous attach
After that, implement restore in youki, rename the `checkpointt` command to `checkpoint`, and make the above work with youki instead of runc.
Another note: the criu library is quite specific about which kernel versions it supports and needs. If you run into a criu failure with a segfault, check previous issues on criu to see whether you need to upgrade or downgrade the library version for your kernel.
I ran the above on an Ubuntu-based system with kernel version 6.4.6 and criu v3.17.1 (3.16 does not work).
You can check the runc code for help:
checkpoint: https://github.com/opencontainers/runc/blob/main/checkpoint.go
restore: https://github.com/opencontainers/runc/blob/main/restore.go
Hey, I'm trying to research `checkpoint` and `restore`. However, I've noticed that there seem to be some problems with the current `checkpoint` implementation.
> sudo podman run --runtime ~/rust_project/youki/youki -dt fedora bash -c "v=0;while true;do sleep 1; echo \"\$v\"; let \"v++\";done;"
> sudo podman container --runtime ~/rust_project/youki/youki ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
eb0b484cafdc registry.fedoraproject.org/fedora:latest bash -c v=0;while... 17 minutes ago Up 17 minutes ago musing_cohen
> sudo podman container --runtime ~/rust_project/youki/youki checkpoint eb0b484cafdc21a4d9
DEBUG youki: started by user 0 with ArgsOs { inner: ["/home/yjn/rust_project/youki/youki", "checkpoint", "--image-path", "/var/lib/containers/storage/overlay-containers/eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280/userdata/checkpoint", "--work-path", "/var/lib/containers/storage/overlay-containers/eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280/userdata", "eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280"] }
DEBUG youki::commands::checkpoint: start checkpointing container eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280
ERROR libcontainer::container::container_checkpoint: failed to open criu image directory path="/var/lib/containers/storage/overlay-containers/eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280/userdata/checkpoint" err=Os { code: 2, kind: NotFound, message: "No such file or directory" }
ERROR youki: error in executing command: failed to checkpoint container eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280
Caused by:
0: io error
1: No such file or directory (os error 2)
Error: failed to checkpoint container eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280
Caused by:
0: io error
1: No such file or directory (os error 2)
Error: `/home/yjn/rust_project/youki/youki checkpoint --image-path /var/lib/containers/storage/overlay-containers/eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280/userdata eb0b484cafdc21a4d9017f3723127e33a10366b5a963cd78f1d38127f681f280` failed: exit status 1
Of course, I've made the necessary changes to rename the `checkpointt` subcommand to `checkpoint`.
> git diff | cat
diff --git a/crates/liboci-cli/src/lib.rs b/crates/liboci-cli/src/lib.rs
index 89c48a6d..03a5ae2e 100644
--- a/crates/liboci-cli/src/lib.rs
+++ b/crates/liboci-cli/src/lib.rs
@@ -50,7 +50,7 @@ pub enum StandardCmd {
// and other runtimes.
#[derive(Parser, Debug)]
pub enum CommonCmd {
- Checkpointt(Checkpoint),
+ Checkpoint(Checkpoint),
Events(Events),
Exec(Exec),
Features(Features),
diff --git a/crates/youki/src/main.rs b/crates/youki/src/main.rs
index 6a92be8d..7f0e23c7 100644
--- a/crates/youki/src/main.rs
+++ b/crates/youki/src/main.rs
@@ -116,7 +116,7 @@ fn main() -> Result<()> {
StandardCmd::State(state) => commands::state::state(state, root_path),
},
SubCommand::Common(cmd) => match *cmd {
- CommonCmd::Checkpointt(checkpoint) => {
+ CommonCmd::Checkpoint(checkpoint) => {
commands::checkpoint::checkpoint(checkpoint, root_path)
}
CommonCmd::Events(events) => commands::events::events(events, root_path),
I believe the cause of the error is likely not related to my system environment (e.g., CRIU), because I can perform checkpoint and restore using `podman` + `runc`.
After resolving the `No such file or directory` error mentioned above, there are still some CRIU-related errors:
> sudo podman container checkpoint fb8bc5974
DEBUG youki: started by user 0 with ArgsOs { inner: ["/home/yjn/rust_project/youki/youki", "checkpoint", "--image-path", "/var/lib/containers/storage/overlay-containers/fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea/userdata/checkpoint", "--work-path", "/var/lib/containers/storage/overlay-containers/fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea/userdata", "fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea"] }
DEBUG youki::commands::checkpoint: start checkpointing container fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea
ERROR libcontainer::container::container_checkpoint: checkpointing container failed err="CRIU RPC request failed with message:Error (criu/files-reg.c:1815): Can't lookup mount=26 for fd=0 path=/dev/pts/5\n error:0" id="fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea" logfile="/var/lib/containers/storage/overlay-containers/fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea/userdata/checkpoint/dump.log"
ERROR youki: error in executing command: failed to checkpoint container fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea
Caused by:
CRIU RPC request failed with message:Error (criu/files-reg.c:1815): Can't lookup mount=26 for fd=0 path=/dev/pts/5
error:0
Error: failed to checkpoint container fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea
Caused by:
CRIU RPC request failed with message:Error (criu/files-reg.c:1815): Can't lookup mount=26 for fd=0 path=/dev/pts/5
error:0
Error: `/home/yjn/rust_project/youki/youki checkpoint --image-path /var/lib/containers/storage/overlay-containers/fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea/userdata/checkpoint --work-path /var/lib/containers/storage/overlay-containers/fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea/userdata fb8bc5974a8854d9d9a77d2438937a412f0bdf1e710f97c148981a19b0718eea` failed: exit status 1
I'd like to know whether these errors can be reproduced by others. Is it necessary to open a separate issue to address the potential problems with `checkpoint`?
After resolving the mentioned No such file or directory error,
Hey, can you mention what the issue behind this error was, and how you resolved it?
I'd like to know if these errors can be reproduced by others? Is it necessary to open a separate issue to address the potential problems with checkpoint?
I haven't tried running checkpoint before; I had assumed it was working. I will try running it and check for the errors that you encountered. If this is a bug and not just a setup issue, then we can either fix it along with the restore implementation or open a separate issue.
can you mention what the issue behind this error was, and how you resolved it?
It's quite simple, just create the missing directories directly. (It seems to be the case in runc as well.)
> git diff | cat -p -P
diff --git a/crates/libcontainer/src/container/container_checkpoint.rs b/crates/libcontainer/src/container/container_checkpoint.rs
index a6054734..25a08ba6 100644
--- a/crates/libcontainer/src/container/container_checkpoint.rs
+++ b/crates/libcontainer/src/container/container_checkpoint.rs
@@ -15,6 +15,10 @@ const DESCRIPTORS_JSON: &str = "descriptors.json";
impl Container {
pub fn checkpoint(&mut self, opts: &CheckpointOptions) -> Result<(), LibcontainerError> {
+ if !opts.image_path.is_dir() {
+ fs::create_dir_all(&opts.image_path).expect("failed.")
+ };
+
self.refresh_status()?;
// can_pause() checks if the container is running. That also works for
There is indeed an issue in the checkpoint implementation, as the same error also occurs on my system. There is an open issue on criu with a similar error, https://github.com/checkpoint-restore/criu/issues/1785 , but it needs more investigation into why it is happening. Thanks for checking and reporting. The initial issue of `image_path` not existing also needs to be checked; verifying how runc handles this...
Thanks for your sharing too~
verifying on how runc handles this
I noticed that runc also handles `image_path` by creating it directly: https://github.com/opencontainers/runc/blob/a32ad76da330c20c27b79ccbd20ff58629fc4b7d/libcontainer/criu_linux.go#L303C15-L303C15
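The directory creation discussed above could also be written so that a failure is returned to the caller instead of panicking through `expect`. This is only a minimal sketch: `ensure_image_dir` and `ImageDirError` are hypothetical names introduced for illustration, not part of libcontainer, and a real fix would presumably map the failure into libcontainer's existing error type.

```rust
use std::fmt;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical error type (not a libcontainer type) that records which
/// path could not be created, so the log shows more than "io error".
#[derive(Debug)]
pub struct ImageDirError {
    pub path: PathBuf,
    pub source: io::Error,
}

impl fmt::Display for ImageDirError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "failed to create criu image directory {}: {}",
            self.path.display(),
            self.source
        )
    }
}

impl std::error::Error for ImageDirError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        Some(&self.source)
    }
}

/// Hypothetical helper: make sure the image directory exists before CRIU
/// is asked to dump into it. `create_dir_all` creates missing parents and
/// succeeds when the directory is already present, so no `is_dir()` check
/// is needed beforehand.
pub fn ensure_image_dir(image_path: &Path) -> Result<(), ImageDirError> {
    fs::create_dir_all(image_path).map_err(|source| ImageDirError {
        path: image_path.to_path_buf(),
        source,
    })
}
```

Called at the top of `checkpoint` with `&opts.image_path`, a helper like this would let the `?` operator surface the failure, provided a conversion into the function's error type exists.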
CRIU RPC request failed with message:Error (criu/files-reg.c:1815): Can't lookup mount=26 for fd=0 path=/dev/pts/5
@adrianreber can you help us with some suggestions regarding this error in checkpointing? I saw the issues https://github.com/checkpoint-restore/criu/issues/860 and https://github.com/checkpoint-restore/criu/issues/1785 , but the kernel issue in the first one does not seem applicable. As mentioned by @anti-entropy123 , checkpointing works with runc, so what could be a potential cause of this particular error, or what would be a good way to start debugging it?
Not sure what the question is, but I can only recommend not using Ubuntu for CRIU. There are non-upstream kernel patches which break CRIU all the time. Sorry.
Hey, sorry if I wasn't clear:
while trying out the current implementation of youki's checkpoint, both @anti-entropy123 and I are getting this error from criu: `CRIU RPC request failed with message:Error (criu/files-reg.c:1815): Can't lookup mount=26 for fd=0 path=/dev/pts/5`, even though running checkpoint with runc works fine and gives no error. I didn't think the kernel would be the issue, as runc successfully uses criu to checkpoint.
The issues I linked in the previous comment were about the same error message, but the first one concerns kernel problems (on which even runc was failing, hence I don't think it is applicable here), and the second one is still open. I wanted to ask if you have any idea why this error might crop up, or where a good place to start debugging it would be. Thanks :)
Ah, I see. The current implementation only works without a connected terminal. To handle the terminal correctly, additional steps are necessary. During restore in particular, a callback is necessary to tell youki the correct tty FD.
You should look at crun, as the criu Rust bindings are architecturally closer to the C bindings.
Thank you for your help. I removed the `-t` flag, and now it's working fine. @adrianreber
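Since checkpointing currently only works for containers without a connected terminal, one option until terminal handling is implemented would be to reject such containers up front with a clear message, rather than letting CRIU fail on `/dev/pts/...`. The sketch below is an assumption about how that could look, not youki's actual code: `ProcessSpec`, `CheckpointPrecheckError`, and `precheck_checkpoint` are hypothetical names, and `ProcessSpec` is a simplified stand-in for the `process` section of the container's OCI `config.json`, where `terminal` is optional and defaults to false.

```rust
/// Hypothetical, simplified view of the `process` section of an OCI
/// config.json. Only the field relevant to this check is modelled.
pub struct ProcessSpec {
    /// `process.terminal` from the OCI runtime spec; absent means false.
    pub terminal: Option<bool>,
}

/// Hypothetical error for the pre-check (not a libcontainer type).
#[derive(Debug, PartialEq, Eq)]
pub enum CheckpointPrecheckError {
    /// The container was created with a terminal (e.g. `podman run -t`).
    TerminalAttached,
}

/// Refuse to checkpoint when the container has a terminal attached,
/// because the dump is known to fail in that case.
pub fn precheck_checkpoint(process: &ProcessSpec) -> Result<(), CheckpointPrecheckError> {
    if process.terminal.unwrap_or(false) {
        Err(CheckpointPrecheckError::TerminalAttached)
    } else {
        Ok(())
    }
}
```

With a check like this, the `-dt` invocation from the reproduction steps would fail early with an explicit reason, while the same command with `-d` only would proceed to the CRIU dump.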