youki icon indicating copy to clipboard operation
youki copied to clipboard

Fully support OwnedFd

Open anti-entropy123 opened this issue 1 year ago • 60 comments

ref: issue #2317, PR #2369

Introducation Since nix version 0.27.1, some APIs (e.g., nix::pty::openpty(), nix::sys::socket::socket()) have started using OwnedFd to manage FDs. However, it is not straightforward to pass OwnedFd between forks. So in PR #2369, we still use RawFd in the subsequent steps.

This issue try to do the change from RawFd to OwnedFd.

anti-entropy123 avatar Oct 09 '23 06:10 anti-entropy123

Some ideas The main challenge with using OwnedFd is that ContainerArgs requires its fields to satisfy the Clone trait, but OwnedFd does not (and shouldn't, in my opinion) support it. Maybe we can impl Clone for structs like Sender, in which we can use OwnedFd::try_clone() to safely share underlying sockets or connections. If we do this, we also need to check whether old FD will be leaked due to CloneCb.

anti-entropy123 avatar Oct 09 '23 06:10 anti-entropy123

Hi, can i try this issue?? I'm very newbie for rust and container runtime, so it will be take a bit long time I think...

oirom avatar Dec 21 '23 01:12 oirom

Don't worry ;)

I'm very newbie for rust and container runtime, so it will be take a bit long time I think...

utam0k avatar Dec 21 '23 12:12 utam0k

@anti-entropy123 what should happen if OwnedFd::try_clone() returns Err?

say, when manually implementing the Clone trait for ContainerType

impl Clone for ContainerType {
    fn clone(&self) -> Self {
        match self {
            Self::InitContainer => Self::InitContainer,
            Self::TenantContainer { exec_notify_fd } => Self::TenantContainer { exec_notify_fd: exec_notify_fd.try_clone().unwrap() },
        }
    }
}

is unwrapping fine?

zahash avatar Mar 25 '24 15:03 zahash

i have a question. lets say i try_clone() a OwnedFd a bunch of times.

let a = f.try_clone().unwrap();
let b = f.try_clone().unwrap();

what if one of those variables is dropped? will the file close? if thats the case, then thats a big problem because having a OwnedFd instance assumes that the file is open

zahash avatar Mar 25 '24 16:03 zahash

does it make sense to do this maybe?

pub enum ContainerType {
    InitContainer,
    TenantContainer { exec_notify_fd: Rc<OwnedFd> },
}

zahash avatar Mar 25 '24 16:03 zahash

what should happen if OwnedFd::try_clone() returns Err?

say, when manually implementing the Clone trait for ContainerType

Directly unwrapping indeed doesn't seem quite appropriate...

Could we just implement a try_clone method for structures like ContainerType (should also include ContainerArgs), rather than derive(Clone)? This way, errors can be passed, reusing the error handling logic from before. 👀

This suggestion might not be fully matured, but I hope it helps you. @zahash

anti-entropy123 avatar Mar 25 '24 16:03 anti-entropy123

i have a question. lets say i try_clone() a OwnedFd a bunch of times.

let a = f.try_clone().unwrap();
let b = f.try_clone().unwrap();

what if one of those variables is dropped? will the file close? if thats the case, then thats a big problem because having a OwnedFd instance assumes that the file is open

@anti-entropy123 what about this issue then?

zahash avatar Mar 25 '24 16:03 zahash

what about this issue then?

I think closing one of the file descriptors won't affect the usability of the other one.

anti-entropy123 avatar Mar 25 '24 16:03 anti-entropy123

@anti-entropy123

but the OwnedFd calls close on the wrapped RawFd when dropped.

#[stable(feature = "io_safety", since = "1.63.0")]
impl Drop for OwnedFd {
    #[inline]
    fn drop(&mut self) {
        unsafe {
            // Note that errors are ignored when closing a file descriptor. The
            // reason for this is that if an error occurs we don't actually know if
            // the file descriptor was closed or not, and if we retried (for
            // something like EINTR), we might close another valid file descriptor
            // opened after we closed ours.
            #[cfg(not(target_os = "hermit"))]
            let _ = libc::close(self.fd);
            #[cfg(target_os = "hermit")]
            let _ = hermit_abi::close(self.fd);
        }
    }
}

zahash avatar Mar 25 '24 16:03 zahash

@zahash Sorry, I don't have a suitable Linux machine with me right now. I can try verifying the issue tomorrow morning.

TenantContainer { exec_notify_fd: Rc<OwnedFd> },

I've discussed this idea before; you can refer to this link for more information: https://github.com/containers/youki/pull/2369#issuecomment-1746761562

anti-entropy123 avatar Mar 25 '24 16:03 anti-entropy123

@anti-entropy123 @utam0k does cargo test run all the available tests?

zahash avatar Mar 25 '24 16:03 zahash

also, mmap no longer accepts f: Option<F> and instead just accepts f: F and with in mmap, it tries to unwrap_or(-1) to get the raw fd.

but now, i can't find a way to create a invalid Fd OwnedFd::from_raw_fd(-1) panics at runtime because rust doesn't allow -1 (asserts that fd != -1)

        mman::mmap(
            None,
            NonZeroUsize::new(default_stack_size).ok_or(CloneError::ZeroStackSize)?,
            mman::ProtFlags::PROT_READ | mman::ProtFlags::PROT_WRITE,
            mman::MapFlags::MAP_PRIVATE | mman::MapFlags::MAP_ANONYMOUS | mman::MapFlags::MAP_STACK,
            OwnedFd::from_raw_fd(-1), // used to be None
            0,
        )
        .map_err(CloneError::StackAllocation)?

so i decided to be a little naughty and use OwnedFd::from_raw_fd(-2) and there are no runtime panics due to assertions and all the tests are passing.

is this fine?

OwnedFd::from_raw_fd(RawFd::MAX) also works

zahash avatar Mar 25 '24 17:03 zahash

@zahash I think we can try using mmap_anonymous https://docs.rs/nix/latest/nix/sys/mman/fn.mmap_anonymous.html

anti-entropy123 avatar Mar 26 '24 01:03 anti-entropy123

what if one of those variables is dropped? will the file close? if thats the case, then thats a big problem because having a OwnedFd instance assumes that the file is open

I tested this code snippet, and it ran fine.

fn verify_try_clone(fd: OwnedFd) {
    let new_fd = fd.try_clone().unwrap();
    drop(fd);

    let mut new_file = unsafe { File::from_raw_fd(new_fd.as_raw_fd()) };
    new_file.write_all("hello world".as_bytes()).unwrap();
}

anti-entropy123 avatar Mar 26 '24 01:03 anti-entropy123

but the OwnedFd calls close on the wrapped RawFd when dropped.

@zahash The try_clone method of OwnedFd ultimately calls a syscall similar to dup2, so in the end, there are indeed two separate file descriptors. https://man7.org/linux/man-pages/man2/dup.2.html

anti-entropy123 avatar Mar 26 '24 01:03 anti-entropy123

@anti-entropy123 @utam0k

i was also thinking about having a ref datatype

pub enum ContainerTypeRef<'fd> {
    InitContainer,
    TenantContainer { exec_notify_id: BorrowedFd<'fd> }
}

that way, we may not have to necessarily clone all the time.

but the CloneCb in container_intermediate_process requires cloning the args.

    let cb: CloneCb = {
        let args = args.clone();
        ...
    };

zahash avatar Mar 26 '24 02:03 zahash

@anti-entropy123 @utam0k

i have made changes related to issues https://github.com/containers/youki/issues/2417 and https://github.com/containers/youki/issues/2723

all the tests pass. Please checkout the patch file. if you like where this is going and satisfied with most of the changes, i will create a pull request for further review.

diff --git a/crates/libcgroups/src/v1/memory.rs b/crates/libcgroups/src/v1/memory.rs
index 98fa6b1e..38b0dd07 100644
--- a/crates/libcgroups/src/v1/memory.rs
+++ b/crates/libcgroups/src/v1/memory.rs
@@ -332,7 +332,7 @@ impl Memory {
             Err(e) => {
                 // we need to look into the raw OS error for an EBUSY status
                 match e.inner().raw_os_error() {
-                    Some(code) => match Errno::from_i32(code) {
+                    Some(code) => match Errno::from_raw(code) {
                         Errno::EBUSY => {
                             let usage = Self::get_memory_usage(cgroup_root)?;
                             let max_usage = Self::get_memory_max_usage(cgroup_root)?;
diff --git a/crates/libcontainer/src/container/builder_impl.rs b/crates/libcontainer/src/container/builder_impl.rs
index 38f2f1cc..59d7a062 100644
--- a/crates/libcontainer/src/container/builder_impl.rs
+++ b/crates/libcontainer/src/container/builder_impl.rs
@@ -132,7 +132,7 @@ impl ContainerBuilderImpl {
             prctl::set_dumpable(false).map_err(|e| {
                 LibcontainerError::Other(format!(
                     "error in setting dumpable to false : {}",
-                    nix::errno::from_i32(e)
+                    nix::errno::Errno::from_raw(e)
                 ))
             })?;
         }
@@ -141,7 +141,7 @@ impl ContainerBuilderImpl {
         // therefore we will have to move all the variable by value. Since self
         // is a shared reference, we have to clone these variables here.
         let container_args = ContainerArgs {
-            container_type: self.container_type,
+            container_type: self.container_type.clone(),
             syscall: self.syscall,
             spec: Rc::clone(&self.spec),
             rootfs: self.rootfs.to_owned(),
diff --git a/crates/libcontainer/src/container/tenant_builder.rs b/crates/libcontainer/src/container/tenant_builder.rs
index 0d805a0d..2c2dbc88 100644
--- a/crates/libcontainer/src/container/tenant_builder.rs
+++ b/crates/libcontainer/src/container/tenant_builder.rs
@@ -1,6 +1,6 @@
 use caps::Capability;
 use nix::fcntl::OFlag;
-use nix::unistd::{self, close, pipe2, read, Pid};
+use nix::unistd::{self, pipe2, read, Pid};
 use oci_spec::runtime::{
     Capabilities as SpecCapabilities, Capability as SpecCapability, LinuxBuilder,
     LinuxCapabilities, LinuxCapabilitiesBuilder, LinuxNamespace, LinuxNamespaceBuilder,
@@ -8,6 +8,7 @@ use oci_spec::runtime::{
 };
 use procfs::process::Namespace;
 
+use std::os::fd::AsRawFd;
 use std::rc::Rc;
 use std::{
     collections::HashMap,
@@ -148,13 +149,13 @@ impl TenantContainerBuilder {
         let mut notify_socket = NotifySocket::new(notify_path);
         notify_socket.notify_container_start()?;
 
-        close(write_end).map_err(LibcontainerError::OtherSyscall)?;
+        drop(builder_impl); // this will close write_end fd
 
         let mut err_str_buf = Vec::new();
 
         loop {
             let mut buf = [0; 3];
-            match read(read_end, &mut buf).map_err(LibcontainerError::OtherSyscall)? {
+            match read(read_end.as_raw_fd(), &mut buf).map_err(LibcontainerError::OtherSyscall)? {
                 0 => {
                     if err_str_buf.is_empty() {
                         return Ok(pid);
diff --git a/crates/libcontainer/src/process/args.rs b/crates/libcontainer/src/process/args.rs
index ad37c07b..11c5d7f2 100644
--- a/crates/libcontainer/src/process/args.rs
+++ b/crates/libcontainer/src/process/args.rs
@@ -1,5 +1,6 @@
 use libcgroups::common::CgroupConfig;
 use oci_spec::runtime::Spec;
+use std::os::fd::OwnedFd;
 use std::os::unix::prelude::RawFd;
 use std::path::PathBuf;
 use std::rc::Rc;
@@ -9,10 +10,21 @@ use crate::notify_socket::NotifyListener;
 use crate::syscall::syscall::SyscallType;
 use crate::user_ns::UserNamespaceConfig;
 use crate::workload::Executor;
-#[derive(Debug, Copy, Clone)]
+#[derive(Debug)]
 pub enum ContainerType {
     InitContainer,
-    TenantContainer { exec_notify_fd: RawFd },
+    TenantContainer { exec_notify_fd: OwnedFd },
+}
+
+impl Clone for ContainerType {
+    fn clone(&self) -> Self {
+        match self {
+            Self::InitContainer => Self::InitContainer,
+            Self::TenantContainer { exec_notify_fd } => Self::TenantContainer {
+                exec_notify_fd: exec_notify_fd.try_clone().unwrap(/* is unwrap ok? */),
+            },
+        }
+    }
 }
 
 #[derive(Clone)]
diff --git a/crates/libcontainer/src/process/container_intermediate_process.rs b/crates/libcontainer/src/process/container_intermediate_process.rs
index 0085416b..7e5ac9c0 100644
--- a/crates/libcontainer/src/process/container_intermediate_process.rs
+++ b/crates/libcontainer/src/process/container_intermediate_process.rs
@@ -1,3 +1,5 @@
+use std::os::fd::AsRawFd;
+
 use crate::error::MissingSpecError;
 use crate::{namespaces::Namespaces, process::channel, process::fork};
 use libcgroups::common::CgroupManager;
@@ -128,12 +130,12 @@ pub fn container_intermediate_process(
                     if let Err(err) = main_sender.exec_failed(e.to_string()) {
                         tracing::error!(?err, "failed sending error to main sender");
                     }
-                    if let ContainerType::TenantContainer { exec_notify_fd } = args.container_type {
+                    if let ContainerType::TenantContainer { exec_notify_fd } = &args.container_type {
                         let buf = format!("{e}");
-                        if let Err(err) = write(exec_notify_fd, buf.as_bytes()) {
+                        if let Err(err) = write(&exec_notify_fd, buf.as_bytes()) {
                             tracing::error!(?err, "failed to write to exec notify fd");
                         }
-                        if let Err(err) = close(exec_notify_fd) {
+                        if let Err(err) = close(exec_notify_fd.as_raw_fd()) {
                             tracing::error!(?err, "failed to close exec notify fd");
                         }
                     }
@@ -157,8 +159,8 @@ pub fn container_intermediate_process(
     })?;
 
     // Close the exec_notify_fd in this process
-    if let ContainerType::TenantContainer { exec_notify_fd } = args.container_type {
-        close(exec_notify_fd).map_err(|err| {
+    if let ContainerType::TenantContainer { exec_notify_fd } = &args.container_type {
+        close(exec_notify_fd.as_raw_fd()).map_err(|err| {
             tracing::error!("failed to close exec notify fd: {}", err);
             IntermediateProcessError::ExecNotify(err)
         })?;
@@ -206,7 +208,7 @@ fn setup_userns(
     prctl::set_dumpable(true).map_err(|e| {
         IntermediateProcessError::Other(format!(
             "error in setting dumpable to true : {}",
-            nix::errno::from_i32(e)
+            nix::errno::Errno::from_raw(e)
         ))
     })?;
     sender.identifier_mapping_request().map_err(|err| {
@@ -220,7 +222,7 @@ fn setup_userns(
     prctl::set_dumpable(false).map_err(|e| {
         IntermediateProcessError::Other(format!(
             "error in setting dumplable to false : {}",
-            nix::errno::from_i32(e)
+            nix::errno::Errno::from_raw(e)
         ))
     })?;
     Ok(())
diff --git a/crates/libcontainer/src/process/fork.rs b/crates/libcontainer/src/process/fork.rs
index c0e05539..60a33f46 100644
--- a/crates/libcontainer/src/process/fork.rs
+++ b/crates/libcontainer/src/process/fork.rs
@@ -1,4 +1,4 @@
-use std::{ffi::c_int, fs::File, num::NonZeroUsize};
+use std::{ffi::c_int, num::NonZeroUsize};
 
 use libc::SIGCHLD;
 use nix::{
@@ -164,15 +164,17 @@ fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, Clone
     // do not use MAP_GROWSDOWN since it is not well supported.
     // Ref: https://man7.org/linux/man-pages/man2/mmap.2.html
     let child_stack = unsafe {
-        // Since nix = "0.27.1", `mmap()` requires a generic type `F: AsFd`.
-        // `::<File>` doesn't have any meaning because we won't use it.
-        mman::mmap::<File>(
+        // Since nix = "0.27.1", `mmap()` requires a generic type `F: AsFd` and takes f: Option<F>.
+        // if f: None, then fd = -1 is used; (-1 is just an invalid fd)
+        // let fd = f.map(|f| f.as_fd().as_raw_fd()).unwrap_or(-1);
+        // Since nix = "0.28.0" mmap function accepts f: F instead of Option<F>
+        // but it is not possible to manually create OwnedFd::from_raw_fd(-1) because rust will panic due to assert
+        // So, mmap_anonymous is used which achieves the exact same behaviour
+        mman::mmap_anonymous(
             None,
             NonZeroUsize::new(default_stack_size).ok_or(CloneError::ZeroStackSize)?,
             mman::ProtFlags::PROT_READ | mman::ProtFlags::PROT_WRITE,
             mman::MapFlags::MAP_PRIVATE | mman::MapFlags::MAP_ANONYMOUS | mman::MapFlags::MAP_STACK,
-            None,
-            0,
         )
         .map_err(CloneError::StackAllocation)?
     };
@@ -187,7 +189,7 @@ fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, Clone
 
     // Since the child stack for clone grows downward, we need to pass in
     // the top of the stack address.
-    let child_stack_top = unsafe { child_stack.add(default_stack_size) };
+    let child_stack_top = unsafe { child_stack.as_ptr().add(default_stack_size) };
 
     // Combine the clone flags with exit signals.
     let combined_flags = (flags | exit_signal.unwrap_or(0)) as c_int;
diff --git a/crates/libcontainer/src/seccomp/mod.rs b/crates/libcontainer/src/seccomp/mod.rs
index 2bfce4cb..a1acb131 100644
--- a/crates/libcontainer/src/seccomp/mod.rs
+++ b/crates/libcontainer/src/seccomp/mod.rs
@@ -332,7 +332,7 @@ mod tests {
             }
 
             if let Some(errno) = ret.err() {
-                if errno != nix::errno::from_i32(expect_error) {
+                if errno != nix::errno::Errno::from_raw(expect_error) {
                     Err(TestCallbackError::Custom(format!(
                         "getcwd failed but we didn't get the expected error from seccomp profile: {}",
                         errno
diff --git a/crates/libcontainer/src/syscall/linux.rs b/crates/libcontainer/src/syscall/linux.rs
index d01c4a54..35e5430c 100644
--- a/crates/libcontainer/src/syscall/linux.rs
+++ b/crates/libcontainer/src/syscall/linux.rs
@@ -314,7 +314,7 @@ impl Syscall for LinuxSyscall {
     fn set_id(&self, uid: Uid, gid: Gid) -> Result<()> {
         prctl::set_keep_capabilities(true).map_err(|errno| {
             tracing::error!(?errno, "failed to set keep capabilities to true");
-            nix::errno::from_i32(errno)
+            nix::errno::Errno::from_raw(errno)
         })?;
         // args : real *id, effective *id, saved set *id respectively
 
@@ -350,7 +350,7 @@ impl Syscall for LinuxSyscall {
         }
         prctl::set_keep_capabilities(false).map_err(|errno| {
             tracing::error!(?errno, "failed to set keep capabilities to false");
-            nix::errno::from_i32(errno)
+            nix::errno::Errno::from_raw(errno)
         })?;
         Ok(())
     }
diff --git a/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs b/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
index 10664b2e..60337c4d 100644
--- a/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
+++ b/tests/contest/contest/src/tests/seccomp_notify/seccomp_agent.rs
@@ -1,7 +1,7 @@
 use anyhow::{bail, Context, Result};
 use libcontainer::container::ContainerProcessState;
 use nix::{
-    sys::socket::{self, UnixAddr},
+    sys::socket::{self, Backlog, UnixAddr},
     unistd,
 };
 use std::{
@@ -35,7 +35,11 @@ pub fn recv_seccomp_listener(seccomp_listener: &Path) -> SeccompAgentResult {
     socket::bind(socket.as_raw_fd(), &addr).context("failed to bind to seccomp listener socket")?;
     // Force the backlog to be 1 so in the case of an error, only one connection
     // from clients will be waiting.
-    socket::listen(&socket.as_fd(), 1).context("failed to listen on seccomp listener")?;
+    socket::listen(
+        &socket.as_fd(),
+        Backlog::new(1).unwrap( /* safe to unwrap because Backlog(1) is always valid */ ),
+    )
+    .context("failed to listen on seccomp listener")?;
     let conn = match socket::accept(socket.as_raw_fd()) {
         Ok(conn) => conn,
         Err(e) => {
diff --git a/tests/contest/runtimetest/src/tests.rs b/tests/contest/runtimetest/src/tests.rs
index dee3afc7..d1d2f2ce 100644
--- a/tests/contest/runtimetest/src/tests.rs
+++ b/tests/contest/runtimetest/src/tests.rs
@@ -34,7 +34,7 @@ pub fn validate_readonly_paths(spec: &Spec) {
     // change manual matching of i32 to e.kind() and match statement
     for path in ro_paths {
         if let std::io::Result::Err(e) = test_read_access(path) {
-            let errno = Errno::from_i32(e.raw_os_error().unwrap());
+            let errno = Errno::from_raw(e.raw_os_error().unwrap());
             // In the integration tests we test for both existing and non-existing readonly paths
             // to be specified in the spec, so we allow ENOENT here
             if errno == Errno::ENOENT {
@@ -50,7 +50,7 @@ pub fn validate_readonly_paths(spec: &Spec) {
         }
 
         if let std::io::Result::Err(e) = test_write_access(path) {
-            let errno = Errno::from_i32(e.raw_os_error().unwrap());
+            let errno = Errno::from_raw(e.raw_os_error().unwrap());
             // In the integration tests we test for both existing and non-existing readonly paths
             // being specified in the spec, so we allow ENOENT, and we expect EROFS as the paths
             // should be read-only

zahash avatar Mar 26 '24 03:03 zahash

Hey @zahash , I'd suggest you wait till #2728 is merged, as that changes some of the things mentioned here, eg the mmap_anonymous. Also instead of using patch files, please either submit a PR, so link to a branch in your repo, as that would be easier for everyone to check and see.

YJDoc2 avatar Mar 26 '24 05:03 YJDoc2

@YJDoc2 oh, didn't realize there was already a pr to update nix. Fine then i will wait till it gets merged and then continue my work on the "Fully support OwnedFd" part

zahash avatar Mar 26 '24 05:03 zahash

also, @utam0k can i also be assigned to this issue?

zahash avatar Mar 26 '24 05:03 zahash

@oirom , do you still plan to work on this? Otherwise we'd probably re-assign this to @zahash

YJDoc2 avatar Mar 26 '24 06:03 YJDoc2

@zahash, have you considered the possibility of a FD leak here?

container_main_process.rs

pub fn container_main_process(container_args: &ContainerArgs) -> Result<(Pid, bool)> {
    ...
    let cb: CloneCb = {
        let container_args = container_args.clone();
        ...
        Box::new(move || {
        ...
        })
    };
    ...
    let intermediate_pid = fork::container_clone(cb)

fork.rs

fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, CloneError> {
    ...
    extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
        unsafe { Box::from_raw(data as *mut CloneCb)() }
    }
    let ret = unsafe {
        libc::clone(
            main,
            child_stack_top,
            combined_flags,
            data as *mut libc::c_void,
        )
    };
    ...

It seems the child process doesn't have a chance to drop the OwnedFd outside of CloneCb. :thinking: (In the past, this wasn't a problem because they were all the same RawFd.)

anti-entropy123 avatar Mar 26 '24 08:03 anti-entropy123

@oirom , do you still plan to work on this? Otherwise we'd probably re-assign this to @zahash

Now this issue is too difficult to me, so please re-assign to @zahash

oirom avatar Mar 26 '24 08:03 oirom

Now this issue is too difficult to me, so please re-assign to @zahash

No worries, feel free to take a look at some other issues such as #361 if you still want to contribute!

@zahash , assigning to you :+1:

YJDoc2 avatar Mar 26 '24 08:03 YJDoc2

@zahash, have you considered the possibility of a FD leak here?

container_main_process.rs

pub fn container_main_process(container_args: &ContainerArgs) -> Result<(Pid, bool)> {
    ...
    let cb: CloneCb = {
        let container_args = container_args.clone();
        ...
        Box::new(move || {
        ...
        })
    };
    ...
    let intermediate_pid = fork::container_clone(cb)

fork.rs

fn clone(cb: CloneCb, flags: u64, exit_signal: Option<u64>) -> Result<Pid, CloneError> {
    ...
    extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
        unsafe { Box::from_raw(data as *mut CloneCb)() }
    }
    let ret = unsafe {
        libc::clone(
            main,
            child_stack_top,
            combined_flags,
            data as *mut libc::c_void,
        )
    };
    ...

It seems the child process doesn't have a chance to drop the OwnedFd outside of CloneCb. 🤔 (In the past, this wasn't a problem because they were all the same RawFd.)

@anti-entropy123 what about this? in the below line?

unsafe { drop(Box::from_raw(data)) };

zahash avatar Mar 26 '24 11:03 zahash

I think it seems only the parent process can reach this point. Additionally, this is still dropping CloneCb rather than releasing data outside of the closure.

anti-entropy123 avatar Mar 26 '24 13:03 anti-entropy123

@anti-entropy123

i don't think i fully understand what you are saying. can you please explain what you mean by "data outside of the closure"

can you also please explain a bit what the clone and clone3 functions are doing? its a bit hard to understand 😅

zahash avatar Mar 26 '24 15:03 zahash

@YJDoc2 is it feasible to use OwnedFd in Sender and Receiver? or is it too big of a change?

pub struct Receiver<T> {
    receiver: OwnedFd,
    phantom: PhantomData<T>,
}

pub struct Sender<T> {
    sender: OwnedFd,
    phantom: PhantomData<T>,
}

but OwnedFd cannot be cloned, especially when it comes to creating a CloneCb in container_main_process and container_intermediate_process. So, what should be the strategy here?

zahash avatar Mar 26 '24 18:03 zahash

@zahash Here's my understanding of execution steps, but it might not be accurate...

  1. When creating CloneCb, container_args along with its internal OwnedFd are cloned, and then moved into the closure.
  2. Subsequently, clone() is called to create a child process. The child process inherits all files opened by the parent process. (You can confirm this by logging)
fn get_current_fds_nums() -> usize {
    let proc_fd_dir = format!("/proc/{}/fd", process::id());
    fs::read_dir(proc_fd_dir).unwrap().collect::<Vec<_>>().len()
}

image image image 3. However, the entrypoint of the child process is set to a special main function, which simply calls the provided closure. The child process loses ownership of any variables outside of the closure. (This includes the original container_args as well.) So, it doesn't have the opportunity to execute the drop() function for those variables. image

Even though clone3() or clone() might be executed based on the system environment, the result should be the same.

anti-entropy123 avatar Mar 27 '24 03:03 anti-entropy123