runc
Failure to run user namespaced container
Description
Unable to run user-namespaced container.
My setup is:
containerd v1.7.0 (which supports user namespaces)
ctr version
Client:
Version: v1.7.0
Revision: 1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
Go version: go1.20.2
Server:
Version: v1.7.0
Revision: 1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
UUID: 514e04fd-642e-4f20-a0bd-99b3bbdb3c65
runc version 1.1.4
runc --version
runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d1
spec: 1.0.2-dev
go: go1.17.10
libseccomp: 2.5.4
Here are the command-line arguments being passed to runc by containerd:
--root /run/containerd/runc/k8s.io --log /run/containerd/io.containerd.runtime.v2.task/k8s.io/0fb39cf81c4a554b1e7b0ce148a705cb8bdc3624b0ce8681541b73da87290b4b/log.json --log-format json --systemd-cgroup create --bundle /run/containerd/io.containerd.runtime.v2.task/k8s.io/0fb39cf81c4a554b1e7b0ce148a705cb8bdc3624b0ce8681541b73da87290b4b --pid-file /run/containerd/io.containerd.runtime.v2.task/k8s.io/0fb39cf81c4a554b1e7b0ce148a705cb8bdc3624b0ce8681541b73da87290b4b/init.pid 0fb39cf81c4a554b1e7b0ce148a705cb8bdc3624b0ce8681541b73da87290b4b
Here is the config.json
{"ociVersion":"1.1.0-rc.1","process":{"user":{"uid":65535,"gid":65535,"additionalGids":[65535]},"args":["/pause"],"env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],"cwd":"/","capabilities":{"bounding":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"],"effective":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"],"permitted":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"]},"noNewPrivileges":true,"oomScoreAdj":-998},"root":{"path":"rootfs","readonly":true},"hostname":"user-namespace-vinaygo","mounts":[{"destination":"/proc","type":"proc","source":"proc","options":["nosuid","noexec","nodev"]},{"destination":"/dev","type":"tmpfs","source":"tmpfs","options":["nosuid","strictatime","mode=755","size=65536k"]},{"destination":"/dev/pts","type":"devpts","source":"devpts","options":["nosuid","noexec","newinstance","ptmxmode=0666","mode=0620","gid=5"]},{"destination":"/dev/mqueue","type":"mqueue","source":"mqueue","options":["nosuid","noexec","nodev"]},{"destination":"/sys","type":"sysfs","source":"sysfs","options":["nosuid","noexec","nodev","ro"]},{"destination":"/dev/shm","type":"bind","source":"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a/shm","options":["rbind","ro","nosuid","nodev","noexec"]},{"destination":"/etc/resolv.conf","type":"bind","source":"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a/resolv.conf","options":["rbind","ro"]}],"annotations":{"io.kubernetes.cri.container-type":"sandbox","io.kubernetes.cri.sandbox-cpu-period":"100000","io.kubernetes.cri.sandbox-cpu-quota":"0","io.kubernetes.cri.sandbox-cpu-shares":"2","io.kubernetes.cri.sandbox-id":"3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a","io.kubernetes.cri.sandbox-log-directory":"/var/log/pods/default_user-namespace-vinaygo_80bd4a09-b19b-4d81-800b-6b5d605b1558","io.kubernetes.cri.sandbox-memory":"0","io.kubernetes.cri.sandbox-name":"user-namespace-vinaygo","io.kubernetes.cri.sandbox-namespace":"default","io.kubernetes.cri.sandbox-uid":"80bd4a09-b19b-4d81-800b-6b5d605b1558"},"linux":{"uidMappings":[{"containerID":0,"hostID":2515861504,"size":65536}],"gidMappings":[{"containerID":0,"hostID":2515861504,"size":65536}],"sysctl":{"net.core.somaxconn":"1024","net.ipv4.conf.all.accept_redirects":"0","net.ipv4.conf.all.forwarding":"1","net.ipv4.conf.all.route_localnet":"1","net.ipv4.conf.default.forwarding":"1","net.ipv4.ip_forward":"1","net.ipv4.tcp_fin_timeout":"60","net.ipv4.tcp_keepalive_intvl":"60","net.ipv4.tcp_keepalive_probes":"5","net.ipv4.tcp_keepalive_time":"300","net.ipv4.tcp_rmem":"4096 87380 6291456","net.ipv4.tcp_syn_retries":"6","net.ipv4.tcp_tw_reuse":"0","net.ipv4.tcp_wmem":"4096 16384 
4194304","net.ipv4.udp_rmem_min":"4096","net.ipv4.udp_wmem_min":"4096","net.ipv6.conf.all.disable_ipv6":"1","net.ipv6.conf.default.accept_ra":"0","net.ipv6.conf.default.disable_ipv6":"1","net.netfilter.nf_conntrack_generic_timeout":"600","net.netfilter.nf_conntrack_tcp_be_liberal":"1","net.netfilter.nf_conntrack_tcp_timeout_close_wait":"3600","net.netfilter.nf_conntrack_tcp_timeout_established":"86400"},"resources":{"devices":[{"allow":false,"access":"rwm"}],"cpu":{"shares":2}},"cgroupsPath":"kubepods-besteffort-pod80bd4a09_b19b_4d81_800b_6b5d605b1558.slice:cri-containerd:3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a","namespaces":[{"type":"pid"},{"type":"ipc"},{"type":"uts"},{"type":"mount"},{"type":"network"},{"type":"user"}],"seccomp":{"defaultAction":"SCMP_ACT_ERRNO","architectures":["SCMP_ARCH_X86_64","SCMP_ARCH_X86","SCMP_ARCH_X32"],"syscalls":[{"names":["accept","accept4","access","adjtimex","alarm","bind","brk","capget","capset","chdir","chmod","chown","chown32","clock_adjtime","clock_adjtime64","clock_getres","clock_getres_time64","clock_gettime","clock_gettime64","clock_nanosleep","clock_nanosleep_time64","close","close_range","connect","copy_file_range","creat","dup","dup2","dup3","epoll_create","epoll_create1","epoll_ctl","epoll_ctl_old","epoll_pwait","epoll_pwait2","epoll_wait","epoll_wait_old","eventfd","eventfd2","execve","execveat","exit","exit_group","faccessat","faccessat2","fadvise64","fadvise64_64","fallocate","fanotify_mark","fchdir","fchmod","fchmodat","fchown","fchown32","fchownat","fcntl","fcntl64","fdatasync","fgetxattr","flistxattr","flock","fork","fremovexattr","fsetxattr","fstat","fstat64","fstatat64","fstatfs","fstatfs64","fsync","ftruncate","ftruncate64","futex","futex_time64","futex_waitv","futimesat","getcpu","getcwd","getdents","getdents64","getegid","getegid32","geteuid","geteuid32","getgid","getgid32","getgroups","getgroups32","getitimer","getpeername","getpgid","getpgrp","getpid","getppid","getpriority","getrandom","getresgid","getresgid32","getresuid","getresuid32","getrlimit","get_robust_list","getrusage","getsid","getsockname","getsockopt","get_thread_area","gettid","gettimeofday","getuid","getuid32","getxattr","inotify_add_watch","inotify_init","inotify_init1","inotify_rm_watch","io_cancel","ioctl","io_destroy","io_getevents","io_pgetevents","io_pgetevents_time64","ioprio_get","ioprio_set","io_setup","io_submit","io_uring_enter","io_uring_register","io_uring_setup","ipc","kill","landlock_add_rule","landlock_create_ruleset","landlock_restrict_self","lchown","lchown32","lgetxattr","link","linkat","listen","listxattr","llistxattr","_llseek","lremovexattr","lseek","lsetxattr","lstat","lstat64","madvise","membarrier","memfd_create","memfd_secret","mincore","mkdir","mkdirat","mknod","mknodat","mlock","mlock2","mlockall","mmap","mmap2","mprotect","mq_getsetattr","mq_notify","mq_open","mq_timedreceive","mq_timedreceive_time64","mq_timedsend","mq_timedsend_time64","mq_unlink","mremap","msgctl","msgget","msgrcv","msgsnd","msync","munlock","munlockall","munmap","nanosleep","newfstatat","_newselect","open","openat","openat2","pause","pidfd_open","pidfd_send_signal","pipe","pipe2","pkey_alloc","pkey_free","pkey_mprotect","poll","ppoll","ppoll_time64","prctl","pread64","preadv","preadv2","prlimit64","process_mrelease","pselect6","pselect6_time64","pwrite64","pwritev","pwritev2","read","readahead","readlink","readlinkat","readv","recv","recvfrom","recvmmsg","recvmmsg_time64","recvmsg","remap_file_pages","removexattr","rename","renameat","renameat2
","restart_syscall","rmdir","rseq","rt_sigaction","rt_sigpending","rt_sigprocmask","rt_sigqueueinfo","rt_sigreturn","rt_sigsuspend","rt_sigtimedwait","rt_sigtimedwait_time64","rt_tgsigqueueinfo","sched_getaffinity","sched_getattr","sched_getparam","sched_get_priority_max","sched_get_priority_min","sched_getscheduler","sched_rr_get_interval","sched_rr_get_interval_time64","sched_setaffinity","sched_setattr","sched_setparam","sched_setscheduler","sched_yield","seccomp","select","semctl","semget","semop","semtimedop","semtimedop_time64","send","sendfile","sendfile64","sendmmsg","sendmsg","sendto","setfsgid","setfsgid32","setfsuid","setfsuid32","setgid","setgid32","setgroups","setgroups32","setitimer","setpgid","setpriority","setregid","setregid32","setresgid","setresgid32","setresuid","setresuid32","setreuid","setreuid32","setrlimit","set_robust_list","setsid","setsockopt","set_thread_area","set_tid_address","setuid","setuid32","setxattr","shmat","shmctl","shmdt","shmget","shutdown","sigaltstack","signalfd","signalfd4","sigprocmask","sigreturn","socketcall","socketpair","splice","stat","stat64","statfs","statfs64","statx","symlink","symlinkat","sync","sync_file_range","syncfs","sysinfo","tee","tgkill","time","timer_create","timer_delete","timer_getoverrun","timer_gettime","timer_gettime64","timer_settime","timer_settime64","timerfd_create","timerfd_gettime","timerfd_gettime64","timerfd_settime","timerfd_settime64","times","tkill","truncate","truncate64","ugetrlimit","umask","uname","unlink","unlinkat","utime","utimensat","utimensat_time64","utimes","vfork","vmsplice","wait4","waitid","waitpid","write","writev"],"action":"SCMP_ACT_ALLOW"},{"names":["socket"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":40,"op":"SCMP_CMP_NE"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":0,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":8,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":131072,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":131080,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":4294967295,"op":"SCMP_CMP_EQ"}]},{"names":["process_vm_readv","process_vm_writev","ptrace"],"action":"SCMP_ACT_ALLOW"},{"names":["arch_prctl","modify_ldt"],"action":"SCMP_ACT_ALLOW"},{"names":["chroot"],"action":"SCMP_ACT_ALLOW"},{"names":["clone"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":2114060288,"op":"SCMP_CMP_MASKED_EQ"}]},{"names":["clone3"],"action":"SCMP_ACT_ERRNO","errnoRet":38}]},"maskedPaths":["/proc/acpi","/proc/asound","/proc/kcore","/proc/keys","/proc/latency_stats","/proc/timer_list","/proc/timer_stats","/proc/sched_debug","/sys/firmware","/proc/scsi"],"readonlyPaths":["/proc/bus","/proc/fs","/proc/irq","/proc/sys","/proc/sysrq-trigger"]}}
Steps to reproduce the issue
With containerd 1.7.0 and runc 1.1.4 installed, run the following:
Create a container with the config.json shown above.
{"ociVersion":"1.1.0-rc.1","process":{"user":{"uid":65535,"gid":65535,"additionalGids":[65535]},"args":["/pause"],"env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],"cwd":"/","capabilities":{"bounding":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"],"effective":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"],"permitted":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"]},"noNewPrivileges":true,"oomScoreAdj":-998},"root":{"path":"rootfs","readonly":true},"hostname":"user-namespace-vinaygo","mounts":[{"destination":"/proc","type":"proc","source":"proc","options":["nosuid","noexec","nodev"]},{"destination":"/dev","type":"tmpfs","source":"tmpfs","options":["nosuid","strictatime","mode=755","size=65536k"]},{"destination":"/dev/pts","type":"devpts","source":"devpts","options":["nosuid","noexec","newinstance","ptmxmode=0666","mode=0620","gid=5"]},{"destination":"/dev/mqueue","type":"mqueue","source":"mqueue","options":["nosuid","noexec","nodev"]},{"destination":"/sys","type":"sysfs","source":"sysfs","options":["nosuid","noexec","nodev","ro"]},{"destination":"/dev/shm","type":"bind","source":"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a/shm","options":["rbind","ro","nosuid","nodev","noexec"]},{"destination":"/etc/resolv.conf","type":"bind","source":"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a/resolv.conf","options":["rbind","ro"]}],"annotations":{"io.kubernetes.cri.container-type":"sandbox","io.kubernetes.cri.sandbox-cpu-period":"100000","io.kubernetes.cri.sandbox-cpu-quota":"0","io.kubernetes.cri.sandbox-cpu-shares":"2","io.kubernetes.cri.sandbox-id":"3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a","io.kubernetes.cri.sandbox-log-directory":"/var/log/pods/default_user-namespace-vinaygo_80bd4a09-b19b-4d81-800b-6b5d605b1558","io.kubernetes.cri.sandbox-memory":"0","io.kubernetes.cri.sandbox-name":"user-namespace-vinaygo","io.kubernetes.cri.sandbox-namespace":"default","io.kubernetes.cri.sandbox-uid":"80bd4a09-b19b-4d81-800b-6b5d605b1558"},"linux":{"uidMappings":[{"containerID":0,"hostID":2515861504,"size":65536}],"gidMappings":[{"containerID":0,"hostID":2515861504,"size":65536}],"sysctl":{"net.core.somaxconn":"1024","net.ipv4.conf.all.accept_redirects":"0","net.ipv4.conf.all.forwarding":"1","net.ipv4.conf.all.route_localnet":"1","net.ipv4.conf.default.forwarding":"1","net.ipv4.ip_forward":"1","net.ipv4.tcp_fin_timeout":"60","net.ipv4.tcp_keepalive_intvl":"60","net.ipv4.tcp_keepalive_probes":"5","net.ipv4.tcp_keepalive_time":"300","net.ipv4.tcp_rmem":"4096 87380 6291456","net.ipv4.tcp_syn_retries":"6","net.ipv4.tcp_tw_reuse":"0","net.ipv4.tcp_wmem":"4096 16384 
4194304","net.ipv4.udp_rmem_min":"4096","net.ipv4.udp_wmem_min":"4096","net.ipv6.conf.all.disable_ipv6":"1","net.ipv6.conf.default.accept_ra":"0","net.ipv6.conf.default.disable_ipv6":"1","net.netfilter.nf_conntrack_generic_timeout":"600","net.netfilter.nf_conntrack_tcp_be_liberal":"1","net.netfilter.nf_conntrack_tcp_timeout_close_wait":"3600","net.netfilter.nf_conntrack_tcp_timeout_established":"86400"},"resources":{"devices":[{"allow":false,"access":"rwm"}],"cpu":{"shares":2}},"cgroupsPath":"kubepods-besteffort-pod80bd4a09_b19b_4d81_800b_6b5d605b1558.slice:cri-containerd:3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a","namespaces":[{"type":"pid"},{"type":"ipc"},{"type":"uts"},{"type":"mount"},{"type":"network"},{"type":"user"}],"seccomp":{"defaultAction":"SCMP_ACT_ERRNO","architectures":["SCMP_ARCH_X86_64","SCMP_ARCH_X86","SCMP_ARCH_X32"],"syscalls":[{"names":["accept","accept4","access","adjtimex","alarm","bind","brk","capget","capset","chdir","chmod","chown","chown32","clock_adjtime","clock_adjtime64","clock_getres","clock_getres_time64","clock_gettime","clock_gettime64","clock_nanosleep","clock_nanosleep_time64","close","close_range","connect","copy_file_range","creat","dup","dup2","dup3","epoll_create","epoll_create1","epoll_ctl","epoll_ctl_old","epoll_pwait","epoll_pwait2","epoll_wait","epoll_wait_old","eventfd","eventfd2","execve","execveat","exit","exit_group","faccessat","faccessat2","fadvise64","fadvise64_64","fallocate","fanotify_mark","fchdir","fchmod","fchmodat","fchown","fchown32","fchownat","fcntl","fcntl64","fdatasync","fgetxattr","flistxattr","flock","fork","fremovexattr","fsetxattr","fstat","fstat64","fstatat64","fstatfs","fstatfs64","fsync","ftruncate","ftruncate64","futex","futex_time64","futex_waitv","futimesat","getcpu","getcwd","getdents","getdents64","getegid","getegid32","geteuid","geteuid32","getgid","getgid32","getgroups","getgroups32","getitimer","getpeername","getpgid","getpgrp","getpid","getppid","getpriority","getrandom","getresgid","getresgid32","getresuid","getresuid32","getrlimit","get_robust_list","getrusage","getsid","getsockname","getsockopt","get_thread_area","gettid","gettimeofday","getuid","getuid32","getxattr","inotify_add_watch","inotify_init","inotify_init1","inotify_rm_watch","io_cancel","ioctl","io_destroy","io_getevents","io_pgetevents","io_pgetevents_time64","ioprio_get","ioprio_set","io_setup","io_submit","io_uring_enter","io_uring_register","io_uring_setup","ipc","kill","landlock_add_rule","landlock_create_ruleset","landlock_restrict_self","lchown","lchown32","lgetxattr","link","linkat","listen","listxattr","llistxattr","_llseek","lremovexattr","lseek","lsetxattr","lstat","lstat64","madvise","membarrier","memfd_create","memfd_secret","mincore","mkdir","mkdirat","mknod","mknodat","mlock","mlock2","mlockall","mmap","mmap2","mprotect","mq_getsetattr","mq_notify","mq_open","mq_timedreceive","mq_timedreceive_time64","mq_timedsend","mq_timedsend_time64","mq_unlink","mremap","msgctl","msgget","msgrcv","msgsnd","msync","munlock","munlockall","munmap","nanosleep","newfstatat","_newselect","open","openat","openat2","pause","pidfd_open","pidfd_send_signal","pipe","pipe2","pkey_alloc","pkey_free","pkey_mprotect","poll","ppoll","ppoll_time64","prctl","pread64","preadv","preadv2","prlimit64","process_mrelease","pselect6","pselect6_time64","pwrite64","pwritev","pwritev2","read","readahead","readlink","readlinkat","readv","recv","recvfrom","recvmmsg","recvmmsg_time64","recvmsg","remap_file_pages","removexattr","rename","renameat","renameat2
","restart_syscall","rmdir","rseq","rt_sigaction","rt_sigpending","rt_sigprocmask","rt_sigqueueinfo","rt_sigreturn","rt_sigsuspend","rt_sigtimedwait","rt_sigtimedwait_time64","rt_tgsigqueueinfo","sched_getaffinity","sched_getattr","sched_getparam","sched_get_priority_max","sched_get_priority_min","sched_getscheduler","sched_rr_get_interval","sched_rr_get_interval_time64","sched_setaffinity","sched_setattr","sched_setparam","sched_setscheduler","sched_yield","seccomp","select","semctl","semget","semop","semtimedop","semtimedop_time64","send","sendfile","sendfile64","sendmmsg","sendmsg","sendto","setfsgid","setfsgid32","setfsuid","setfsuid32","setgid","setgid32","setgroups","setgroups32","setitimer","setpgid","setpriority","setregid","setregid32","setresgid","setresgid32","setresuid","setresuid32","setreuid","setreuid32","setrlimit","set_robust_list","setsid","setsockopt","set_thread_area","set_tid_address","setuid","setuid32","setxattr","shmat","shmctl","shmdt","shmget","shutdown","sigaltstack","signalfd","signalfd4","sigprocmask","sigreturn","socketcall","socketpair","splice","stat","stat64","statfs","statfs64","statx","symlink","symlinkat","sync","sync_file_range","syncfs","sysinfo","tee","tgkill","time","timer_create","timer_delete","timer_getoverrun","timer_gettime","timer_gettime64","timer_settime","timer_settime64","timerfd_create","timerfd_gettime","timerfd_gettime64","timerfd_settime","timerfd_settime64","times","tkill","truncate","truncate64","ugetrlimit","umask","uname","unlink","unlinkat","utime","utimensat","utimensat_time64","utimes","vfork","vmsplice","wait4","waitid","waitpid","write","writev"],"action":"SCMP_ACT_ALLOW"},{"names":["socket"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":40,"op":"SCMP_CMP_NE"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":0,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":8,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":131072,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":131080,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":4294967295,"op":"SCMP_CMP_EQ"}]},{"names":["process_vm_readv","process_vm_writev","ptrace"],"action":"SCMP_ACT_ALLOW"},{"names":["arch_prctl","modify_ldt"],"action":"SCMP_ACT_ALLOW"},{"names":["chroot"],"action":"SCMP_ACT_ALLOW"},{"names":["clone"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":2114060288,"op":"SCMP_CMP_MASKED_EQ"}]},{"names":["clone3"],"action":"SCMP_ACT_ERRNO","errnoRet":38}]},"maskedPaths":["/proc/acpi","/proc/asound","/proc/kcore","/proc/keys","/proc/latency_stats","/proc/timer_list","/proc/timer_stats","/proc/sched_debug","/sys/firmware","/proc/scsi"],"readonlyPaths":["/proc/bus","/proc/fs","/proc/irq","/proc/sys","/proc/sysrq-trigger"]}}
Describe the results you received and expected
I get the following error:
"runc create failed: unable to start container process: error during container init: error mounting \"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/6065696b009d70452b2b229d976df91ff2b2e3bf75c6855bb91f4f1c42a4f1e9/resolv.conf\" to rootfs at \"/etc/resolv.conf\": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/8), flags: 0x5021: operation not permitted: unknown" pod="default/user-namespace-vinaygo"
Expected:
No error. Non-user-namespaced containers are able to run.
What version of runc are you using?
runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d1
spec: 1.0.2-dev
go: go1.17.10
libseccomp: 2.5.4
Host OS information
NAME="Container-Optimized OS" ID=cos PRETTY_NAME="Container-Optimized OS from Google" HOME_URL="https://cloud.google.com/container-optimized-os/docs" BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us" GOOGLE_METRICS_PRODUCT_ID=26 KERNEL_COMMIT_ID=44456f0e9d2cd7a9616fb0d05bc4020237839a5a GOOGLE_CRASH_ID=Lakitu VERSION=101 VERSION_ID=101 BUILD_ID=17162.40.56
Host kernel information
Linux
/cc @rata
Thanks! I can't repro with that, though :(
The issue really seems like the same symptom that PR https://github.com/opencontainers/runc/pull/3511 fixed, but that fix is in 1.1.4 and you are running 1.1.4. So maybe something with a similar symptom is still lurking there.
What I've tried so far, without managing to repro:
- That config.json, adding a pause binary to a busybox rootfs (properly chowned to the hostID in the userns mapping) and removing the /dev/shm and resolv.conf bind mounts, as their sources don't exist on my computer; a rough sketch of this setup is after this list. This works fine when running
runc run --debug --systemd-cgroup mycontainer
- I've tried keeping those mounts, but using /dev/shm and /etc/resolv.conf as sources and running runc as before; this also works fine
- I've created /mnt/test/ where test doesn't have rx permissions for others, and copied the resolv.conf file there to use as the source. This works fine too
- I've also chowned /mnt/test to user 1:1 (so it is not owned by root, which runc runs as), but this also worked fine
- I've tried changing one directory in the path to the rootfs to not have rx permissions for others (sudo chmod o-rx /home/), but this doesn't fail there: the mount of the rootfs itself fails, so it fails before reaching the resolv.conf mount.
- I've started a k8s cluster with containerd 1.7 and runc 1.1.4 and I still don't see the issue when creating a pod with userns.
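For reference, a rough sketch of that bundle preparation (hypothetical paths; the hostID is the one from the uidMappings in the config.json above, and it assumes docker is available to export a busybox rootfs):
mkdir -p bundle/rootfs && cd bundle
docker export "$(docker create busybox)" | tar -C rootfs -xf -
cp /path/to/pause rootfs/pause              # hypothetical path to a static pause binary
sudo chown -R 2515861504:2515861504 rootfs  # hostID from the uidMappings
# with the edited config.json placed in this directory:
sudo runc run --debug --systemd-cgroup mycontainer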
So, I can't really repro with that config. It would be great if you could:
- Find a way to repro this on Debian or some other distro that we have easy access to. This would help a lot
- cd to the dir where the config.json file is and run:
sudo runc run --debug --systemd-cgroup mycontainer
and paste here what it prints (does it fail or not?)
- Are you using cgroups v1 or cgroups v2? Can you please try with both and explain how you verified which one you are actually using? (a quick check is sketched after this list)
- If the runc command in 2 fails, can you also run
strace -f -s 512 <command from step 2>
?
- Can you copy the runc git repo to that host, install bats, and run:
sudo bats -t tests/integration/userns.bats
- Can you paste the output of running
ls -ld
for all paths in the resolv.conf mount? Like: sudo ls -dl / /var /var/lib/ /var/lib/containerd/ /var/lib/containerd/io.containerd.grpc.v1.cri/ ...
- What is your container runtime? Is it containerd or docker? If it is containerd and you are starting it with a systemd service, can you add
SupplementaryGroups=0
to the systemd service (a drop-in sketch is after this list), restart containerd and see if the problem still happens? This is due to this bug https://github.com/opencontainers/runc/issues/2484, which was fixed in https://github.com/opencontainers/runc/commit/9c444070ec7bb83995dbc0185da68284da71c554 but introduced the regression that was fixed in 1.1.4. If you run with that, and that is causing the issue, it should work around it. I doubt it will help, but the more information we have, the easier it is to debug (especially when we can't repro).
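A quick way to check which cgroup version a host uses (a common check, not specific to this setup):
stat -fc %T /sys/fs/cgroup   # prints cgroup2fs on cgroups v2, tmpfs on cgroups v1
And a sketch of the SupplementaryGroups=0 change as a systemd drop-in, assuming containerd runs as a regular containerd.service unit (the drop-in path is just the usual location, adjust as needed):
sudo mkdir -p /etc/systemd/system/containerd.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/containerd.service.d/10-supplementary-groups.conf
[Service]
SupplementaryGroups=0
EOF
sudo systemctl daemon-reload && sudo systemctl restart containerd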
These are the things that come to mind that might help us debug this. But it would be great if @kolyshkin can have a look.
Thanks for the detailed steps @rata! I'll run through them and report back.
The container runtime is containerd 1.7.0
I still can't get things to run on an Ubuntu node.
cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
uname -a
Linux 5.15.0-1024-gke #29-Ubuntu SMP Fri Dec 16 06:28:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Here is the error:
"Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: open /proc/sys/net/core/somaxconn: no such file or directory: unknown" pod="default/user-namespace-vinaygo"
Then I cd into the directory with config.json
186816# ls
address config.json log options.json rootfs runtime shim-binary-path work
root@gke-host-user-vinaygo-default-pool-b3011f78-g2jd:/tmp/rata-debug-k8s/186816# /home/kubernetes/bin/runc/runc --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[187238]: => nsexec container setup
DEBU[0000] nsexec[187238]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[187238]: ~> nsexec stage-0
DEBU[0000] nsexec-0[187238]: spawn stage-1
DEBU[0000] nsexec-0[187238]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[187240]: ~> nsexec stage-1
DEBU[0000] nsexec-1[187240]: unshare user namespace
DEBU[0000] nsexec-1[187240]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[187238]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[187238]: update /proc/187240/uid_map to '0 3091988480 65536
'
DEBU[0000] nsexec-0[187238]: update /proc/187240/gid_map to '0 3091988480 65536
'
DEBU[0000] nsexec-1[187240]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[187240]: unshare remaining namespace (except cgroupns)
DEBU[0000] nsexec-1[187240]: request stage-0 to send mount sources
DEBU[0000] nsexec-0[187238]: stage-1 requested to open mount sources
FATA[0000] nsexec-0[187238]: failed to open mount source /run/containerd/io.containerd.grpc.v1.cri/sandboxes/77e8d26e550636d55b510e943ccc24bda2d9474a3c7ad58c5acdeead4e9f15f8/shm: No such file or directory
FATA[0000] nsexec-1[187240]: failed to receive fd from unix socket 8: Invalid argument
ERRO[0000]utils.go:62 main.fatalWithCode() runc run failed: unable to start container process: can't get final child's PID from pipe: EOF
Then I cloned the runc git repo and ran bats:
bats -t tests/integration/userns.bats
1..4
ok 1 userns with simple mount
ok 2 userns with 2 inaccessible mounts
ok 3 userns with inaccessible mount + exec
not ok 4 userns with bind mount before a cgroupfs mount
# (from function `requires' in file tests/integration/helpers.bash, line 488,
# in test file tests/integration/userns.bats, line 72)
# `requires cgroups_v1' failed
# runc spec (status=0):
#
# /usr/lib/bats-core/test_functions.bash: line 57: BATS_TEARDOWN_STARTED: unbound variable
@vinayakankugoyal things should run on Ubuntu; it is probably some config or binary missing on your side.
The error you pasted is from containerd, and that is not what we want. The runc output you pasted is not useful either; see that it says:
failed to open mount source /run/containerd/io.containerd.grpc.v1.cri/sandboxes/77e8d26e550636d55b510e943ccc24bda2d9474a3c7ad58c5acdeead4e9f15f8/shm: No such file or directory
That file doesn't exist anymore and you are not seeing the error you saw before. You will need to repro this when the file exists, or copy it and adjust the config.json (those two bind mounts, the shm and the resolv.conf).
I'm not sure the bats output is useful either; it seems to throw an error due to some bats variable not being set. Maybe it is something with your bats installation?
Also, when you have the time, please go through all the things I asked and answer them all :)
OK, I was able to get a repro by pointing the config.json to another running container's sandbox ID. Now I get the same error as I was getting from the kubelet on Ubuntu.
root@gke-host-user-vinaygo-default-pool-b3011f78-g2jd:/tmp/rata-debug-k8s/1227175# /home/kubernetes/bin/runc.amd64 --debug --systemd-cgroup run blah
DEBU[0000] nsexec[1238579]: => nsexec container setup
DEBU[0000] nsexec[1238579]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[1238579]: ~> nsexec stage-0
DEBU[0000] nsexec-0[1238579]: spawn stage-1
DEBU[0000] nsexec-0[1238579]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[1238583]: ~> nsexec stage-1
DEBU[0000] nsexec-1[1238583]: unshare user namespace
DEBU[0000] nsexec-1[1238583]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[1238579]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[1238579]: update /proc/1238583/uid_map to '0 3091988480 65536
'
DEBU[0000] nsexec-0[1238579]: update /proc/1238583/gid_map to '0 3091988480 65536
'
DEBU[0000] nsexec-1[1238583]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[1238583]: unshare remaining namespace (except cgroupns)
DEBU[0000] nsexec-1[1238583]: spawn stage-2
DEBU[0000] nsexec-1[1238583]: request stage-0 to forward stage-2 pid (1238584)
DEBU[0000] nsexec-0[1238579]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[1238579]: forward stage-1 (1238583) and stage-2 (1238584) pids to runc
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-1[1238583]: signal completion to stage-0
DEBU[0000] nsexec-0[1238579]: stage-1 complete
DEBU[0000] nsexec-0[1238579]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[1238579]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[1238579]: signalling stage-2 to run
DEBU[0000] nsexec-1[1238583]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-0[1238579]: stage-2 complete
DEBU[0000] nsexec-0[1238579]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[1238579]: <~ nsexec stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] child process in init()
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: error during container init: open /proc/sys/net/core/somaxconn: no such file or directory
I noticed the following sysctls:
"sysctl": {
"net.core.somaxconn": "1024",
"net.ipv4.conf.all.accept_redirects": "0",
"net.ipv4.conf.all.forwarding": "1",
"net.ipv4.conf.all.route_localnet": "1",
"net.ipv4.conf.default.forwarding": "1",
"net.ipv4.ip_forward": "1",
"net.ipv4.tcp_fin_timeout": "60",
"net.ipv4.tcp_keepalive_intvl": "60",
"net.ipv4.tcp_keepalive_probes": "5",
"net.ipv4.tcp_keepalive_time": "300",
"net.ipv4.tcp_rmem": "4096 87380 6291456",
"net.ipv4.tcp_syn_retries": "6",
"net.ipv4.tcp_tw_reuse": "0",
"net.ipv4.tcp_wmem": "4096 16384 4194304",
"net.ipv4.udp_rmem_min": "4096",
"net.ipv4.udp_wmem_min": "4096",
"net.ipv6.conf.all.disable_ipv6": "1",
"net.ipv6.conf.default.accept_ra": "0",
"net.ipv6.conf.default.disable_ipv6": "1",
"net.netfilter.nf_conntrack_generic_timeout": "600",
"net.netfilter.nf_conntrack_tcp_be_liberal": "1",
"net.netfilter.nf_conntrack_tcp_timeout_close_wait": "3600",
"net.netfilter.nf_conntrack_tcp_timeout_established": "86400"
},
If I remove "net.core.somaxconn": "1024" from config.json, it works.
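For reference, a hypothetical jq one-liner to drop that key from config.json (assuming jq is installed):
jq 'del(.linux.sysctl["net.core.somaxconn"])' config.json > config.json.new && mv config.json.new config.json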
Here is what I see when I remove that sysctl
runc.amd64 --debug --systemd-cgroup run blah
DEBU[0000] nsexec[1271048]: => nsexec container setup
DEBU[0000] nsexec[1271048]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[1271048]: ~> nsexec stage-0
DEBU[0000] nsexec-0[1271048]: spawn stage-1
DEBU[0000] nsexec-0[1271048]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[1271052]: ~> nsexec stage-1
DEBU[0000] nsexec-1[1271052]: unshare user namespace
DEBU[0000] nsexec-1[1271052]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[1271052]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[1271048]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[1271048]: update /proc/1271052/uid_map to '0 3091988480 65536
'
DEBU[0000] nsexec-0[1271048]: update /proc/1271052/gid_map to '0 3091988480 65536
'
DEBU[0000] nsexec-1[1271052]: unshare remaining namespace (except cgroupns)
DEBU[0000] nsexec-1[1271052]: spawn stage-2
DEBU[0000] nsexec-1[1271052]: request stage-0 to forward stage-2 pid (1271053)
DEBU[0000] nsexec-0[1271048]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[1271048]: forward stage-1 (1271052) and stage-2 (1271053) pids to runc
DEBU[0000] nsexec-1[1271052]: signal completion to stage-0
DEBU[0000] nsexec-0[1271048]: stage-1 complete
DEBU[0000] nsexec-0[1271048]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[1271048]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[1271048]: signalling stage-2 to run
DEBU[0000] nsexec-1[1271052]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] nsexec-0[1271048]: stage-2 complete
DEBU[0000] nsexec-0[1271048]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[1271048]: <~ nsexec stage-0
DEBU[0000] child process in init()
DEBU[0000] seccomp: prepending -ENOSYS stub filter to user filter...
DEBU[0000] [ 0] ld [4]
DEBU[0000] [ 1] jeq #1073741827,2
DEBU[0000] [ 2] jeq #3221225534,4
DEBU[0000] [ 3] ja 10
DEBU[0000] [ 4] ld [0]
DEBU[0000] [ 5] jgt #449,7
DEBU[0000] [ 6] ja 7
DEBU[0000] [ 7] ld [0]
DEBU[0000] [ 8] jset #1073741824,1
DEBU[0000] [ 9] jgt #449,3,1
DEBU[0000] [ 10] jgt #1073742371,2
DEBU[0000] [ 11] ja 2
DEBU[0000] [ 12] ja 1
DEBU[0000] [ 13] ret #327718
DEBU[0000] [....] --- original filter ---
DEBU[0000] init: closing the pipe to signal completion
Non-user-namespaced containers also have that sysctl in their config.json and those come up fine.
@vinayakankugoyal Hmm, that is not the same error, is it? I mean, the one you mentioned here: open /proc/sys/net/core/somaxconn: no such file or directory.
That is not the same error you reported in the original issue description. Am I missing something?
Please try to create a repro for the error in the original issue description. As I mentioned, I tried several scenarios but couldn't repro; it seems you have a setup where you can, so let's see how it is that you hit it.
I am not able to repro on the Ubuntu-based nodes the same issue that I am seeing on the COS-based nodes.
On COS I am seeing:
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: error during container init: error mounting "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf" to rootfs at "/etc/resolv.conf": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/8), flags: 0x5021: operation not permitted
On Ubuntu nodes I am seeing:
open /proc/sys/net/core/somaxconn: no such file or directory
However, I made the same change on the COS node and now I get the original error. I also ran strace; here is the output filtered to anything containing resolv.conf.
cat strace.txt | grep resolv.conf
314004 read(3, "c\",\"nodev\",\"ro\"]},{\"destination\":\"/dev/shm\",\"type\":\"bind\",\"source\":\"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/shm\",\"options\":[\"rbind\",\"ro\",\"nosuid\",\"nodev\",\"noexec\"]},{\"destination\":\"/etc/resolv.conf\",\"type\":\"bind\",\"source\":\"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf\",\"options\":[\"rbind\",\"ro\"]}],\"annotations\":{\"io.kubernetes.cri.container-type\":\""..., 2048) = 2048
314013 write(14, "l\1\0\0000\362\1\0\1\0\0\0\0\0\0\0\10\0\221j\0\0\2|\30\0\223j0 3975479296 65536\n\0\30\0\224j0 3975479296 65536\n\0\10\0\225j\1\0\0\0\t\0\226j-998\0\0\0\0\10\0\227j\0\0\0\0\10\1\232j\0\0\0\0\0/run/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/shm\0/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf\0\0", 364 <unfinished ...>
314010 read(3, "\10\0\221j\0\0\2|\30\0\223j0 3975479296 65536\n\0\30\0\224j0 3975479296 65536\n\0\10\0\225j\1\0\0\0\t\0\226j-998\0\0\0\0\10\0\227j\0\0\0\0\10\1\232j\0\0\0\0\0/run/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/shm\0/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf\0\0", 348) = 348
314010 openat(AT_FDCWD, "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf", O_RDONLY|O_CLOEXEC|O_PATH <unfinished ...>
314017 newfstatat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
314017 newfstatat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
314017 newfstatat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0755, st_size=0, ...}, 0) = 0
314017 newfstatat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
314017 openat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", O_RDONLY|O_CLOEXEC|O_PATH) = 8
314017 readlinkat(AT_FDCWD, "/proc/self/fd/8", "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", 128) = 56
314017 newfstatat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=77, ...}, AT_SYMLINK_NOFOLLOW) = 0
314017 openat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", O_RDONLY|O_CLOEXEC|O_PATH) = 8
314017 readlinkat(AT_FDCWD, "/proc/self/fd/8", "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", 128) = 56
314017 newfstatat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=77, ...}, AT_SYMLINK_NOFOLLOW) = 0
314017 openat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", O_RDONLY|O_CLOEXEC|O_PATH) = 8
314017 readlinkat(AT_FDCWD, "/proc/self/fd/8", "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", 128) = 56
314017 write(3, "{\"message\":\"error mounting \\\"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf\\\" to rootfs at \\\"/etc/resolv.conf\\\": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/8), flags: 0x5021: operation not permitted\"}", 301 <unfinished ...>
314013 <... read resumed>"{\"message\":\"error mounting \\\"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf\\\" to rootfs at \\\"/etc/resolv.conf\\\": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/8), flags: 0x5021: operation not permitted\"}", 512) = 301
314012 write(2, "\33[31mERRO\33[0m[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: error during container init: error mounting \"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf\" to rootfs at \"/etc/resolv.conf\": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/8), flags: 0x5021: operation not permitted \n", 418) = 418
This is how the directory permissions are set up:
# ls -ld / /var /var/lib/ /var/lib/containerd/ /var/lib/containerd/io.containerd.grpc.v1.cri/ /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/ /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/ /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf
drwxr-xr-x 20 root root 4096 Jan 21 10:59 /
drwxr-xr-x 9 root root 4096 Mar 20 16:45 /var
drwxr-xr-x 23 root root 4096 Mar 20 22:00 /var/lib/
drwxr-xr-x 12 root root 4096 Mar 20 20:18 /var/lib/containerd/
drwxr-xr-x 4 root root 4096 Mar 20 16:48 /var/lib/containerd/io.containerd.grpc.v1.cri/
drwxr-xr-x 16 root root 4096 Mar 20 22:08 /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/
drwxr-xr-x 2 root root 4096 Mar 20 16:48 /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/
-rw-r--r-- 1 root root 77 Mar 20 16:48 /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf
@vinayakankugoyal if you can repro with runc run, can you then do all the things I asked for, instead of just these? The full strace uploaded to some site would be nice; the grep doesn't seem to show useful information.
Also, you might not have reproduced the issue: the strace output you pasted does not show any mount calls, and we do know it is failing in the mount call (technically it could be failing in the prior checks, but that seems unlikely as that mount works without userns).
Your repro is confusing, but if your kernel is v5.15 and you set sysctls in a new user namespace, maybe you ran into this problem:
net: Don't export sysctls to unprivileged users
tbl[0].data = &net->core.sysctl_somaxconn;
/* Don't export any sysctls to unprivileged users */
if (net->user_ns != &init_user_ns) {
tbl[0].procname = NULL;
}
This code is in v5.15.
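If that is the cause, a rough way to check it outside of containers (just my assumption of how it would show up) is that the sysctl simply isn't registered in a netns owned by a non-init user namespace:
unshare -Urn cat /proc/sys/net/core/somaxconn
# on an affected kernel this should fail with: No such file or directory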
@rata - here is the strace output. strace.txt
@vinayakankugoyal cool. Can you paste the other things I asked here?
Sorry I'm repeating this in a loop, but you continue to answer with partial information and don't comment at all on the other things I asked. If you plan to do them later, please let me know, so I don't ask you repeatedly.
@vinayakankugoyal So, let's go one report at a time.
For Ubuntu 22.04: I can't reproduce. I've installed a VM with Ubuntu 22.04 and run apt dist-upgrade to have the latest versions. This is what I did because I'm very used to this development setup, no other special reason.
- apt install runc # this installs runc 1.1.4 from ubuntu repos. Make sure you are installing 1.1.4, you need the jammy-updates apt repo to install this, otherwise it will install 1.1.0
- Install containerd 1.7.0 from official binaries, as explained here: https://github.com/containerd/containerd/blob/main/docs/getting-started.md#option-1-from-the-official-binaries
- clone containerd repo and run this script, to configure the CNI: https://github.com/containerd/containerd/blob/main/script/setup/install-cni
- Created a containerd config file with: containerd config default > config.toml
- Changed the root, state and grpc.address variables so it doesn't conflict with the system containerd (in case you have one)
- In my case I used: root = "/var/lib/containerd-rata", state = "/run/containerd-rata", and for the grpc section, address = "/run/containerd-rata/containerd.sock"
- Note it is okay that these folders don't exist, they will be created by containerd later
- start containerd and leave it running: sudo containerd --config config.toml
- Install go 1.20.2 from official binaries (you can find instructions here: https://go.dev/doc/install)
- Clone kubernetes repo: https://github.com/kubernetes/kubernetes
- Switch to the k8s 1.26.1 tag (as you told me you were using): git checkout v1.26.1
- In the terminal where you will later start kubernetes, do (or adjust the var names to yours):
export CONTAINER_RUNTIME_ENDPOINT=/run/containerd-rata/containerd.sock
export IMAGE_SERVICE_ENDPOINT=/run/containerd-rata/containerd.sock
export CONTAINER_RUNTIME=remote
- apt install make
- start kubernetes: hack/local-up-cluster.sh
- The first time that command will ask you to run other commands to install etcd, etc. Follow those steps and run hack/local-up-cluster.sh again
This will start a k8s cluster and you should see in your terminal running containerd some activity (it will try to create the coredns pod).
In this setup, I've applied the pod you sent me via slack:
apiVersion: v1
kind: Pod
metadata:
  name: namespace-user-vinaygo
spec:
  hostUsers: false
  containers:
  - name: namespace-user-vinaygo
    image: debian
    command:
    - sleep
    - infinity
You can download the kubectl binary and, as the k8s start script printed, export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig and use kubectl, which will just work with that cluster.
The pod was created just fine and with userns:
$ ./kubectl exec -ti namespace-user-vinaygo -- bash
root@namespace-user-vinaygo:/# cat /proc/self/uid_map
0 1473642496 65536
Note that in this Ubuntu version there is a bug (it seems to be fixed in latest Debian and will probably come in Ubuntu 23.04) where hitting ctrl-c doesn't work to stop the k8s cluster. If you are using a VM via ssh, you can kill the processes by typing ~. (this kills the ssh session, and that usually sends a SIGHUP that kills the processes). But make sure all the processes have died before starting k8s again, as otherwise you will see weird errors (certificates are regenerated, so auth will fail and all the world collides in weird ways). Maybe something like this helps to kill them all, but verify, or just reboot the server if you are unsure :-D: kill $(ps faux | grep hack | grep bash | awk '{ print $2 }'); sudo pkill -f kube-scheduler; sudo rm /var/run/kubernetes/kubelet-rotated.kubeconfig
Due to that, you might need to run this after killing all the k8s processes but before starting them again: sudo chown $UID:$UID -R /var/run/kubernetes/ and sudo chown $UID:$UID /tmp/kube-apiserver-audit.log. I've submitted fixes for this in k8s already, but they are not present in 1.26 :)
If you do this, does it work for you? Can you try to find what the difference is between this and the Ubuntu node you are using? And would you care to share how it is installed (all components, the OS, the container runtime, CNI, runc, etc.)?
Regarding the bats error you pasted here:
This is because you installed an old version of bats. If you install latest from source, it will not throw that error: https://bats-core.readthedocs.io/en/stable/installation.html#any-os-installing-bats-from-source
The tests run fine in ubuntu 22.04 with a new version of bats (expected, as that is tested in the CI too IIRC).
Regarding the config.json you sent me on slack: config-slack.txt. This is a config.json that was created on COS when you hit the issue in the k8s cluster. With that config.json I could repro the issue in Ubuntu 22.04. But it seems like a red herring.
First, I deleted the somaxconn sysctl line, which isn't added on Ubuntu when I start my k8s cluster here (I guess COS has some specific config to add those?).
When the mount is pointing to the host /etc/resolv.conf, in ubuntu 22.04 (at least in this default config on an Azure VM) that file is a symlink. If you copy the file (cp /etc/resolv.conf .) and then point to this new file, the container starts fine.
It also starts fine if you keep the host /etc/resolv.conf and change the mount options to add "nosuid", "nodev", "noexec".
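A hypothetical jq one-liner to make that change to the resolv.conf mount entry in config.json (assuming jq is available):
jq '(.mounts[] | select(.destination == "/etc/resolv.conf") | .options) += ["nosuid","nodev","noexec"]' config.json > config.json.new && mv config.json.new config.json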
Do you have a config.json generated in your k8s setup when it fails on Ubuntu?
Also, can you try on COS whether adding those options to the mount makes it work?
I'm out of ideas on how I can reproduce this. Besides the missing things that you will send when you can, also:
- Can you confirm that, if you install a k8s cluster with runc 1.1.4, containerd 1.7.0 and k8s 1.25 or 1.26 (NOT 1.27) on Ubuntu 22.04, creating a pod with user namespaces like the one here works?
- Can you confirm that the Ubuntu issue is only when running runc manually with the config.json generated on COS?
Regarding COS, whenever you send the other info we'll have more insight into what might be happening. My gut feeling now is that it might be configured to use some options in the mount that don't work with userns, although I'm not sure how they manage to add those options to the config.json.
But let me know if in your setup with Ubuntu this works fine (is this a GKE cluster?).
Thanks for all the details!
Ubuntu
Like I mentioned in chat, Ubuntu works if I remove the somaxconn sysctl. It seems like in GKE the kubelet was adding that to all pods, and I turned that "feature" off. Now the pod is able to come up just fine.
I'll have to follow up on whether somaxconn is intended to work for user-namespaced pods or not.
COS
For COS, when I update the options for the /etc/resolv.conf mount with "nosuid", "nodev", "noexec" in the config.json file I shared earlier, it is now able to mount /etc/resolv.conf, but I get a new failure:
runc --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[14931]: => nsexec container setup
DEBU[0000] nsexec[14931]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[14931]: ~> nsexec stage-0
DEBU[0000] nsexec-0[14931]: spawn stage-1
DEBU[0000] nsexec-0[14931]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[14935]: ~> nsexec stage-1
DEBU[0000] nsexec-1[14935]: unshare user namespace
DEBU[0000] nsexec-1[14935]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[14931]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[14931]: update /proc/14935/uid_map to '0 2121269248 65536
'
DEBU[0000] nsexec-0[14931]: update /proc/14935/gid_map to '0 2121269248 65536
'
DEBU[0000] nsexec-1[14935]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[14935]: unshare remaining namespace (except cgroupns)
DEBU[0000] nsexec-1[14935]: spawn stage-2
DEBU[0000] nsexec-1[14935]: request stage-0 to forward stage-2 pid (14936)
DEBU[0000] nsexec-0[14931]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[14931]: forward stage-1 (14935) and stage-2 (14936) pids to runc
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-1[14935]: signal completion to stage-0
DEBU[0000] nsexec-0[14931]: stage-1 complete
DEBU[0000] nsexec-0[14931]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[14931]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[14931]: signalling stage-2 to run
DEBU[0000] nsexec-1[14935]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-0[14931]: stage-2 complete
DEBU[0000] nsexec-0[14931]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[14931]: <~ nsexec stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] child process in init()
DEBU[0000]libcontainer/cgroups/systemd/common.go:296 libcontainer/cgroups/systemd.generateDeviceProperties() skipping device /dev/char/10:200 for systemd: stat /dev/char/10:200: no such file or directory
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: exec /pause: permission denied
Thanks for all the details!
You are welcome.
But please, please, PLEASE understand that I'm spending a lot of time on this, and in part that is due to the terse bug reports and replies you post here, which just don't say enough. With Ubuntu I can spin up a server and spend a lot of time (even though having to spend a lot of time has a significant impact on my day-to-day tasks), but with COS, which is either closed source or only runs on Google Cloud, this is of course even more difficult.
For example, if you had mentioned that on Ubuntu the sysctl was the only issue, that it is a feature you have on GKE that you can turn off, and that when you do so everything works fine, that would have saved me several HOURS of trying it, writing the elaborate post I did here with clear step-by-step instructions, etc.
Another thing that would help is saying exactly how you run something to produce some output. If it is a k8s cluster, then any details needed in the setup, etc. More generally, please assume others don't know anything other than what you write. So explaining exactly what you did is critical, and maybe also ask yourself some questions before submitting, like: is this enough for someone on another laptop to reproduce what I have here, or is something missing or open to another interpretation? Am I being as clear as I can with what I write?
I'd really need you to start answering the questions I ask; if you can't answer some now and plan to answer them later, please do say so. And follow up on what you said you would do later (so far you said you would follow up on some things, but you didn't, and I don't know whether you consider them no longer relevant or will do them later; it seems weird, as for some things you do spend time but for others you don't, and I don't understand).
Also, I think if you don't know why something fixes something for you, then don't open PRs doing that change. We are debugging; we need to understand first what is happening (and while debugging we will find ways where things work, maybe more than one), and only then can we propose a fix (if any is really needed). If we try something for debugging that seems to help and you don't really understand why it helps or whether it fixes the problem, then opening a PR for that is not the flow I expect. Let's debug and understand first. We can open PRs later.
Like I mentioned in chat, Ubuntu works if I remove the somaxconn sysctl. It seems like in GKE kubelet was adding that to all pods, and I turned that "feature" off. Now the pod is able to come up just fine.
This is not at all what I understood from what you said in the chat, but great that it works! I'll need you once again to be more verbose here. How do you disable that "feature" in GKE? This bug report can be useful for others only if you share this.
Also, is that enabled by default on GKE? Is it part of some kubernetes upstream project? Or how is GKE adding that?
I'll have to follow up on whether somaxconn is intended to work for user-namespaced pods or not.
Cool, but please do follow up on this.
Regarding COS, is the filesystem that the resolv.conf lives on, when running from k8s, mounted with those flags?
If not, is the file a symlink, or is there a symlink in any of the path components leading to the resolv.conf?
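A sketch of one way to check both, using the resolv.conf path from the ls -ld output above (findmnt shows the filesystem the file lives on and its mount options, namei walks the path and flags symlinks):
findmnt -T /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf
namei -l /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf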
Also, you didn't mention it at all, but did that containerd patch make the flow from k8s work? I mean creating a pod from k8s, using the patched containerd, and having the pod start with userns.
Regarding the last issue you pasted, about permission denied, can you try whether it is fixed with runc built from git with this PR applied? https://github.com/opencontainers/runc/pull/3753
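A sketch of one way to build runc with that PR applied (the branch name is arbitrary; building runc needs go and the libseccomp headers installed):
git clone https://github.com/opencontainers/runc && cd runc
git fetch origin pull/3753/head:pr-3753 && git checkout pr-3753
make
# then run the resulting ./runc binary manually, or point the runtime at it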
@rata - I appreciate your help on this, but please allow me to clarify.
I have mentioned, both in chat and now in this report, that Ubuntu seems to be working once I remove the somaxconn sysctl. I had posted that Ubuntu works if I remove the sysctl a week ago. I did not know myself why those sysctls were being added in GKE and only learned about that feature recently, and I am still following up on how it can be turned off. AFAIK there isn't a way to turn it off in GKE other than messing with the kubelet config, which is what I did, but I am following up with the GKE team to understand more.
I initially repro'd this issue on COS and only switched the repro to Ubuntu because in a comment you mentioned:
- Find a way to repro this on debian or some other distro that we can have easy acces on. This would help a lot
Again sorry for the confusion about this and the wasted hours. But if you are not clear on some details in chat please don't hesitate to clarify. You have been extremely gracious with your time and I don't want you to waste it because of miscommunication.
Let me try to give as much detail as I can about my setup now:
Ubuntu
I created a GKE cluster using the following command:
gcloud container clusters create host-user-vinaygo-ubuntu --num-nodes=1 --cluster-version=1.26.1-gke.1500 --enable-kubernetes-alpha --no-enable-autorepair --no-enable-autoupgrade --release-channel=rapid --image-type=ubuntu_containerd
Notice that the cluster above only has 1 node. After the cluster came up I ssh'd into the node and did the following:
Ubuntu Version
cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
uname -a
Linux gke-host-user-vinaygo-ub-default-pool-03c51c97-2nds 5.15.0-1024-gke #29-Ubuntu SMP Fri Dec 16 06:28:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Install containerd
1.7.0
I ran the following commands under /home/kubernetes/bin/
wget -c https://github.com/containerd/containerd/releases/download/v1.7.0/containerd-1.7.0-linux-amd64.tar.gz
tar -xzf containerd-1.7.0-linux-amd64.tar.gz
mount --bind /home/kubernetes/bin/bin/containerd /usr/bin/containerd
systemctl restart containerd.service
Install runc
1.1.4
I ran the following commands under /home/kubernetes/bin/
wget -c https://github.com/opencontainers/runc/releases/download/v1.1.4/runc.amd64
chmod u+x /home/kubernetes/bin/runc.amd64
mount --bind /home/kubernetes/bin/runc.amd64 /usr/sbin/runc
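(A quick sanity check, not part of the original write-up, that the bind-mounted binaries are the ones actually in use:)
containerd --version   # should now report v1.7.0
runc --version         # should now report 1.1.4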
Updated kubelet to not add somaxconn to any pod
sed -i 's/net.core.somaxconn=1024,//g' /etc/default/kubelet
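(Presumably the kubelet also needs a restart to pick this up; the COS write-up further down includes that step explicitly. Something like:)
systemctl restart kubelet
grep -c somaxconn /etc/default/kubelet   # should print 0 after the edit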
Created a Pod
Now that the node was set up correctly I created the following Pod.
gcloud container clusters get-credentials host-user-vinaygo-ubuntu
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: namespace-user-vinaygo
spec:
  hostUsers: false
  containers:
  - name: namespace-user-vinaygo
    image: debian
    command:
    - sleep
    - infinity
EOF
Note that this pod comes up fine.
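(Not in the original report, but a quick way to confirm the pod really got a user namespace; the mapping should not start at host UID 0:)
kubectl exec namespace-user-vinaygo -- cat /proc/self/uid_map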
COS
I created a GKE cluster using the following command:
gcloud container clusters create host-user-vinaygo-cos --num-nodes=1 --cluster-version=1.26.1-gke.1500 --enable-kubernetes-alpha --no-enable-autorepair --no-enable-autoupgrade --release-channel=rapid --image-type=cos_containerd
Notice that the cluster above only has 1 node. After the cluster came up I ssh'd into the node and did the following:
COS version
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
GOOGLE_METRICS_PRODUCT_ID=26
KERNEL_COMMIT_ID=44456f0e9d2cd7a9616fb0d05bc4020237839a5a
GOOGLE_CRASH_ID=Lakitu
VERSION=101
VERSION_ID=101
BUILD_ID=17162.40.56
uname -a
Linux gke-host-user-vinaygo-co-default-pool-56de25b8-1kzv 5.15.65+ #1 SMP Sat Jan 21 10:12:05 UTC 2023 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux
Install containerd
1.7.0
I ran the following commands under /home/kubernetes/bin/
wget -c https://github.com/containerd/containerd/releases/download/v1.7.0/containerd-1.7.0-linux-amd64.tar.gz
tar -xzf containerd-1.7.0-linux-amd64.tar.gz
mount --bind /home/kubernetes/bin/bin/containerd /usr/bin/containerd
systemctl restart containerd.service
Install runc
1.1.4 but also store the config.json before calling runc
I ran the following commands under /home/kubernetes/bin/
wget -c https://github.com/opencontainers/runc/releases/download/v1.1.4/runc.amd64
chmod u+x /home/kubernetes/bin/runc.amd64
cat > runcwrapper << 'EOF'
#!/bin/bash
# Log every invocation; when the 9th argument is --bundle (as in containerd's
# create invocation), snapshot the bundle dir (config.json included) before
# exec'ing the real runc.
echo "Starting my runc: $(date)" >> /tmp/runc-wrapper.log
echo "The command line args are $@" >> /tmp/runc-wrapper.log
if [ "${9}" = "--bundle" ]; then
  echo "Getting config.json" >> /tmp/runc-wrapper.log
  mkdir -p /tmp/runc-wrapper-debug-k8s/
  cp -ar "${10}" "/tmp/runc-wrapper-debug-k8s/$$/"
fi
exec /home/kubernetes/bin/runc.amd64 --debug "$@"
EOF
chmod u+x /home/kubernetes/bin/runcwrapper
mount --bind /home/kubernetes/bin/runcwrapper /usr/bin/runc
Updated kubelet to not add somaxconn to any pod
sed -i 's/net.core.somaxconn=1024,//g' /etc/default/kubelet
Created a Pod
Now that the node was set up correctly I created the following Pod.
gcloud container clusters get-credentials host-user-vinaygo-cos
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: namespace-user-vinaygo
spec:
  hostUsers: false
  containers:
  - name: namespace-user-vinaygo
    image: debian
    command:
    - sleep
    - infinity
EOF
The pod is stuck in the ContainerCreating state.
Now I investigate the config.json that the wrapper saved.
ls /tmp/runc-wrapper-debug-k8s/
8917 9004 9119
cd /tmp/runc-wrapper-debug-k8s/9119
ls -la
total 24
drwx--x--- 3 root 1354629120 200 Mar 27 21:05 .
drwxr-xr-x 31 root root 620 Mar 27 21:11 ..
-rw-r--r-- 1 root root 89 Mar 27 21:05 address
-rw-r--r-- 1 root root 9722 Mar 27 21:05 config.json
prwx------ 1 root root 0 Mar 27 21:05 log
-rw------- 1 root root 23 Mar 27 21:05 options.json
drwxr-xr-x 2 1354629120 1354629120 80 Mar 27 21:05 rootfs
-rw------- 1 root root 0 Mar 27 21:05 runtime
-rw------- 1 root root 32 Mar 27 21:05 shim-binary-path
lrwxrwxrwx 1 root root 121 Mar 27 21:05 work -> /var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/868a8dcae720dbef6d70ca24f8df075b231d70c4d4a2868369dac11577cf4e02
Now I first try to run this config.json
I run the following command in /tmp/runc-wrapper-debug-k8s/9119 which has the config.json file
/tmp/runc-wrapper-debug-k8s/9119 # /home/kubernetes/bin/runc.amd64 --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[13009]: => nsexec container setup
DEBU[0000] nsexec[13009]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[13009]: ~> nsexec stage-0
DEBU[0000] nsexec-0[13009]: spawn stage-1
DEBU[0000] nsexec-0[13009]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[13014]: ~> nsexec stage-1
DEBU[0000] nsexec-1[13014]: unshare user namespace
DEBU[0000] nsexec-1[13014]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[13009]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[13009]: update /proc/13014/uid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-0[13009]: update /proc/13014/gid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-1[13014]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[13014]: unshare remaining namespace (except cgroupns)
FATA[0000] nsexec-0[13009]: failed to open mount source /run/containerd/io.containerd.grpc.v1.cri/sandboxes/868a8dcae720dbef6d70ca24f8df075b231d70c4d4a2868369dac11577cf4e02/shm: No such file or directory
FATA[0000] nsexec-1[13014]: failed to receive fd from unix socket 8: Invalid argument
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: can't get final child's PID from pipe: EOF
This error is only because containerd cleaned up the sandbox.
So I update the config.json with:
sed -i 's@/run/containerd/io.containerd.grpc.v1.cri/sandboxes/868a8dcae720dbef6d70ca24f8df075b231d70c4d4a2868369dac11577cf4e02/shm@/dev/shm@g' config.json
Now I rerun:
/tmp/runc-wrapper-debug-k8s/9119 # /home/kubernetes/bin/runc.amd64 --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[14379]: => nsexec container setup
DEBU[0000] nsexec[14379]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[14379]: ~> nsexec stage-0
DEBU[0000] nsexec-0[14379]: spawn stage-1
DEBU[0000] nsexec-0[14379]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[14381]: ~> nsexec stage-1
DEBU[0000] nsexec-1[14381]: unshare user namespace
DEBU[0000] nsexec-1[14381]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[14381]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[14379]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[14379]: update /proc/14381/uid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-0[14379]: update /proc/14381/gid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-1[14381]: unshare remaining namespace (except cgroupns)
FATA[0000] nsexec-0[14379]: failed to open mount source /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/868a8dcae720dbef6d70ca24f8df075b231d70c4d4a2868369dac11577cf4e02/resolv.conf: No such file or directory
FATA[0000] nsexec-1[14381]: failed to receive fd from unix socket 8: Invalid argument
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: can't get final child's PID from pipe: EOF
This error is because containerd cleaned up the sandbox.
So I update the config.json with:
sed -i 's@/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/868a8dcae720dbef6d70ca24f8df075b231d70c4d4a2868369dac11577cf4e02/resolv.conf@/etc/resolv.conf@g' config.json
Now I rerun the following:
/tmp/runc-wrapper-debug-k8s/9119 # /home/kubernetes/bin/runc.amd64 --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[15862]: => nsexec container setup
DEBU[0000] nsexec[15862]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[15862]: ~> nsexec stage-0
DEBU[0000] nsexec-0[15862]: spawn stage-1
DEBU[0000] nsexec-0[15862]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[15868]: ~> nsexec stage-1
DEBU[0000] nsexec-1[15868]: unshare user namespace
DEBU[0000] nsexec-1[15868]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[15862]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[15862]: update /proc/15868/uid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-0[15862]: update /proc/15868/gid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-1[15868]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[15868]: unshare remaining namespace (except cgroupns)
DEBU[0000] nsexec-1[15868]: spawn stage-2
DEBU[0000] nsexec-1[15868]: request stage-0 to forward stage-2 pid (15869)
DEBU[0000] nsexec-0[15862]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[15862]: forward stage-1 (15868) and stage-2 (15869) pids to runc
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-1[15868]: signal completion to stage-0
DEBU[0000] nsexec-0[15862]: stage-1 complete
DEBU[0000] nsexec-0[15862]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[15862]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[15862]: signalling stage-2 to run
DEBU[0000] nsexec-1[15868]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-0[15862]: stage-2 complete
DEBU[0000] nsexec-0[15862]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[15862]: <~ nsexec stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] child process in init()
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: error during container init: error mounting "/etc/resolv.conf" to rootfs at "/etc/resolv.conf": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/8), flags: 0x5021: operation not permitted
Here are the relevant permissions on COS:
ls -dl / /var /var/lib/ /var/lib/containerd/ /var/lib/containerd/io.containerd.grpc.v1.cri/ /var/lib/containerd/io.containerd.grpc.v1.cri/containers/ /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes /var/lib/containerd/io.containerd.grpc.v1.cri/containers/17459338903803feb96dbcc21fabda6bf4f89d259be7b27964370a119513b723/ /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/02101d92680d89025fdb18bf26656b41cc859f1f8fe515ed587c0374d9673bad/
drwxr-xr-x 20 root root 4096 Jan 21 10:59 /
drwxr-xr-x 9 root root 4096 Mar 27 20:48 /var
drwxr-xr-x 23 root root 4096 Mar 28 00:00 /var/lib/
drwxr-xr-x 12 root root 4096 Mar 27 21:03 /var/lib/containerd/
drwxr-xr-x 4 root root 4096 Mar 27 20:53 /var/lib/containerd/io.containerd.grpc.v1.cri/
drwxr-xr-x 25 root root 4096 Mar 27 20:55 /var/lib/containerd/io.containerd.grpc.v1.cri/containers/
drwxr-xr-x 2 root root 4096 Mar 27 21:02 /var/lib/containerd/io.containerd.grpc.v1.cri/containers/17459338903803feb96dbcc21fabda6bf4f89d259be7b27964370a119513b723/
drwxr-xr-x 15 root root 4096 Mar 28 00:12 /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes
drwxr-xr-x 2 root root 4096 Mar 27 20:53 /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/02101d92680d89025fdb18bf26656b41cc859f1f8fe515ed587c0374d9673bad/
ls -ld /dev/shm /etc/resolv.conf
drwxrwxrwt 2 root root 40 Mar 27 20:48 /dev/shm
lrwxrwxrwx 1 root root 32 Jan 21 10:34 /etc/resolv.conf -> /run/systemd/resolve/resolv.conf
ls -la /run/systemd/resolve
total 8
drwxr-xr-x 2 systemd-resolve systemd-resolve 100 Mar 27 20:55 .
drwxr-xr-x 24 root root 580 Mar 27 21:41 ..
srw-rw-rw- 1 systemd-resolve systemd-resolve 0 Mar 27 20:48 io.systemd.Resolve
-rw-r--r-- 1 systemd-resolve systemd-resolve 831 Mar 27 20:48 resolv.conf
-rw-r--r-- 1 systemd-resolve systemd-resolve 961 Mar 27 20:48 stub-resolv.conf
I also checked some mounts
mount | grep /etc
tmpfs on /etc/machine-id type tmpfs (ro,size=804600k,nr_inodes=819200,mode=755)
overlayfs on /etc type overlay (rw,relatime,lowerdir=/etc,upperdir=/tmp/etc_overlay/etc,workdir=/tmp/etc_overlay/.work)
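(Aside, not part of the original report: since /etc/resolv.conf is a symlink into /run, the flags that matter for the bind source are those of the mount holding the symlink target rather than the /etc overlay. On many systemd-based systems /run is a tmpfs mounted with nosuid,nodev, and inside a user namespace the kernel refuses a read-only bind remount that would drop such locked flags, which would explain the EPERM above. One way to check:)
findmnt /run -o TARGET,FSTYPE,OPTIONS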
Now I looked at the mounts like you suggested and noticed the following in config.json:
{
"destination": "/dev/shm",
"type": "bind",
"source": "/dev/shm",
"options": [
"rbind",
"ro",
"nosuid",
"nodev",
"noexec"
]
},
{
"destination": "/etc/resolv.conf",
"type": "bind",
"source": "/etc/resolv.conf",
"options": [
"rbind",
"ro"
]
}
As you mentioned, the /etc/resolv.conf mount did not have nosuid, nodev, noexec, so I updated config.json to add them. After the update it looks like:
{
"destination": "/dev/shm",
"type": "bind",
"source": "/dev/shm",
"options": [
"rbind",
"ro",
"nosuid",
"nodev",
"noexec"
]
},
{
"destination": "/etc/resolv.conf",
"type": "bind",
"source": "/etc/resolv.conf",
"options": [
"rbind",
"ro",
"nosuid",
"nodev",
"noexec"
]
}
Now I rerun the container using runc:
/home/kubernetes/bin/runc.amd64 --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[19004]: => nsexec container setup
DEBU[0000] nsexec[19004]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[19004]: ~> nsexec stage-0
DEBU[0000] nsexec-0[19004]: spawn stage-1
DEBU[0000] nsexec-0[19004]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[19009]: ~> nsexec stage-1
DEBU[0000] nsexec-1[19009]: unshare user namespace
DEBU[0000] nsexec-1[19009]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[19004]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[19004]: update /proc/19009/uid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-0[19004]: update /proc/19009/gid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-1[19009]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[19009]: unshare remaining namespace (except cgroupns)
DEBU[0000] nsexec-1[19009]: spawn stage-2
DEBU[0000] nsexec-1[19009]: request stage-0 to forward stage-2 pid (19010)
DEBU[0000] nsexec-0[19004]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[19004]: forward stage-1 (19009) and stage-2 (19010) pids to runc
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-1[19009]: signal completion to stage-0
DEBU[0000] nsexec-0[19004]: stage-1 complete
DEBU[0000] nsexec-0[19004]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[19004]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[19004]: signalling stage-2 to run
DEBU[0000] nsexec-1[19009]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-0[19004]: stage-2 complete
DEBU[0000] nsexec-0[19004]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[19004]: <~ nsexec stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] child process in init()
DEBU[0000]libcontainer/cgroups/systemd/common.go:296 libcontainer/cgroups/systemd.generateDeviceProperties() skipping device /dev/char/10:200 for systemd: stat /dev/char/10:200: no such file or directory
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: exec /pause: permission denied
Now it looks like the error for mounting is fixed but we have a different error.
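(One possible culprit worth checking here, purely speculation and not established in the thread: the bundle was copied under /tmp, and if /tmp is mounted noexec on COS, exec'ing /pause from that copy would be refused regardless of the file's permission bits. A quick check:)
findmnt -T /tmp/runc-wrapper-debug-k8s/9119/rootfs -o TARGET,OPTIONS   # look for noexec
ls -l /tmp/runc-wrapper-debug-k8s/9119/rootfs/pause                    # execute bit should be set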
Now I build runc from git with this PR applied: https://github.com/opencontainers/runc/pull/3753
I did this on COS, so I had to run the toolbox container:
toolbox
apt-get install git wget libseccomp-dev
git clone https://github.com/kolyshkin/runc.git -b user-exec
cd runc
make static
cp runc /media/root/home/kubernetes/bin/runc.git
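(A quick check, not in the report, that the static build is usable from the host after leaving the toolbox:)
/home/kubernetes/bin/runc.git --version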
Then I exited the toolbox container and ran the following command in the same folder as the config.json from the previous attempts.
/tmp/runc-wrapper-debug-k8s/9119 # /home/kubernetes/bin/runc.git --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[61499]: => nsexec container setup
DEBU[0000] nsexec[61499]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[61499]: ~> nsexec stage-0
DEBU[0000] nsexec-0[61499]: spawn stage-1
DEBU[0000] nsexec-0[61499]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[61505]: ~> nsexec stage-1
DEBU[0000] nsexec-1[61505]: unshare user namespace
DEBU[0000] nsexec-1[61505]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[61499]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[61499]: update /proc/61505/uid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-0[61499]: update /proc/61505/gid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-1[61505]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[61505]: unshare remaining namespaces (except cgroupns)
DEBU[0000] nsexec-1[61505]: request stage-0 to send mount sources
DEBU[0000] nsexec-0[61499]: stage-1 requested to open mount sources
DEBU[0000] nsexec-0[61499]: ~> sending fd for: /dev/shm
DEBU[0000] nsexec-0[61499]: ~> sending fd for: /etc/resolv.conf
DEBU[0000] nsexec-1[61505]: spawn stage-2
DEBU[0000] nsexec-1[61505]: request stage-0 to forward stage-2 pid (61506)
DEBU[0000] nsexec-0[61499]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[61499]: forward stage-1 (61505) and stage-2 (61506) pids to runc
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-1[61505]: signal completion to stage-0
DEBU[0000] nsexec-0[61499]: stage-1 complete
DEBU[0000] nsexec-0[61499]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[61499]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[61499]: signalling stage-2 to run
DEBU[0000] nsexec-1[61505]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-0[61499]: stage-2 complete
DEBU[0000] nsexec-0[61499]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[61499]: <~ nsexec stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] child process in init()
ERRO[0000]utils.go:62 main.fatalWithCode() runc run failed: unable to start container process: exec: "/pause": permission denied
That PR does not seem to fix this.
Regarding running bats on COS
I am using bats-core 1.9.0 on COS and getting the following errors.
../bats-core-1.9.0/bin/bats -t tests/integration/userns.bats
1..4
not ok 1 userns with simple mount
# (in test file tests/integration/userns.bats, line 34)
# `[ "$status" -eq 0 ]' failed
# runc spec (status=0):
#
# runc run test_busybox (status=1):
# time="2023-03-28T01:19:24Z" level=error msg="runc run failed: unable to start container process: exec: \"sh\": executable file not found in $PATH"
not ok 2 userns with 2 inaccessible mounts
# (in test file tests/integration/userns.bats, line 52)
# `[ "$status" -eq 0 ]' failed
# runc spec (status=0):
#
# runc run test_busybox (status=1):
# time="2023-03-28T01:19:24Z" level=error msg="runc run failed: unable to start container process: exec: \"sh\": executable file not found in $PATH"
not ok 3 userns with inaccessible mount + exec
# (in test file tests/integration/userns.bats, line 62)
# `[ "$status" -eq 0 ]' failed
# runc spec (status=0):
#
# runc run -d --console-socket /tmp/bats-run-86Yrds/runc.9V71dI/tty/sock test_busybox (status=1):
# time="2023-03-28T01:19:25Z" level=error msg="runc run failed: unable to start container process: exec: \"sh\": executable file not found in $PATH"
ok 4 userns with bind mount before a cgroupfs mount # skip test requires cgroups_v1
Regarding cgroup v1 and cgroup v2
I am using cgroup v2. I set the SystemdCgroup = true option in the containerd config, and when I run the runc command directly I pass the --systemd-cgroup flag.
cat /etc/containerd/config.toml | grep Cgroup
SystemdCgroup = true
COS supports cgroup v2, which I checked by running the following command:
grep cgroup /proc/filesystems
nodev cgroup
nodev cgroup2
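(For what it's worth, /proc/filesystems only shows what the kernel supports; a more direct check, not in the original report, of which hierarchy is actually mounted:)
stat -fc %T /sys/fs/cgroup   # prints cgroup2fs on a unified (v2-only) setup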
I don't plan to test cgroup v1 as all GKE setup is configured to cgroup v2.
SupplementaryGroups=0
Since none of the host dirs in question are actually 750, I don't think setting this would change anything, so I am not planning to do it.
Strace output
I already shared the strace output with you before, but please let me know if you need strace again from any of the steps above.
Regarding the PR
Given that containerd sets these options for other mounts, it makes sense to add them to the /etc/resolv.conf mount as well. Thanks for the feedback, but I disagree that it is not the right behavior for containerd to set these options consistently.
Hopefully that answers your questions. I was going to answer them today anyway, but I first posted about the progress that came from your idea of adding the options to the mount. I posted here before answering the other questions because it was significant that the mount errors were fixed, and I had already told you in Slack that I was working on the other things you asked for (they seemed less important than letting you know that changing the mount options fixes the mounting error).
I hope now we are on the same page and hopefully we can get to the bottom of this issue.
@rata - I think I got this to work on COS.
Here is what I did:
- Created a GKE COS Cluster using
gcloud container clusters create host-user-vinaygo-cos --num-nodes=1 --cluster-version=1.26.1-gke.1500 --enable-kubernetes-alpha --no-enable-autorepair --no-enable-autoupgrade --release-channel=rapid --image-type=cos_containerd
- SSH into the node and build containerd with my PR https://github.com/containerd/containerd/pull/8309. I did not have https://github.com/opencontainers/runc/pull/3753.
toolbox
apt-get update
apt-get install git wget
wget -c https://go.dev/dl/go1.20.2.linux-amd64.tar.gz
tar -C /usr/local -xzf go1.20.2.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
git clone https://github.com/vinayakankugoyal/containerd.git -b fixresolv
cd containerd
make binaries
cp ./bin/containerd /media/root/home/kubernetes/bin/containerd.git
- Then I ran the following to run containerd that I just built
mount --bind /home/kubernetes/bin/containerd.git /usr/bin/containerd
systemctl restart containerd.service
- Then I updated the runc version to 1.1.4
wget -c https://github.com/opencontainers/runc/releases/download/v1.1.4/runc.amd64
chmod 777 /home/kubernetes/bin/runc.amd64
mount --bind /home/kubernetes/bin/runc.amd64 /usr/bin/runc
- Then I updated kubelet in GKE to not add somaxconn to all the pods
sed -i 's/net.core.somaxconn=1024,//g' /etc/default/kubelet
systemctl restart kubelet
- Then I created the following Pod to see if it comes up
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: namespace-user-vinaygo
spec:
  hostUsers: false
  containers:
  - name: namespace-user-vinaygo
    image: debian
    command:
    - sleep
    - infinity
EOF
- Now back on the node I execed into the Pod to make sure it was running in user ns
crictl ps -a --name namespace-user
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
55d0a000a2b3a f5b06fd900402 19 minutes ago Running namespace-user-vinaygo 0 f5f27e3f69294 namespace-user-vinaygo
crictl exec -it 55d0a000a2b3a /bin/bash
root@namespace-user-vinaygo:/# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 2396 568 ? Ss 20:08 0:00 sleep infinity
root 348 0.0 0.0 4032 3404 pts/0 Ss 20:29 0:00 /bin/bash
root 355 0.0 0.0 6760 2972 pts/0 R+ 20:29 0:00 ps -aux
root@namespace-user-vinaygo:/# readlink /proc/self/ns/user
user:[4026532465]
root@namespace-user-vinaygo:/# cat /proc/self/uid_map
0 3306815488 65536
root@namespace-user-vinaygo:/#
- Now on the node I checked the UID of the process running the sleep infinity command:
ps ax o user:16,pid,command | grep "sleep infinity"
3306815488 34311 sleep infinity
root 40716 grep --colour=auto sleep infinity
As we can see the container process is running as UID 3306815488.
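(An additional cross-check, not in the original report: the pod's process should be in a different user namespace than the host's init user namespace.)
readlink /proc/1/ns/user                                        # host user namespace
readlink /proc/$(pgrep -f 'sleep infinity' | head -n1)/ns/user  # pod process's user namespace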
@vinayakankugoyal
I have mentioned both in chat and in the report now that Ubuntu seems to be working once I remove the somaxconn sysctl. I had posted that Ubuntu works if I remove the sysctl a week ago.
Sure, but that is manually running runc, not starting a k8s pod. My understanding was that runc was working when run manually, but that the k8s pod was still failing on Ubuntu for some reason other than the sysctl. Miscommunication, that is all :)
Let me try to give as much details as I can about my setup now:
Thanks, this report really helps A LOT.
COS
sed -i 's/net.core.somaxconn=1024,//g' /etc/default/kubelet
Ohh, great to know what you are doing. Then, can you paste the /etc/default/kubelet file? Or at least the section mentioning this sysctl and the other sysctl-relevant sections? I'd like to see whether, as I imagine, the kubelet is adding that unsafe sysctl to its allowed list, or what it is doing with it.
I'm curious to understand what GKE is doing here. My guess is that the kubelet allows that unsafe sysctl to be used, and that a mutating webhook adds those sysctls to the pod, or something like that. But if it is not safe on one node, I don't see how the hook would realize that... Maybe something completely different is happening?
To verify this, can you:
- Create a pod (without user namespaces, it doesn't matter now) before modifying the /etc/default/kubelet file and get the output of kubectl get pod -o yaml? I want to see if the sysctls are set in the pod security context or something.
- Get the same kubectl output, but after changing the kubelet config and for a pod with user namespaces enabled, to see that it is indeed not there as it was before (example commands sketched below).
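(For concreteness, not part of the original comment, the two checks could look like the following; the pod name in the first command is a placeholder.)
kubectl get pod <pod-without-userns> -o jsonpath='{.spec.securityContext.sysctls}'      # before editing /etc/default/kubelet
kubectl get pod namespace-user-vinaygo -o jsonpath='{.spec.securityContext.sysctls}'    # after the change, pod with hostUsers: false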
....
I also checked some mounts
mount | grep /etc
tmpfs on /etc/machine-id type tmpfs (ro,size=804600k,nr_inodes=819200,mode=755)
overlayfs on /etc type overlay (rw,relatime,lowerdir=/etc,upperdir=/tmp/etc_overlay/etc,workdir=/tmp/etc_overlay/.work)
Right, but as /etc/resolv.conf is a symlink to another path outside /etc, we need to see the mount options of that path too. Just for completeness, can you post the output of mount | grep /run?
But what we are really interested in is the output of mount | grep var, as that is where the original resolv.conf is mounted from. My guess is that it is mounted with those options, but let's verify to be sure.
....
DEBU[0000]libcontainer/cgroups/systemd/common.go:296 libcontainer/cgroups/systemd.generateDeviceProperties() skipping device /dev/char/10:200 for systemd: stat /dev/char/10:200: no such file or directory
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: exec /pause: permission denied
I'm still curious about why you hit this error when running manually. Is it something obvious, like the execute bit missing on the binary, maybe due to some cp option missing?
There must be some difference between when those flags are added by containerd (which seems to work) and when we add them manually here...
Regarding cgroupsv1/v2: agree. It was relevant to know if you were using cgroups v1 as you could trigger some bugs with that, but if you are not using it, no need to try it out. Regarding supplemental groups: Those are not the only relevant directories, though. But yeah, no need to try it now that it works :)
Regarding the PR: my point was that we didn't know whether this helps any real use case (we do know now). If we want to open the PR for consistency, we should say so. If we need these options to make a real-world OS work, we need to say that instead. Until we know which case it is, we can't really open the PR with the reasons stated so it can be properly reviewed (reviewing a change that is needed to fix COS is not the same as reviewing one the author merely thinks is nice to have).
@rata - I think I got this to work on COS.
Great! Then what caused the permission denied error before, have you figured it out?
Regarding somax sysctl
Do you want to investigate further what we can do and follow up on that? I'll check what crun does too, just in case.
Regarding possible remount on runc
Do you want to open an issue here in runc and ask about remounting with those flags, even if they are not specified? Crun (another OCI compatible runtime) is doing that: https://github.com/containers/crun/blob/main/src/libcrun/linux.c#L919-L946.
We might want to do this in runc to keep compatibility with crun, or maybe not. I think opening an issue to discuss with the maintainers makes sense. If there is agreement on going down that route and you want to implement it, that would be great! :)
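(A rough sketch, not crun's actual code, of the fallback being discussed: if the plain read-only bind remount is refused, retry while preserving the restrictive options already present on the source mount. SRC and DST are placeholders for the bind-mount source and the target path.)
src_opts=$(findmnt -no OPTIONS --target "$SRC" | tr ',' '\n' \
           | grep -E '^(nosuid|nodev|noexec|noatime|nodiratime|relatime)$' | paste -sd, -)
mount -o remount,bind,ro "$DST" \
  || mount -o "remount,bind,ro${src_opts:+,$src_opts}" "$DST"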
Hope I'm not adding too much entropy to this discussion; this issue piqued my interest and, following @vinayakankugoyal's steps, I managed to also repro it on COS.
With runc at 1.1.4 (fetched from GitHub the same way as above) and containerd at 1.6.2 (the current COS version), I then bisected the issue down to a7adeb69769395193a0278c4bda6068011d06cde; the symptoms are the same as posted originally:
$ k describe pod namespace-user-lrascao
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/1091069a744cd525029de4a7b59d4f2ac3a9784bf2d64dbddb78b070e4f0481f/resolv.conf" to rootfs at "/etc/resolv.conf": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/9), flags: 0x5021: operation not permitted: unknown
With runc at 1.1.4 (fetched from GitHub the same way as above) and containerd at 1.6.2 (the current COS version), I then bisected the issue down to a7adeb69769395193a0278c4bda6068011d06cde; the symptoms are the same as posted originally:
@lrascao That is a commit on containerd, right? It seems to be the one I wrote "cri: Support pods with user namespaces".
Thanks for the effort, but it doesn't really add new information: before that commit user namespaces are not used, so containerd ignores all the user namespace settings and a regular pod is created. With that commit, the container with userns is created and, due to the special mount options of COS, it fails in that environment.
Thanks anyways :)
I opened a discussion thread in runc regarding remounting bind mounts if they fail with the right options. https://github.com/opencontainers/runc/discussions/3801.
@vinayakankugoyal friendly ping? I'll be busy with Kubecon next week, but wanted to re-bump this