runc
Failure to run user namespaced container
Description
Unable to run user-namespaced container.
My setup is:
containerd v1.7.0 (which supports user namespaces)
ctr version
Client:
Version: v1.7.0
Revision: 1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
Go version: go1.20.2
Server:
Version: v1.7.0
Revision: 1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
UUID: 514e04fd-642e-4f20-a0bd-99b3bbdb3c65
runc version 1.1.4
runc --version
runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d1
spec: 1.0.2-dev
go: go1.17.10
libseccomp: 2.5.4
Here are the command-line arguments being passed to runc by containerd:
--root /run/containerd/runc/k8s.io --log /run/containerd/io.containerd.runtime.v2.task/k8s.io/0fb39cf81c4a554b1e7b0ce148a705cb8bdc3624b0ce8681541b73da87290b4b/log.json --log-format json --systemd-cgroup create --bundle /run/containerd/io.containerd.runtime.v2.task/k8s.io/0fb39cf81c4a554b1e7b0ce148a705cb8bdc3624b0ce8681541b73da87290b4b --pid-file /run/containerd/io.containerd.runtime.v2.task/k8s.io/0fb39cf81c4a554b1e7b0ce148a705cb8bdc3624b0ce8681541b73da87290b4b/init.pid 0fb39cf81c4a554b1e7b0ce148a705cb8bdc3624b0ce8681541b73da87290b4b
Here is the config.json
{"ociVersion":"1.1.0-rc.1","process":{"user":{"uid":65535,"gid":65535,"additionalGids":[65535]},"args":["/pause"],"env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],"cwd":"/","capabilities":{"bounding":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"],"effective":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"],"permitted":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"]},"noNewPrivileges":true,"oomScoreAdj":-998},"root":{"path":"rootfs","readonly":true},"hostname":"user-namespace-vinaygo","mounts":[{"destination":"/proc","type":"proc","source":"proc","options":["nosuid","noexec","nodev"]},{"destination":"/dev","type":"tmpfs","source":"tmpfs","options":["nosuid","strictatime","mode=755","size=65536k"]},{"destination":"/dev/pts","type":"devpts","source":"devpts","options":["nosuid","noexec","newinstance","ptmxmode=0666","mode=0620","gid=5"]},{"destination":"/dev/mqueue","type":"mqueue","source":"mqueue","options":["nosuid","noexec","nodev"]},{"destination":"/sys","type":"sysfs","source":"sysfs","options":["nosuid","noexec","nodev","ro"]},{"destination":"/dev/shm","type":"bind","source":"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a/shm","options":["rbind","ro","nosuid","nodev","noexec"]},{"destination":"/etc/resolv.conf","type":"bind","source":"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a/resolv.conf","options":["rbind","ro"]}],"annotations":{"io.kubernetes.cri.container-type":"sandbox","io.kubernetes.cri.sandbox-cpu-period":"100000","io.kubernetes.cri.sandbox-cpu-quota":"0","io.kubernetes.cri.sandbox-cpu-shares":"2","io.kubernetes.cri.sandbox-id":"3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a","io.kubernetes.cri.sandbox-log-directory":"/var/log/pods/default_user-namespace-vinaygo_80bd4a09-b19b-4d81-800b-6b5d605b1558","io.kubernetes.cri.sandbox-memory":"0","io.kubernetes.cri.sandbox-name":"user-namespace-vinaygo","io.kubernetes.cri.sandbox-namespace":"default","io.kubernetes.cri.sandbox-uid":"80bd4a09-b19b-4d81-800b-6b5d605b1558"},"linux":{"uidMappings":[{"containerID":0,"hostID":2515861504,"size":65536}],"gidMappings":[{"containerID":0,"hostID":2515861504,"size":65536}],"sysctl":{"net.core.somaxconn":"1024","net.ipv4.conf.all.accept_redirects":"0","net.ipv4.conf.all.forwarding":"1","net.ipv4.conf.all.route_localnet":"1","net.ipv4.conf.default.forwarding":"1","net.ipv4.ip_forward":"1","net.ipv4.tcp_fin_timeout":"60","net.ipv4.tcp_keepalive_intvl":"60","net.ipv4.tcp_keepalive_probes":"5","net.ipv4.tcp_keepalive_time":"300","net.ipv4.tcp_rmem":"4096 87380 6291456","net.ipv4.tcp_syn_retries":"6","net.ipv4.tcp_tw_reuse":"0","net.ipv4.tcp_wmem":"4096 16384 
4194304","net.ipv4.udp_rmem_min":"4096","net.ipv4.udp_wmem_min":"4096","net.ipv6.conf.all.disable_ipv6":"1","net.ipv6.conf.default.accept_ra":"0","net.ipv6.conf.default.disable_ipv6":"1","net.netfilter.nf_conntrack_generic_timeout":"600","net.netfilter.nf_conntrack_tcp_be_liberal":"1","net.netfilter.nf_conntrack_tcp_timeout_close_wait":"3600","net.netfilter.nf_conntrack_tcp_timeout_established":"86400"},"resources":{"devices":[{"allow":false,"access":"rwm"}],"cpu":{"shares":2}},"cgroupsPath":"kubepods-besteffort-pod80bd4a09_b19b_4d81_800b_6b5d605b1558.slice:cri-containerd:3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a","namespaces":[{"type":"pid"},{"type":"ipc"},{"type":"uts"},{"type":"mount"},{"type":"network"},{"type":"user"}],"seccomp":{"defaultAction":"SCMP_ACT_ERRNO","architectures":["SCMP_ARCH_X86_64","SCMP_ARCH_X86","SCMP_ARCH_X32"],"syscalls":[{"names":["accept","accept4","access","adjtimex","alarm","bind","brk","capget","capset","chdir","chmod","chown","chown32","clock_adjtime","clock_adjtime64","clock_getres","clock_getres_time64","clock_gettime","clock_gettime64","clock_nanosleep","clock_nanosleep_time64","close","close_range","connect","copy_file_range","creat","dup","dup2","dup3","epoll_create","epoll_create1","epoll_ctl","epoll_ctl_old","epoll_pwait","epoll_pwait2","epoll_wait","epoll_wait_old","eventfd","eventfd2","execve","execveat","exit","exit_group","faccessat","faccessat2","fadvise64","fadvise64_64","fallocate","fanotify_mark","fchdir","fchmod","fchmodat","fchown","fchown32","fchownat","fcntl","fcntl64","fdatasync","fgetxattr","flistxattr","flock","fork","fremovexattr","fsetxattr","fstat","fstat64","fstatat64","fstatfs","fstatfs64","fsync","ftruncate","ftruncate64","futex","futex_time64","futex_waitv","futimesat","getcpu","getcwd","getdents","getdents64","getegid","getegid32","geteuid","geteuid32","getgid","getgid32","getgroups","getgroups32","getitimer","getpeername","getpgid","getpgrp","getpid","getppid","getpriority","getrandom","getresgid","getresgid32","getresuid","getresuid32","getrlimit","get_robust_list","getrusage","getsid","getsockname","getsockopt","get_thread_area","gettid","gettimeofday","getuid","getuid32","getxattr","inotify_add_watch","inotify_init","inotify_init1","inotify_rm_watch","io_cancel","ioctl","io_destroy","io_getevents","io_pgetevents","io_pgetevents_time64","ioprio_get","ioprio_set","io_setup","io_submit","io_uring_enter","io_uring_register","io_uring_setup","ipc","kill","landlock_add_rule","landlock_create_ruleset","landlock_restrict_self","lchown","lchown32","lgetxattr","link","linkat","listen","listxattr","llistxattr","_llseek","lremovexattr","lseek","lsetxattr","lstat","lstat64","madvise","membarrier","memfd_create","memfd_secret","mincore","mkdir","mkdirat","mknod","mknodat","mlock","mlock2","mlockall","mmap","mmap2","mprotect","mq_getsetattr","mq_notify","mq_open","mq_timedreceive","mq_timedreceive_time64","mq_timedsend","mq_timedsend_time64","mq_unlink","mremap","msgctl","msgget","msgrcv","msgsnd","msync","munlock","munlockall","munmap","nanosleep","newfstatat","_newselect","open","openat","openat2","pause","pidfd_open","pidfd_send_signal","pipe","pipe2","pkey_alloc","pkey_free","pkey_mprotect","poll","ppoll","ppoll_time64","prctl","pread64","preadv","preadv2","prlimit64","process_mrelease","pselect6","pselect6_time64","pwrite64","pwritev","pwritev2","read","readahead","readlink","readlinkat","readv","recv","recvfrom","recvmmsg","recvmmsg_time64","recvmsg","remap_file_pages","removexattr","rename","renameat","renameat2
","restart_syscall","rmdir","rseq","rt_sigaction","rt_sigpending","rt_sigprocmask","rt_sigqueueinfo","rt_sigreturn","rt_sigsuspend","rt_sigtimedwait","rt_sigtimedwait_time64","rt_tgsigqueueinfo","sched_getaffinity","sched_getattr","sched_getparam","sched_get_priority_max","sched_get_priority_min","sched_getscheduler","sched_rr_get_interval","sched_rr_get_interval_time64","sched_setaffinity","sched_setattr","sched_setparam","sched_setscheduler","sched_yield","seccomp","select","semctl","semget","semop","semtimedop","semtimedop_time64","send","sendfile","sendfile64","sendmmsg","sendmsg","sendto","setfsgid","setfsgid32","setfsuid","setfsuid32","setgid","setgid32","setgroups","setgroups32","setitimer","setpgid","setpriority","setregid","setregid32","setresgid","setresgid32","setresuid","setresuid32","setreuid","setreuid32","setrlimit","set_robust_list","setsid","setsockopt","set_thread_area","set_tid_address","setuid","setuid32","setxattr","shmat","shmctl","shmdt","shmget","shutdown","sigaltstack","signalfd","signalfd4","sigprocmask","sigreturn","socketcall","socketpair","splice","stat","stat64","statfs","statfs64","statx","symlink","symlinkat","sync","sync_file_range","syncfs","sysinfo","tee","tgkill","time","timer_create","timer_delete","timer_getoverrun","timer_gettime","timer_gettime64","timer_settime","timer_settime64","timerfd_create","timerfd_gettime","timerfd_gettime64","timerfd_settime","timerfd_settime64","times","tkill","truncate","truncate64","ugetrlimit","umask","uname","unlink","unlinkat","utime","utimensat","utimensat_time64","utimes","vfork","vmsplice","wait4","waitid","waitpid","write","writev"],"action":"SCMP_ACT_ALLOW"},{"names":["socket"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":40,"op":"SCMP_CMP_NE"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":0,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":8,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":131072,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":131080,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":4294967295,"op":"SCMP_CMP_EQ"}]},{"names":["process_vm_readv","process_vm_writev","ptrace"],"action":"SCMP_ACT_ALLOW"},{"names":["arch_prctl","modify_ldt"],"action":"SCMP_ACT_ALLOW"},{"names":["chroot"],"action":"SCMP_ACT_ALLOW"},{"names":["clone"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":2114060288,"op":"SCMP_CMP_MASKED_EQ"}]},{"names":["clone3"],"action":"SCMP_ACT_ERRNO","errnoRet":38}]},"maskedPaths":["/proc/acpi","/proc/asound","/proc/kcore","/proc/keys","/proc/latency_stats","/proc/timer_list","/proc/timer_stats","/proc/sched_debug","/sys/firmware","/proc/scsi"],"readonlyPaths":["/proc/bus","/proc/fs","/proc/irq","/proc/sys","/proc/sysrq-trigger"]}}
Steps to reproduce the issue
With containerd 1.7.0 and runc 1.1.4 installed, run the following:
Create a container with the config.json shown above.
{"ociVersion":"1.1.0-rc.1","process":{"user":{"uid":65535,"gid":65535,"additionalGids":[65535]},"args":["/pause"],"env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],"cwd":"/","capabilities":{"bounding":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"],"effective":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"],"permitted":["CAP_CHOWN","CAP_DAC_OVERRIDE","CAP_FSETID","CAP_FOWNER","CAP_MKNOD","CAP_NET_RAW","CAP_SETGID","CAP_SETUID","CAP_SETFCAP","CAP_SETPCAP","CAP_NET_BIND_SERVICE","CAP_SYS_CHROOT","CAP_KILL","CAP_AUDIT_WRITE"]},"noNewPrivileges":true,"oomScoreAdj":-998},"root":{"path":"rootfs","readonly":true},"hostname":"user-namespace-vinaygo","mounts":[{"destination":"/proc","type":"proc","source":"proc","options":["nosuid","noexec","nodev"]},{"destination":"/dev","type":"tmpfs","source":"tmpfs","options":["nosuid","strictatime","mode=755","size=65536k"]},{"destination":"/dev/pts","type":"devpts","source":"devpts","options":["nosuid","noexec","newinstance","ptmxmode=0666","mode=0620","gid=5"]},{"destination":"/dev/mqueue","type":"mqueue","source":"mqueue","options":["nosuid","noexec","nodev"]},{"destination":"/sys","type":"sysfs","source":"sysfs","options":["nosuid","noexec","nodev","ro"]},{"destination":"/dev/shm","type":"bind","source":"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a/shm","options":["rbind","ro","nosuid","nodev","noexec"]},{"destination":"/etc/resolv.conf","type":"bind","source":"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a/resolv.conf","options":["rbind","ro"]}],"annotations":{"io.kubernetes.cri.container-type":"sandbox","io.kubernetes.cri.sandbox-cpu-period":"100000","io.kubernetes.cri.sandbox-cpu-quota":"0","io.kubernetes.cri.sandbox-cpu-shares":"2","io.kubernetes.cri.sandbox-id":"3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a","io.kubernetes.cri.sandbox-log-directory":"/var/log/pods/default_user-namespace-vinaygo_80bd4a09-b19b-4d81-800b-6b5d605b1558","io.kubernetes.cri.sandbox-memory":"0","io.kubernetes.cri.sandbox-name":"user-namespace-vinaygo","io.kubernetes.cri.sandbox-namespace":"default","io.kubernetes.cri.sandbox-uid":"80bd4a09-b19b-4d81-800b-6b5d605b1558"},"linux":{"uidMappings":[{"containerID":0,"hostID":2515861504,"size":65536}],"gidMappings":[{"containerID":0,"hostID":2515861504,"size":65536}],"sysctl":{"net.core.somaxconn":"1024","net.ipv4.conf.all.accept_redirects":"0","net.ipv4.conf.all.forwarding":"1","net.ipv4.conf.all.route_localnet":"1","net.ipv4.conf.default.forwarding":"1","net.ipv4.ip_forward":"1","net.ipv4.tcp_fin_timeout":"60","net.ipv4.tcp_keepalive_intvl":"60","net.ipv4.tcp_keepalive_probes":"5","net.ipv4.tcp_keepalive_time":"300","net.ipv4.tcp_rmem":"4096 87380 6291456","net.ipv4.tcp_syn_retries":"6","net.ipv4.tcp_tw_reuse":"0","net.ipv4.tcp_wmem":"4096 16384 
4194304","net.ipv4.udp_rmem_min":"4096","net.ipv4.udp_wmem_min":"4096","net.ipv6.conf.all.disable_ipv6":"1","net.ipv6.conf.default.accept_ra":"0","net.ipv6.conf.default.disable_ipv6":"1","net.netfilter.nf_conntrack_generic_timeout":"600","net.netfilter.nf_conntrack_tcp_be_liberal":"1","net.netfilter.nf_conntrack_tcp_timeout_close_wait":"3600","net.netfilter.nf_conntrack_tcp_timeout_established":"86400"},"resources":{"devices":[{"allow":false,"access":"rwm"}],"cpu":{"shares":2}},"cgroupsPath":"kubepods-besteffort-pod80bd4a09_b19b_4d81_800b_6b5d605b1558.slice:cri-containerd:3b291dcc8869a5f32cade6834c46bd90a1217c298c0dcc8a5115393ce7c6f40a","namespaces":[{"type":"pid"},{"type":"ipc"},{"type":"uts"},{"type":"mount"},{"type":"network"},{"type":"user"}],"seccomp":{"defaultAction":"SCMP_ACT_ERRNO","architectures":["SCMP_ARCH_X86_64","SCMP_ARCH_X86","SCMP_ARCH_X32"],"syscalls":[{"names":["accept","accept4","access","adjtimex","alarm","bind","brk","capget","capset","chdir","chmod","chown","chown32","clock_adjtime","clock_adjtime64","clock_getres","clock_getres_time64","clock_gettime","clock_gettime64","clock_nanosleep","clock_nanosleep_time64","close","close_range","connect","copy_file_range","creat","dup","dup2","dup3","epoll_create","epoll_create1","epoll_ctl","epoll_ctl_old","epoll_pwait","epoll_pwait2","epoll_wait","epoll_wait_old","eventfd","eventfd2","execve","execveat","exit","exit_group","faccessat","faccessat2","fadvise64","fadvise64_64","fallocate","fanotify_mark","fchdir","fchmod","fchmodat","fchown","fchown32","fchownat","fcntl","fcntl64","fdatasync","fgetxattr","flistxattr","flock","fork","fremovexattr","fsetxattr","fstat","fstat64","fstatat64","fstatfs","fstatfs64","fsync","ftruncate","ftruncate64","futex","futex_time64","futex_waitv","futimesat","getcpu","getcwd","getdents","getdents64","getegid","getegid32","geteuid","geteuid32","getgid","getgid32","getgroups","getgroups32","getitimer","getpeername","getpgid","getpgrp","getpid","getppid","getpriority","getrandom","getresgid","getresgid32","getresuid","getresuid32","getrlimit","get_robust_list","getrusage","getsid","getsockname","getsockopt","get_thread_area","gettid","gettimeofday","getuid","getuid32","getxattr","inotify_add_watch","inotify_init","inotify_init1","inotify_rm_watch","io_cancel","ioctl","io_destroy","io_getevents","io_pgetevents","io_pgetevents_time64","ioprio_get","ioprio_set","io_setup","io_submit","io_uring_enter","io_uring_register","io_uring_setup","ipc","kill","landlock_add_rule","landlock_create_ruleset","landlock_restrict_self","lchown","lchown32","lgetxattr","link","linkat","listen","listxattr","llistxattr","_llseek","lremovexattr","lseek","lsetxattr","lstat","lstat64","madvise","membarrier","memfd_create","memfd_secret","mincore","mkdir","mkdirat","mknod","mknodat","mlock","mlock2","mlockall","mmap","mmap2","mprotect","mq_getsetattr","mq_notify","mq_open","mq_timedreceive","mq_timedreceive_time64","mq_timedsend","mq_timedsend_time64","mq_unlink","mremap","msgctl","msgget","msgrcv","msgsnd","msync","munlock","munlockall","munmap","nanosleep","newfstatat","_newselect","open","openat","openat2","pause","pidfd_open","pidfd_send_signal","pipe","pipe2","pkey_alloc","pkey_free","pkey_mprotect","poll","ppoll","ppoll_time64","prctl","pread64","preadv","preadv2","prlimit64","process_mrelease","pselect6","pselect6_time64","pwrite64","pwritev","pwritev2","read","readahead","readlink","readlinkat","readv","recv","recvfrom","recvmmsg","recvmmsg_time64","recvmsg","remap_file_pages","removexattr","rename","renameat","renameat2
","restart_syscall","rmdir","rseq","rt_sigaction","rt_sigpending","rt_sigprocmask","rt_sigqueueinfo","rt_sigreturn","rt_sigsuspend","rt_sigtimedwait","rt_sigtimedwait_time64","rt_tgsigqueueinfo","sched_getaffinity","sched_getattr","sched_getparam","sched_get_priority_max","sched_get_priority_min","sched_getscheduler","sched_rr_get_interval","sched_rr_get_interval_time64","sched_setaffinity","sched_setattr","sched_setparam","sched_setscheduler","sched_yield","seccomp","select","semctl","semget","semop","semtimedop","semtimedop_time64","send","sendfile","sendfile64","sendmmsg","sendmsg","sendto","setfsgid","setfsgid32","setfsuid","setfsuid32","setgid","setgid32","setgroups","setgroups32","setitimer","setpgid","setpriority","setregid","setregid32","setresgid","setresgid32","setresuid","setresuid32","setreuid","setreuid32","setrlimit","set_robust_list","setsid","setsockopt","set_thread_area","set_tid_address","setuid","setuid32","setxattr","shmat","shmctl","shmdt","shmget","shutdown","sigaltstack","signalfd","signalfd4","sigprocmask","sigreturn","socketcall","socketpair","splice","stat","stat64","statfs","statfs64","statx","symlink","symlinkat","sync","sync_file_range","syncfs","sysinfo","tee","tgkill","time","timer_create","timer_delete","timer_getoverrun","timer_gettime","timer_gettime64","timer_settime","timer_settime64","timerfd_create","timerfd_gettime","timerfd_gettime64","timerfd_settime","timerfd_settime64","times","tkill","truncate","truncate64","ugetrlimit","umask","uname","unlink","unlinkat","utime","utimensat","utimensat_time64","utimes","vfork","vmsplice","wait4","waitid","waitpid","write","writev"],"action":"SCMP_ACT_ALLOW"},{"names":["socket"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":40,"op":"SCMP_CMP_NE"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":0,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":8,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":131072,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":131080,"op":"SCMP_CMP_EQ"}]},{"names":["personality"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":4294967295,"op":"SCMP_CMP_EQ"}]},{"names":["process_vm_readv","process_vm_writev","ptrace"],"action":"SCMP_ACT_ALLOW"},{"names":["arch_prctl","modify_ldt"],"action":"SCMP_ACT_ALLOW"},{"names":["chroot"],"action":"SCMP_ACT_ALLOW"},{"names":["clone"],"action":"SCMP_ACT_ALLOW","args":[{"index":0,"value":2114060288,"op":"SCMP_CMP_MASKED_EQ"}]},{"names":["clone3"],"action":"SCMP_ACT_ERRNO","errnoRet":38}]},"maskedPaths":["/proc/acpi","/proc/asound","/proc/kcore","/proc/keys","/proc/latency_stats","/proc/timer_list","/proc/timer_stats","/proc/sched_debug","/sys/firmware","/proc/scsi"],"readonlyPaths":["/proc/bus","/proc/fs","/proc/irq","/proc/sys","/proc/sysrq-trigger"]}}
Describe the results you received and expected
I get the following error:
"runc create failed: unable to start container process: error during container init: error mounting \"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/6065696b009d70452b2b229d976df91ff2b2e3bf75c6855bb91f4f1c42a4f1e9/resolv.conf\" to rootfs at \"/etc/resolv.conf\": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/8), flags: 0x5021: operation not permitted: unknown" pod="default/user-namespace-vinaygo"
Expected:
No error. Non-user-namespaced containers are able to run.
What version of runc are you using?
runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d1
spec: 1.0.2-dev
go: go1.17.10
libseccomp: 2.5.4
Host OS information
NAME="Container-Optimized OS" ID=cos PRETTY_NAME="Container-Optimized OS from Google" HOME_URL="https://cloud.google.com/container-optimized-os/docs" BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us" GOOGLE_METRICS_PRODUCT_ID=26 KERNEL_COMMIT_ID=44456f0e9d2cd7a9616fb0d05bc4020237839a5a GOOGLE_CRASH_ID=Lakitu VERSION=101 VERSION_ID=101 BUILD_ID=17162.40.56
Host kernel information
Linux
/cc @rata
Thanks! I can't repro with that, though :(
The issue really seems like the same symptom that PR https://github.com/opencontainers/runc/pull/3511 fixed, but that fix is in 1.1.4 and you are running 1.1.4. So maybe something with a similar symptom is still lurking there.
What I've tried so far, without managing to repro:
- That config.json, adding a pause binary to a busybox rootfs (properly chowned to the hostID in the userns mapping) and removing the /dev/shm and resolv.conf bind mounts, as their sources don't exist on my computer; a rough sketch of this setup is after this list. This works fine when running
runc run --debug --systemd-cgroup mycontainer
- I've tried keeping those mounts, but using /dev/shm and /etc/resolv.conf as sources and running runc as before; this also works fine
- I've created /mnt/test/ where test doesn't have rx permissions for others, and copied the resolv.conf file there to use as the source. This works fine too
- I've also chowned /mnt/test to user 1:1 (so it is not owned by root, which runc runs as), but this also worked fine
- I've tried changing one directory in the path to the rootfs to not have rx permissions for others (sudo chmod o-rx /home/), but this doesn't fail there: the mount of the rootfs itself fails, so it fails before reaching the resolv.conf mount.
- I've started a k8s cluster with containerd 1.7 and runc 1.1.4 and I still don't see the issue when creating a pod with userns.
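For reference, a rough sketch of that bundle preparation (hypothetical paths; the hostID is the one from the uidMappings in the config.json above, and it assumes docker is available to export a busybox rootfs):
mkdir -p bundle/rootfs && cd bundle
docker export "$(docker create busybox)" | tar -C rootfs -xf -
cp /path/to/pause rootfs/pause              # hypothetical path to a static pause binary
sudo chown -R 2515861504:2515861504 rootfs  # hostID from the uidMappings
# with the edited config.json placed in this directory:
sudo runc run --debug --systemd-cgroup mycontainer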
So, I can't really repro with that config. It would be great if you could:
- Find a way to repro this on Debian or some other distro that we have easy access to. This would help a lot
- cd to the dir where the config.json file is and run:
sudo runc run --debug --systemd-cgroup mycontainer
and paste here what it prints (does it fail or not?)
- Are you using cgroups v1 or cgroups v2? Can you please try with both and explain how you verified which one you are actually using? (a quick check is sketched after this list)
- If the runc command in 2 fails, can you also run
strace -f -s 512 <command from step 2>
?
- Can you copy the runc git repo to that host, install bats, and run:
sudo bats -t tests/integration/userns.bats
- Can you paste the output of running
ls -ld
for all paths in the resolv.conf mount? Like: sudo ls -dl / /var /var/lib/ /var/lib/containerd/ /var/lib/containerd/io.containerd.grpc.v1.cri/ ...
- What is your container runtime? Is it containerd or docker? If it is containerd and you are starting it with a systemd service, can you add
SupplementaryGroups=0
to the systemd service (a drop-in sketch is after this list), restart containerd and see if the problem still happens? This is due to this bug https://github.com/opencontainers/runc/issues/2484, which was fixed in https://github.com/opencontainers/runc/commit/9c444070ec7bb83995dbc0185da68284da71c554 but introduced the regression that was fixed in 1.1.4. If you run with that, and that is causing the issue, it should work around it. I doubt it will help, but the more information we have, the easier it is to debug (especially when we can't repro).
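A quick way to check which cgroup version a host uses (a common check, not specific to this setup):
stat -fc %T /sys/fs/cgroup   # prints cgroup2fs on cgroups v2, tmpfs on cgroups v1
And a sketch of the SupplementaryGroups=0 change as a systemd drop-in, assuming containerd runs as a regular containerd.service unit (the drop-in path is just the usual location, adjust as needed):
sudo mkdir -p /etc/systemd/system/containerd.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/containerd.service.d/10-supplementary-groups.conf
[Service]
SupplementaryGroups=0
EOF
sudo systemctl daemon-reload && sudo systemctl restart containerd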
These are the things that come to mind that might help us debug this. But it would be great if @kolyshkin can have a look.
Thanks for the detailed steps @rata! I'll run through them and report back.
The container runtime is containerd 1.7.0
I still can't get things to run on an Ubuntu node.
cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
uname -a
Linux 5.15.0-1024-gke #29-Ubuntu SMP Fri Dec 16 06:28:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Here is the error:
"Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: open /proc/sys/net/core/somaxconn: no such file or directory: unknown" pod="default/user-namespace-vinaygo"
Then I cd into the directory with config.json
186816# ls
address config.json log options.json rootfs runtime shim-binary-path work
root@gke-host-user-vinaygo-default-pool-b3011f78-g2jd:/tmp/rata-debug-k8s/186816# /home/kubernetes/bin/runc/runc --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[187238]: => nsexec container setup
DEBU[0000] nsexec[187238]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[187238]: ~> nsexec stage-0
DEBU[0000] nsexec-0[187238]: spawn stage-1
DEBU[0000] nsexec-0[187238]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[187240]: ~> nsexec stage-1
DEBU[0000] nsexec-1[187240]: unshare user namespace
DEBU[0000] nsexec-1[187240]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[187238]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[187238]: update /proc/187240/uid_map to '0 3091988480 65536
'
DEBU[0000] nsexec-0[187238]: update /proc/187240/gid_map to '0 3091988480 65536
'
DEBU[0000] nsexec-1[187240]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[187240]: unshare remaining namespace (except cgroupns)
DEBU[0000] nsexec-1[187240]: request stage-0 to send mount sources
DEBU[0000] nsexec-0[187238]: stage-1 requested to open mount sources
FATA[0000] nsexec-0[187238]: failed to open mount source /run/containerd/io.containerd.grpc.v1.cri/sandboxes/77e8d26e550636d55b510e943ccc24bda2d9474a3c7ad58c5acdeead4e9f15f8/shm: No such file or directory
FATA[0000] nsexec-1[187240]: failed to receive fd from unix socket 8: Invalid argument
ERRO[0000]utils.go:62 main.fatalWithCode() runc run failed: unable to start container process: can't get final child's PID from pipe: EOF
Then I cloned the runc git repo and ran bats:
bats -t tests/integration/userns.bats
1..4
ok 1 userns with simple mount
ok 2 userns with 2 inaccessible mounts
ok 3 userns with inaccessible mount + exec
not ok 4 userns with bind mount before a cgroupfs mount
# (from function `requires' in file tests/integration/helpers.bash, line 488,
# in test file tests/integration/userns.bats, line 72)
# `requires cgroups_v1' failed
# runc spec (status=0):
#
# /usr/lib/bats-core/test_functions.bash: line 57: BATS_TEARDOWN_STARTED: unbound variable
@vinayakankugoyal things should run on Ubuntu; it is probably some config or binary missing on your side.
The error you pasted is from containerd, and that is not what we want. The runc output you pasted is not useful either; see that it says:
failed to open mount source /run/containerd/io.containerd.grpc.v1.cri/sandboxes/77e8d26e550636d55b510e943ccc24bda2d9474a3c7ad58c5acdeead4e9f15f8/shm: No such file or directory
That file doesn't exist anymore and you are not seeing the error you saw before. You will need to repro this when the file exists, or copy it and adjust the config.json (those two bind mounts, the shm and the resolv.conf).
I'm not sure the bats output is useful either; it seems to throw an error due to some bats variable not being set. Maybe it is something with your bats installation?
Also, when you have the time, please go through all the things I asked and answer them all :)
OK, I was able to get a repro by pointing the config.json to another running container's sandbox ID. Now I get the same error as I was getting from the kubelet on Ubuntu.
root@gke-host-user-vinaygo-default-pool-b3011f78-g2jd:/tmp/rata-debug-k8s/1227175# /home/kubernetes/bin/runc.amd64 --debug --systemd-cgroup run blah
DEBU[0000] nsexec[1238579]: => nsexec container setup
DEBU[0000] nsexec[1238579]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[1238579]: ~> nsexec stage-0
DEBU[0000] nsexec-0[1238579]: spawn stage-1
DEBU[0000] nsexec-0[1238579]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[1238583]: ~> nsexec stage-1
DEBU[0000] nsexec-1[1238583]: unshare user namespace
DEBU[0000] nsexec-1[1238583]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[1238579]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[1238579]: update /proc/1238583/uid_map to '0 3091988480 65536
'
DEBU[0000] nsexec-0[1238579]: update /proc/1238583/gid_map to '0 3091988480 65536
'
DEBU[0000] nsexec-1[1238583]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[1238583]: unshare remaining namespace (except cgroupns)
DEBU[0000] nsexec-1[1238583]: spawn stage-2
DEBU[0000] nsexec-1[1238583]: request stage-0 to forward stage-2 pid (1238584)
DEBU[0000] nsexec-0[1238579]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[1238579]: forward stage-1 (1238583) and stage-2 (1238584) pids to runc
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-1[1238583]: signal completion to stage-0
DEBU[0000] nsexec-0[1238579]: stage-1 complete
DEBU[0000] nsexec-0[1238579]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[1238579]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[1238579]: signalling stage-2 to run
DEBU[0000] nsexec-1[1238583]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-0[1238579]: stage-2 complete
DEBU[0000] nsexec-0[1238579]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[1238579]: <~ nsexec stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] child process in init()
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: error during container init: open /proc/sys/net/core/somaxconn: no such file or directory
I noticed the following sysctls:
"sysctl": {
"net.core.somaxconn": "1024",
"net.ipv4.conf.all.accept_redirects": "0",
"net.ipv4.conf.all.forwarding": "1",
"net.ipv4.conf.all.route_localnet": "1",
"net.ipv4.conf.default.forwarding": "1",
"net.ipv4.ip_forward": "1",
"net.ipv4.tcp_fin_timeout": "60",
"net.ipv4.tcp_keepalive_intvl": "60",
"net.ipv4.tcp_keepalive_probes": "5",
"net.ipv4.tcp_keepalive_time": "300",
"net.ipv4.tcp_rmem": "4096 87380 6291456",
"net.ipv4.tcp_syn_retries": "6",
"net.ipv4.tcp_tw_reuse": "0",
"net.ipv4.tcp_wmem": "4096 16384 4194304",
"net.ipv4.udp_rmem_min": "4096",
"net.ipv4.udp_wmem_min": "4096",
"net.ipv6.conf.all.disable_ipv6": "1",
"net.ipv6.conf.default.accept_ra": "0",
"net.ipv6.conf.default.disable_ipv6": "1",
"net.netfilter.nf_conntrack_generic_timeout": "600",
"net.netfilter.nf_conntrack_tcp_be_liberal": "1",
"net.netfilter.nf_conntrack_tcp_timeout_close_wait": "3600",
"net.netfilter.nf_conntrack_tcp_timeout_established": "86400"
},
If I remove "net.core.somaxconn": "1024" from config.json, it works.
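For reference, a hypothetical jq one-liner to drop that key from config.json (assuming jq is installed):
jq 'del(.linux.sysctl["net.core.somaxconn"])' config.json > config.json.new && mv config.json.new config.json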
Here is what I see when I remove that sysctl
runc.amd64 --debug --systemd-cgroup run blah
DEBU[0000] nsexec[1271048]: => nsexec container setup
DEBU[0000] nsexec[1271048]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[1271048]: ~> nsexec stage-0
DEBU[0000] nsexec-0[1271048]: spawn stage-1
DEBU[0000] nsexec-0[1271048]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[1271052]: ~> nsexec stage-1
DEBU[0000] nsexec-1[1271052]: unshare user namespace
DEBU[0000] nsexec-1[1271052]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[1271052]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[1271048]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[1271048]: update /proc/1271052/uid_map to '0 3091988480 65536
'
DEBU[0000] nsexec-0[1271048]: update /proc/1271052/gid_map to '0 3091988480 65536
'
DEBU[0000] nsexec-1[1271052]: unshare remaining namespace (except cgroupns)
DEBU[0000] nsexec-1[1271052]: spawn stage-2
DEBU[0000] nsexec-1[1271052]: request stage-0 to forward stage-2 pid (1271053)
DEBU[0000] nsexec-0[1271048]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[1271048]: forward stage-1 (1271052) and stage-2 (1271053) pids to runc
DEBU[0000] nsexec-1[1271052]: signal completion to stage-0
DEBU[0000] nsexec-0[1271048]: stage-1 complete
DEBU[0000] nsexec-0[1271048]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[1271048]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[1271048]: signalling stage-2 to run
DEBU[0000] nsexec-1[1271052]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] nsexec-0[1271048]: stage-2 complete
DEBU[0000] nsexec-0[1271048]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[1271048]: <~ nsexec stage-0
DEBU[0000] child process in init()
DEBU[0000] seccomp: prepending -ENOSYS stub filter to user filter...
DEBU[0000] [ 0] ld [4]
DEBU[0000] [ 1] jeq #1073741827,2
DEBU[0000] [ 2] jeq #3221225534,4
DEBU[0000] [ 3] ja 10
DEBU[0000] [ 4] ld [0]
DEBU[0000] [ 5] jgt #449,7
DEBU[0000] [ 6] ja 7
DEBU[0000] [ 7] ld [0]
DEBU[0000] [ 8] jset #1073741824,1
DEBU[0000] [ 9] jgt #449,3,1
DEBU[0000] [ 10] jgt #1073742371,2
DEBU[0000] [ 11] ja 2
DEBU[0000] [ 12] ja 1
DEBU[0000] [ 13] ret #327718
DEBU[0000] [....] --- original filter ---
DEBU[0000] init: closing the pipe to signal completion
Non-user-namespaced containers also have that sysctl in their config.json and those come up fine.
@vinayakankugoyal Hmm, that is not the same error, is it? I mean, the one you mentioned here: open /proc/sys/net/core/somaxconn: no such file or directory.
That is not the same error you reported in the original issue description. Am I missing something?
Please try to create a repro for the error in the original issue description. As I mentioned, I tried several scenarios but couldn't repro; it seems you have a setup where you can, so let's see how it is that you hit it.
I am not able to repro on the Ubuntu-based nodes the same issue that I am seeing on the COS-based nodes.
On COS I am seeing:
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: error during container init: error mounting "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf" to rootfs at "/etc/resolv.conf": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/8), flags: 0x5021: operation not permitted
On Ubuntu nodes I am seeing:
open /proc/sys/net/core/somaxconn: no such file or directory
However, I made the same change on the COS node and now I get the original error. I also ran strace; here is the output filtered to anything containing resolv.conf.
cat strace.txt | grep resolv.conf
314004 read(3, "c\",\"nodev\",\"ro\"]},{\"destination\":\"/dev/shm\",\"type\":\"bind\",\"source\":\"/run/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/shm\",\"options\":[\"rbind\",\"ro\",\"nosuid\",\"nodev\",\"noexec\"]},{\"destination\":\"/etc/resolv.conf\",\"type\":\"bind\",\"source\":\"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf\",\"options\":[\"rbind\",\"ro\"]}],\"annotations\":{\"io.kubernetes.cri.container-type\":\""..., 2048) = 2048
314013 write(14, "l\1\0\0000\362\1\0\1\0\0\0\0\0\0\0\10\0\221j\0\0\2|\30\0\223j0 3975479296 65536\n\0\30\0\224j0 3975479296 65536\n\0\10\0\225j\1\0\0\0\t\0\226j-998\0\0\0\0\10\0\227j\0\0\0\0\10\1\232j\0\0\0\0\0/run/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/shm\0/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf\0\0", 364 <unfinished ...>
314010 read(3, "\10\0\221j\0\0\2|\30\0\223j0 3975479296 65536\n\0\30\0\224j0 3975479296 65536\n\0\10\0\225j\1\0\0\0\t\0\226j-998\0\0\0\0\10\0\227j\0\0\0\0\10\1\232j\0\0\0\0\0/run/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/shm\0/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf\0\0", 348) = 348
314010 openat(AT_FDCWD, "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf", O_RDONLY|O_CLOEXEC|O_PATH <unfinished ...>
314017 newfstatat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
314017 newfstatat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
314017 newfstatat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0755, st_size=0, ...}, 0) = 0
314017 newfstatat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0755, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
314017 openat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", O_RDONLY|O_CLOEXEC|O_PATH) = 8
314017 readlinkat(AT_FDCWD, "/proc/self/fd/8", "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", 128) = 56
314017 newfstatat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=77, ...}, AT_SYMLINK_NOFOLLOW) = 0
314017 openat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", O_RDONLY|O_CLOEXEC|O_PATH) = 8
314017 readlinkat(AT_FDCWD, "/proc/self/fd/8", "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", 128) = 56
314017 newfstatat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=77, ...}, AT_SYMLINK_NOFOLLOW) = 0
314017 openat(AT_FDCWD, "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", O_RDONLY|O_CLOEXEC|O_PATH) = 8
314017 readlinkat(AT_FDCWD, "/proc/self/fd/8", "/tmp/runc-wrapper-debug-k8s/74237/rootfs/etc/resolv.conf", 128) = 56
314017 write(3, "{\"message\":\"error mounting \\\"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf\\\" to rootfs at \\\"/etc/resolv.conf\\\": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/8), flags: 0x5021: operation not permitted\"}", 301 <unfinished ...>
314013 <... read resumed>"{\"message\":\"error mounting \\\"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf\\\" to rootfs at \\\"/etc/resolv.conf\\\": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/8), flags: 0x5021: operation not permitted\"}", 512) = 301
314012 write(2, "\33[31mERRO\33[0m[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: error during container init: error mounting \"/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf\" to rootfs at \"/etc/resolv.conf\": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/8), flags: 0x5021: operation not permitted \n", 418) = 418
This is how the directory permissions are set up:
# ls -ld / /var /var/lib/ /var/lib/containerd/ /var/lib/containerd/io.containerd.grpc.v1.cri/ /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/ /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/ /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf
drwxr-xr-x 20 root root 4096 Jan 21 10:59 /
drwxr-xr-x 9 root root 4096 Mar 20 16:45 /var
drwxr-xr-x 23 root root 4096 Mar 20 22:00 /var/lib/
drwxr-xr-x 12 root root 4096 Mar 20 20:18 /var/lib/containerd/
drwxr-xr-x 4 root root 4096 Mar 20 16:48 /var/lib/containerd/io.containerd.grpc.v1.cri/
drwxr-xr-x 16 root root 4096 Mar 20 22:08 /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/
drwxr-xr-x 2 root root 4096 Mar 20 16:48 /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/
-rw-r--r-- 1 root root 77 Mar 20 16:48 /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf
@vinayakankugoyal if you can repro with runc run, can you then do all the things I asked for, instead of just these? The full strace uploaded to some site would be nice; the grep doesn't seem to show useful information.
Also, you might not have reproduced the issue: the strace output you pasted does not show any mount calls, and we do know it is failing in the mount call (technically it could be failing in the prior checks, but that seems unlikely as that mount works without userns).
Your repro is confusing, but if your kernel is v5.15 and you set sysctls in a new user namespace, maybe you ran into this problem:
net: Don't export sysctls to unprivileged users
tbl[0].data = &net->core.sysctl_somaxconn;
/* Don't export any sysctls to unprivileged users */
if (net->user_ns != &init_user_ns) {
tbl[0].procname = NULL;
}
This code is in v5.15.
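If that is the cause, a rough way to check it outside of containers (just my assumption of how it would show up) is that the sysctl simply isn't registered in a netns owned by a non-init user namespace:
unshare -Urn cat /proc/sys/net/core/somaxconn
# on an affected kernel this should fail with: No such file or directory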
@rata - here is the strace output. strace.txt
@vinayakankugoyal cool. Can you paste the other things I asked here?
Sorry I'm repeating this in a loop, but you continue to answer with partial information and don't comment at all on the other things I asked. If you plan to do them later, please let me know, so I don't ask you repeatedly.
@vinayakankugoyal So, let's go one report at a time.
For Ubuntu 22.04: I can't reproduce. I've installed a VM with Ubuntu 22.04 and run apt dist-upgrade to have the latest versions. This is what I did because I'm very used to this development setup, no other special reason.
- apt install runc # this installs runc 1.1.4 from ubuntu repos. Make sure you are installing 1.1.4, you need the jammy-updates apt repo to install this, otherwise it will install 1.1.0
- Install containerd 1.7.0 from official binaries, as explained here: https://github.com/containerd/containerd/blob/main/docs/getting-started.md#option-1-from-the-official-binaries
- clone containerd repo and run this script, to configure the CNI: https://github.com/containerd/containerd/blob/main/script/setup/install-cni
- Created a containerd config file with: containerd config default > config.toml
- Changed the root, state and grpc.address variables so it doesn't conflict with the system containerd (in case you have one)
- In my case I used: root = "/var/lib/containerd-rata", state = "/run/containerd-rata", and for the grpc section, address = "/run/containerd-rata/containerd.sock"
- Note it is okay that these folders don't exist, they will be created by containerd later
- start containerd and leave it running: sudo containerd --config config.toml
- Install go 1.20.2 from official binaries (you can find instructions here: https://go.dev/doc/install)
- Clone kubernetes repo: https://github.com/kubernetes/kubernetes
- Switch to the k8s 1.26.1 tag (as you told me you were using): git checkout v1.26.1
- In the terminal where you will later start kubernetes, do (or adjust the var names to yours):
export CONTAINER_RUNTIME_ENDPOINT=/run/containerd-rata/containerd.sock
export IMAGE_SERVICE_ENDPOINT=/run/containerd-rata/containerd.sock
export CONTAINER_RUNTIME=remote
- apt install make
- start kubernetes: hack/local-up-cluster.sh
- The first time that command will ask you to run other commands to install etcd, etc. Follow those steps and run hack/local-up-cluster.sh again
This will start a k8s cluster and you should see in your terminal running containerd some activity (it will try to create the coredns pod).
In this setup, I've applied the pod you sent me via slack:
apiVersion: v1
kind: Pod
metadata:
  name: namespace-user-vinaygo
spec:
  hostUsers: false
  containers:
  - name: namespace-user-vinaygo
    image: debian
    command:
    - sleep
    - infinity
You can download the kubectl binary and, as the k8s start script printed, export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig and use kubectl, which will just work with that cluster.
The pod was created just fine and with userns:
$ ./kubectl exec -ti namespace-user-vinaygo -- bash
root@namespace-user-vinaygo:/# cat /proc/self/uid_map
0 1473642496 65536
Note that in this Ubuntu version there is a bug (it seems to be fixed in latest Debian and will probably come in Ubuntu 23.04) where hitting ctrl-c doesn't work to stop the k8s cluster. If you are using a VM via ssh, you can kill the processes by typing ~. (this kills the ssh session, and that usually sends a SIGHUP that kills the processes). But make sure all the processes have died before starting k8s again, as otherwise you will see weird errors (certificates are regenerated, so auth will fail and all the world collides in weird ways). Maybe something like this helps to kill them all, but verify, or just reboot the server if you are unsure :-D: kill $(ps faux | grep hack | grep bash | awk '{ print $2 }'); sudo pkill -f kube-scheduler; sudo rm /var/run/kubernetes/kubelet-rotated.kubeconfig
Due to that, you might need to run this after killing all the k8s processes but before starting them again: sudo chown $UID:$UID -R /var/run/kubernetes/ and sudo chown $UID:$UID /tmp/kube-apiserver-audit.log. I've submitted fixes for this in k8s already, but they are not present in 1.26 :)
If you do this, does it work for you? Can you try to find what the difference is between this and the Ubuntu node you are using? And would you care to share how it is installed (all components, the OS, the container runtime, CNI, runc, etc.)?
Regarding the bats error you pasted here:
This is because you installed an old version of bats. If you install latest from source, it will not throw that error: https://bats-core.readthedocs.io/en/stable/installation.html#any-os-installing-bats-from-source
The tests run fine in ubuntu 22.04 with a new version of bats (expected, as that is tested in the CI too IIRC).
Regarding the config.json you sent me on slack: config-slack.txt. This is a config.json that was created on COS when you hit the issue in the k8s cluster. With that config.json I could repro the issue in Ubuntu 22.04. But it seems like a red herring.
First, I deleted the somaxconn sysctl line, which isn't added on Ubuntu when I start my k8s cluster here (I guess COS has some specific config to add those?).
When the mount is pointing to the host /etc/resolv.conf, in ubuntu 22.04 (at least in this default config on an Azure VM) that file is a symlink. If you copy the file (cp /etc/resolv.conf .) and then point to this new file, the container starts fine.
It also starts fine if you keep the host /etc/resolv.conf and change the mount options to add "nosuid", "nodev", "noexec".
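A hypothetical jq one-liner to make that change to the resolv.conf mount entry in config.json (assuming jq is available):
jq '(.mounts[] | select(.destination == "/etc/resolv.conf") | .options) += ["nosuid","nodev","noexec"]' config.json > config.json.new && mv config.json.new config.json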
Do you have a config.json generated in your k8s setup when it fails on Ubuntu?
Also, can you try on COS whether adding those options to the mount makes it work?
I'm out of ideas on how I can reproduce this. Besides the missing things that you will send when you can, also:
- Can you confirm that, if you install a k8s cluster with runc 1.1.4, containerd 1.7.0 and k8s 1.25 or 1.26 (NOT 1.27) on Ubuntu 22.04, creating a pod with user namespaces like the one here works?
- Can you confirm that the Ubuntu issue is only when running runc manually with the config.json generated on COS?
Regarding COS, whenever you send the other info we'll have more insight into what might be happening. My gut feeling now is that it might be configured to use some options in the mount that don't work with userns, although I'm not sure how they manage to add those options to the config.json.
But let me know if in your setup with Ubuntu this works fine (is this a GKE cluster?).
Thanks for all the details!
Ubuntu
Like I mentioned in chat, Ubuntu works if I remove the somaxconn sysctl. It seems like in GKE the kubelet was adding that to all pods, and I turned that "feature" off. Now the pod is able to come up just fine.
I'll have to follow up on whether somaxconn is intended to work for user-namespaced pods or not.
COS
For COS, when I update the options for the /etc/resolv.conf mount with "nosuid", "nodev", "noexec" in the config.json file I shared earlier, it is now able to mount /etc/resolv.conf, but I get a new failure:
runc --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[14931]: => nsexec container setup
DEBU[0000] nsexec[14931]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[14931]: ~> nsexec stage-0
DEBU[0000] nsexec-0[14931]: spawn stage-1
DEBU[0000] nsexec-0[14931]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[14935]: ~> nsexec stage-1
DEBU[0000] nsexec-1[14935]: unshare user namespace
DEBU[0000] nsexec-1[14935]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[14931]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[14931]: update /proc/14935/uid_map to '0 2121269248 65536
'
DEBU[0000] nsexec-0[14931]: update /proc/14935/gid_map to '0 2121269248 65536
'
DEBU[0000] nsexec-1[14935]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[14935]: unshare remaining namespace (except cgroupns)
DEBU[0000] nsexec-1[14935]: spawn stage-2
DEBU[0000] nsexec-1[14935]: request stage-0 to forward stage-2 pid (14936)
DEBU[0000] nsexec-0[14931]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[14931]: forward stage-1 (14935) and stage-2 (14936) pids to runc
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-1[14935]: signal completion to stage-0
DEBU[0000] nsexec-0[14931]: stage-1 complete
DEBU[0000] nsexec-0[14931]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[14931]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[14931]: signalling stage-2 to run
DEBU[0000] nsexec-1[14935]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-0[14931]: stage-2 complete
DEBU[0000] nsexec-0[14931]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[14931]: <~ nsexec stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] child process in init()
DEBU[0000]libcontainer/cgroups/systemd/common.go:296 libcontainer/cgroups/systemd.generateDeviceProperties() skipping device /dev/char/10:200 for systemd: stat /dev/char/10:200: no such file or directory
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: exec /pause: permission denied
Thanks for all the details!
You are welcome.
But please, please, PLEASE understand that I'm spending a lot of time on this, and in part that is due to the terse bug reports and replies you post here, which just don't say enough. With Ubuntu I can spin up a server and spend a lot of time (even though having to spend a lot of time has a significant impact on my day-to-day tasks), but with COS, which is either closed source or only runs on Google Cloud, this is of course even more difficult.
For example, if you had mentioned that on Ubuntu the sysctl was the only issue, that it is a feature you have on GKE that you can turn off, and that when you do so everything works fine, that would have saved me several HOURS of trying it, writing the elaborate post I did here with clear step-by-step instructions, etc.
Another thing that would help is saying exactly how you run something to produce some output. If it is a k8s cluster, then any details needed in the setup, etc. More generally, please assume others don't know anything other than what you write. So explaining exactly what you did is critical, and maybe also ask yourself some questions before submitting, like: is this enough for someone on another laptop to reproduce what I have here, or is something missing or open to another interpretation? Am I being as clear as I can with what I write?
I'd really need you to start answering the questions I ask; if you can't answer some now and plan to answer them later, please do say so. And follow up on what you said you would do later (so far you said you would follow up on some things, but you didn't, and I don't know whether you consider them no longer relevant or will do them later; it seems weird, as for some things you do spend time but for others you don't, and I don't understand).
Also, I think if you don't know why something fixes something for you, then don't open PRs doing that change. We are debugging; we need to understand first what is happening (and while debugging we will find ways where things work, maybe more than one), and only then can we propose a fix (if any is really needed). If we try something for debugging that seems to help and you don't really understand why it helps or whether it fixes the problem, then opening a PR for that is not the flow I expect. Let's debug and understand first. We can open PRs later.
Like I mentioned in chat, Ubuntu works if I remove the somaxconn sysctl. It seems like in GKE kubelet was adding that to all pods, and I turned that "feature" off. Now the pod is able to come up just fine.
This is not at all what I understood from what you said in the chat, but great that it works! I'll need you once again to be more verbose here. How do you disable that "feature" in GKE? This bug report can be useful for others only if you share this.
Also, is that enabled by default on GKE? Is it part of some kubernetes upstream project? Or how is GKE adding that?
I'll have to follow up on whether somaxconn is intended to work for user-namespaced pods or not.
Cool, but please do follow up on this.
Regarding COS, is the filesystem that the resolv.conf lives on, when running from k8s, mounted with those flags?
If not, is the file a symlink, or is there a symlink in any of the path components leading to the resolv.conf?
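A sketch of one way to check both, using the resolv.conf path from the ls -ld output above (findmnt shows the filesystem the file lives on and its mount options, namei walks the path and flags symlinks):
findmnt -T /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf
namei -l /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/13780a324d8f2a5d2cf886a5b7b2cf549be345626e2405ef3b3a38e862fe27cf/resolv.conf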
Also, you didn't mention it at all, but did that containerd patch make the flow from k8s work? I mean creating a pod from k8s, using the patched containerd, and having the pod start with userns.
Regarding the last issue you pasted, about permission denied, can you try whether it is fixed with runc built from git with this PR applied? https://github.com/opencontainers/runc/pull/3753
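A sketch of one way to build runc with that PR applied (the branch name is arbitrary; building runc needs go and the libseccomp headers installed):
git clone https://github.com/opencontainers/runc && cd runc
git fetch origin pull/3753/head:pr-3753 && git checkout pr-3753
make
# then run the resulting ./runc binary manually, or point the runtime at it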
@rata - I appreciate your help on this, but please allow me to clarify.
I have mentioned, both in chat and now in this report, that Ubuntu seems to be working once I remove the somaxconn sysctl. I had posted that Ubuntu works if I remove the sysctl a week ago. I did not know myself why those sysctls were being added in GKE and only learned about that feature recently, and I am still following up on how it can be turned off. AFAIK there isn't a way to turn it off in GKE other than messing with the kubelet config, which is what I did, but I am following up with the GKE team to understand more.
I initially repro'd this issue on COS and only switched the repro to Ubuntu because in a comment you mentioned:
- Find a way to repro this on debian or some other distro that we can have easy acces on. This would help a lot
Again sorry for the confusion about this and the wasted hours. But if you are not clear on some details in chat please don't hesitate to clarify. You have been extremely gracious with your time and I don't want you to waste it because of miscommunication.
Let me try to give as much detail as I can about my setup now:
Ubuntu
I created a GKE cluster using the following command:
gcloud container clusters create host-user-vinaygo-ubuntu --num-nodes=1 --cluster-version=1.26.1-gke.1500 --enable-kubernetes-alpha --no-enable-autorepair --no-enable-autoupgrade --release-channel=rapid --image-type=ubuntu_containerd
Notice that the cluster above only has 1 node. After the cluster came up I ssh'd into the node and did the following:
Ubuntu Version
cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
uname -a
Linux gke-host-user-vinaygo-ub-default-pool-03c51c97-2nds 5.15.0-1024-gke #29-Ubuntu SMP Fri Dec 16 06:28:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Install containerd
1.7.0
I ran the following commands under /home/kubernetes/bin/
wget -c https://github.com/containerd/containerd/releases/download/v1.7.0/containerd-1.7.0-linux-amd64.tar.gz
tar -xzf containerd-1.7.0-linux-amd64.tar.gz
mount --bind /home/kubernetes/bin/bin/containerd /usr/bin/containerd
systemctl restart containerd.service
Install runc
1.1.4
I ran the following commands under /home/kubernetes/bin/
wget -c https://github.com/opencontainers/runc/releases/download/v1.1.4/runc.amd64
chmod u+x /home/kubernetes/bin/runc.amd64
mount --bind /home/kubernetes/bin/runc.amd64 /usr/sbin/runc
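(A quick sanity check, not part of the original write-up, that the bind-mounted binaries are the ones actually in use:)
containerd --version   # should now report v1.7.0
runc --version         # should now report 1.1.4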
Updated kubelet to not add somaxconn to any pod
sed -i 's/net.core.somaxconn=1024,//g' /etc/default/kubelet
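(Presumably the kubelet also needs a restart to pick this up; the COS write-up further down includes that step explicitly. Something like:)
systemctl restart kubelet
grep -c somaxconn /etc/default/kubelet   # should print 0 after the edit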
Created a Pod
Now that the node was set up correctly I created the following Pod.
gcloud container clusters get-credentials host-user-vinaygo-ubuntu
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: namespace-user-vinaygo
spec:
  hostUsers: false
  containers:
  - name: namespace-user-vinaygo
    image: debian
    command:
    - sleep
    - infinity
EOF
Note that this pod comes up fine.
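(Not in the original report, but a quick way to confirm the pod really got a user namespace; the mapping should not start at host UID 0:)
kubectl exec namespace-user-vinaygo -- cat /proc/self/uid_map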
COS
I created a GKE cluster using the following command:
gcloud container clusters create host-user-vinaygo-cos --num-nodes=1 --cluster-version=1.26.1-gke.1500 --enable-kubernetes-alpha --no-enable-autorepair --no-enable-autoupgrade --release-channel=rapid --image-type=cos_containerd
Notice that the cluster above only has 1 node. After the cluster came up I ssh'd into the node and did the following:
COS version
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
GOOGLE_METRICS_PRODUCT_ID=26
KERNEL_COMMIT_ID=44456f0e9d2cd7a9616fb0d05bc4020237839a5a
GOOGLE_CRASH_ID=Lakitu
VERSION=101
VERSION_ID=101
BUILD_ID=17162.40.56
uname -a
Linux gke-host-user-vinaygo-co-default-pool-56de25b8-1kzv 5.15.65+ #1 SMP Sat Jan 21 10:12:05 UTC 2023 x86_64 Intel(R) Xeon(R) CPU @ 2.20GHz GenuineIntel GNU/Linux
Install containerd
1.7.0
I ran the following commands under /home/kubernetes/bin/
wget -c https://github.com/containerd/containerd/releases/download/v1.7.0/containerd-1.7.0-linux-amd64.tar.gz
tar -xzf containerd-1.7.0-linux-amd64.tar.gz
mount --bind /home/kubernetes/bin/bin/containerd /usr/bin/containerd
systemctl restart containerd.service
Install runc
1.1.4 but also store the config.json before calling runc
I ran the following commands under /home/kubernetes/bin/
wget -c https://github.com/opencontainers/runc/releases/download/v1.1.4/runc.amd64
chmod u+x /home/kubernetes/bin/runc.amd64
cat > runcwrapper << 'EOF'
#!/bin/bash
# Log every invocation; when the 9th argument is --bundle (as in containerd's
# create invocation), snapshot the bundle dir (config.json included) before
# exec'ing the real runc.
echo "Starting my runc: $(date)" >> /tmp/runc-wrapper.log
echo "The command line args are $@" >> /tmp/runc-wrapper.log
if [ "${9}" = "--bundle" ]; then
  echo "Getting config.json" >> /tmp/runc-wrapper.log
  mkdir -p /tmp/runc-wrapper-debug-k8s/
  cp -ar "${10}" "/tmp/runc-wrapper-debug-k8s/$$/"
fi
exec /home/kubernetes/bin/runc.amd64 --debug "$@"
EOF
chmod u+x /home/kubernetes/bin/runcwrapper
mount --bind /home/kubernetes/bin/runcwrapper /usr/bin/runc
Updated kubelet to not add somaxconn to any pod
sed -i 's/net.core.somaxconn=1024,//g' /etc/default/kubelet
Created a Pod
Now that the node was set up correctly I created the following Pod.
gcloud container clusters get-credentials host-user-vinaygo-cos
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: namespace-user-vinaygo
spec:
  hostUsers: false
  containers:
  - name: namespace-user-vinaygo
    image: debian
    command:
    - sleep
    - infinity
EOF
The pod is stuck in the ContainerCreating state.
Now I investigate the config.json that the wrapper saved.
ls /tmp/runc-wrapper-debug-k8s/
8917 9004 9119
cd /tmp/runc-wrapper-debug-k8s/9119
ls -la
total 24
drwx--x--- 3 root 1354629120 200 Mar 27 21:05 .
drwxr-xr-x 31 root root 620 Mar 27 21:11 ..
-rw-r--r-- 1 root root 89 Mar 27 21:05 address
-rw-r--r-- 1 root root 9722 Mar 27 21:05 config.json
prwx------ 1 root root 0 Mar 27 21:05 log
-rw------- 1 root root 23 Mar 27 21:05 options.json
drwxr-xr-x 2 1354629120 1354629120 80 Mar 27 21:05 rootfs
-rw------- 1 root root 0 Mar 27 21:05 runtime
-rw------- 1 root root 32 Mar 27 21:05 shim-binary-path
lrwxrwxrwx 1 root root 121 Mar 27 21:05 work -> /var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/868a8dcae720dbef6d70ca24f8df075b231d70c4d4a2868369dac11577cf4e02
Now I first try to run this config.json
I run the following command in /tmp/runc-wrapper-debug-k8s/9119 which has the config.json file
/tmp/runc-wrapper-debug-k8s/9119 # /home/kubernetes/bin/runc.amd64 --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[13009]: => nsexec container setup
DEBU[0000] nsexec[13009]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[13009]: ~> nsexec stage-0
DEBU[0000] nsexec-0[13009]: spawn stage-1
DEBU[0000] nsexec-0[13009]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[13014]: ~> nsexec stage-1
DEBU[0000] nsexec-1[13014]: unshare user namespace
DEBU[0000] nsexec-1[13014]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[13009]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[13009]: update /proc/13014/uid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-0[13009]: update /proc/13014/gid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-1[13014]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[13014]: unshare remaining namespace (except cgroupns)
FATA[0000] nsexec-0[13009]: failed to open mount source /run/containerd/io.containerd.grpc.v1.cri/sandboxes/868a8dcae720dbef6d70ca24f8df075b231d70c4d4a2868369dac11577cf4e02/shm: No such file or directory
FATA[0000] nsexec-1[13014]: failed to receive fd from unix socket 8: Invalid argument
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: can't get final child's PID from pipe: EOF
This error is only because containerd cleaned up the sandbox.
So I update the config.json with:
sed -i 's@/run/containerd/io.containerd.grpc.v1.cri/sandboxes/868a8dcae720dbef6d70ca24f8df075b231d70c4d4a2868369dac11577cf4e02/shm@/dev/shm@g' config.json
Now I rerun:
/tmp/runc-wrapper-debug-k8s/9119 # /home/kubernetes/bin/runc.amd64 --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[14379]: => nsexec container setup
DEBU[0000] nsexec[14379]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[14379]: ~> nsexec stage-0
DEBU[0000] nsexec-0[14379]: spawn stage-1
DEBU[0000] nsexec-0[14379]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[14381]: ~> nsexec stage-1
DEBU[0000] nsexec-1[14381]: unshare user namespace
DEBU[0000] nsexec-1[14381]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[14381]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[14379]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[14379]: update /proc/14381/uid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-0[14379]: update /proc/14381/gid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-1[14381]: unshare remaining namespace (except cgroupns)
FATA[0000] nsexec-0[14379]: failed to open mount source /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/868a8dcae720dbef6d70ca24f8df075b231d70c4d4a2868369dac11577cf4e02/resolv.conf: No such file or directory
FATA[0000] nsexec-1[14381]: failed to receive fd from unix socket 8: Invalid argument
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: can't get final child's PID from pipe: EOF
This error is because containerd cleaned up the sandbox.
So I update the config.json with:
sed -i 's@/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/868a8dcae720dbef6d70ca24f8df075b231d70c4d4a2868369dac11577cf4e02/resolv.conf@/etc/resolv.conf@g' config.json
Now I rerun the following:
/tmp/runc-wrapper-debug-k8s/9119 # /home/kubernetes/bin/runc.amd64 --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[15862]: => nsexec container setup
DEBU[0000] nsexec[15862]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[15862]: ~> nsexec stage-0
DEBU[0000] nsexec-0[15862]: spawn stage-1
DEBU[0000] nsexec-0[15862]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[15868]: ~> nsexec stage-1
DEBU[0000] nsexec-1[15868]: unshare user namespace
DEBU[0000] nsexec-1[15868]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[15862]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[15862]: update /proc/15868/uid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-0[15862]: update /proc/15868/gid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-1[15868]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[15868]: unshare remaining namespace (except cgroupns)
DEBU[0000] nsexec-1[15868]: spawn stage-2
DEBU[0000] nsexec-1[15868]: request stage-0 to forward stage-2 pid (15869)
DEBU[0000] nsexec-0[15862]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[15862]: forward stage-1 (15868) and stage-2 (15869) pids to runc
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-1[15868]: signal completion to stage-0
DEBU[0000] nsexec-0[15862]: stage-1 complete
DEBU[0000] nsexec-0[15862]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[15862]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[15862]: signalling stage-2 to run
DEBU[0000] nsexec-1[15868]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-0[15862]: stage-2 complete
DEBU[0000] nsexec-0[15862]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[15862]: <~ nsexec stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] child process in init()
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: error during container init: error mounting "/etc/resolv.conf" to rootfs at "/etc/resolv.conf": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/8), flags: 0x5021: operation not permitted
Here are the relevant permissions on COS:
ls -dl / /var /var/lib/ /var/lib/containerd/ /var/lib/containerd/io.containerd.grpc.v1.cri/ /var/lib/containerd/io.containerd.grpc.v1.cri/containers/ /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes /var/lib/containerd/io.containerd.grpc.v1.cri/containers/17459338903803feb96dbcc21fabda6bf4f89d259be7b27964370a119513b723/ /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/02101d92680d89025fdb18bf26656b41cc859f1f8fe515ed587c0374d9673bad/
drwxr-xr-x 20 root root 4096 Jan 21 10:59 /
drwxr-xr-x 9 root root 4096 Mar 27 20:48 /var
drwxr-xr-x 23 root root 4096 Mar 28 00:00 /var/lib/
drwxr-xr-x 12 root root 4096 Mar 27 21:03 /var/lib/containerd/
drwxr-xr-x 4 root root 4096 Mar 27 20:53 /var/lib/containerd/io.containerd.grpc.v1.cri/
drwxr-xr-x 25 root root 4096 Mar 27 20:55 /var/lib/containerd/io.containerd.grpc.v1.cri/containers/
drwxr-xr-x 2 root root 4096 Mar 27 21:02 /var/lib/containerd/io.containerd.grpc.v1.cri/containers/17459338903803feb96dbcc21fabda6bf4f89d259be7b27964370a119513b723/
drwxr-xr-x 15 root root 4096 Mar 28 00:12 /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes
drwxr-xr-x 2 root root 4096 Mar 27 20:53 /var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/02101d92680d89025fdb18bf26656b41cc859f1f8fe515ed587c0374d9673bad/
ls -ld /dev/shm /etc/resolv.conf
drwxrwxrwt 2 root root 40 Mar 27 20:48 /dev/shm
lrwxrwxrwx 1 root root 32 Jan 21 10:34 /etc/resolv.conf -> /run/systemd/resolve/resolv.conf
ls -la /run/systemd/resolve
total 8
drwxr-xr-x 2 systemd-resolve systemd-resolve 100 Mar 27 20:55 .
drwxr-xr-x 24 root root 580 Mar 27 21:41 ..
srw-rw-rw- 1 systemd-resolve systemd-resolve 0 Mar 27 20:48 io.systemd.Resolve
-rw-r--r-- 1 systemd-resolve systemd-resolve 831 Mar 27 20:48 resolv.conf
-rw-r--r-- 1 systemd-resolve systemd-resolve 961 Mar 27 20:48 stub-resolv.conf
I also checked some mounts
mount | grep /etc
tmpfs on /etc/machine-id type tmpfs (ro,size=804600k,nr_inodes=819200,mode=755)
overlayfs on /etc type overlay (rw,relatime,lowerdir=/etc,upperdir=/tmp/etc_overlay/etc,workdir=/tmp/etc_overlay/.work)
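(Aside, not part of the original report: since /etc/resolv.conf is a symlink into /run, the flags that matter for the bind source are those of the mount holding the symlink target rather than the /etc overlay. On many systemd-based systems /run is a tmpfs mounted with nosuid,nodev, and inside a user namespace the kernel refuses a read-only bind remount that would drop such locked flags, which would explain the EPERM above. One way to check:)
findmnt /run -o TARGET,FSTYPE,OPTIONS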
Now I looked at the mounts like you suggested and noticed the following in config.json:
{
"destination": "/dev/shm",
"type": "bind",
"source": "/dev/shm",
"options": [
"rbind",
"ro",
"nosuid",
"nodev",
"noexec"
]
},
{
"destination": "/etc/resolv.conf",
"type": "bind",
"source": "/etc/resolv.conf",
"options": [
"rbind",
"ro"
]
}
As you mentioned, the /etc/resolv.conf mount did not have nosuid, nodev, noexec, so I updated config.json to add them. After the update it looks like:
{
"destination": "/dev/shm",
"type": "bind",
"source": "/dev/shm",
"options": [
"rbind",
"ro",
"nosuid",
"nodev",
"noexec"
]
},
{
"destination": "/etc/resolv.conf",
"type": "bind",
"source": "/etc/resolv.conf",
"options": [
"rbind",
"ro",
"nosuid",
"nodev",
"noexec"
]
}
Now I rerun the container using runc:
/home/kubernetes/bin/runc.amd64 --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[19004]: => nsexec container setup
DEBU[0000] nsexec[19004]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[19004]: ~> nsexec stage-0
DEBU[0000] nsexec-0[19004]: spawn stage-1
DEBU[0000] nsexec-0[19004]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[19009]: ~> nsexec stage-1
DEBU[0000] nsexec-1[19009]: unshare user namespace
DEBU[0000] nsexec-1[19009]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[19004]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[19004]: update /proc/19009/uid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-0[19004]: update /proc/19009/gid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-1[19009]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[19009]: unshare remaining namespace (except cgroupns)
DEBU[0000] nsexec-1[19009]: spawn stage-2
DEBU[0000] nsexec-1[19009]: request stage-0 to forward stage-2 pid (19010)
DEBU[0000] nsexec-0[19004]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[19004]: forward stage-1 (19009) and stage-2 (19010) pids to runc
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-1[19009]: signal completion to stage-0
DEBU[0000] nsexec-0[19004]: stage-1 complete
DEBU[0000] nsexec-0[19004]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[19004]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[19004]: signalling stage-2 to run
DEBU[0000] nsexec-1[19009]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-0[19004]: stage-2 complete
DEBU[0000] nsexec-0[19004]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[19004]: <~ nsexec stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] child process in init()
DEBU[0000]libcontainer/cgroups/systemd/common.go:296 libcontainer/cgroups/systemd.generateDeviceProperties() skipping device /dev/char/10:200 for systemd: stat /dev/char/10:200: no such file or directory
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: exec /pause: permission denied
Now it looks like the error for mounting is fixed but we have a different error.
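(One possible culprit worth checking here, purely speculation and not established in the thread: the bundle was copied under /tmp, and if /tmp is mounted noexec on COS, exec'ing /pause from that copy would be refused regardless of the file's permission bits. A quick check:)
findmnt -T /tmp/runc-wrapper-debug-k8s/9119/rootfs -o TARGET,OPTIONS   # look for noexec
ls -l /tmp/runc-wrapper-debug-k8s/9119/rootfs/pause                    # execute bit should be set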
Now I build runc from git with this PR applied: https://github.com/opencontainers/runc/pull/3753
I did this on COS, so I had to run the toolbox container:
toolbox
apt-get install git wget libseccomp-dev
git clone https://github.com/kolyshkin/runc.git -b user-exec
cd runc
make static
cp runc /media/root/home/kubernetes/bin/runc.git
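(A quick check, not in the report, that the static build is usable from the host after leaving the toolbox:)
/home/kubernetes/bin/runc.git --version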
Then I exited the toolbox container and ran the following command in the same folder as the config.json from the previous attempts.
/tmp/runc-wrapper-debug-k8s/9119 # /home/kubernetes/bin/runc.git --debug --systemd-cgroup run mycontainer
DEBU[0000] nsexec[61499]: => nsexec container setup
DEBU[0000] nsexec[61499]: update /proc/self/oom_score_adj to '-998'
DEBU[0000] nsexec-0[61499]: ~> nsexec stage-0
DEBU[0000] nsexec-0[61499]: spawn stage-1
DEBU[0000] nsexec-0[61499]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[61505]: ~> nsexec stage-1
DEBU[0000] nsexec-1[61505]: unshare user namespace
DEBU[0000] nsexec-1[61505]: request stage-0 to map user namespace
DEBU[0000] nsexec-0[61499]: stage-1 requested userns mappings
DEBU[0000] nsexec-0[61499]: update /proc/61505/uid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-0[61499]: update /proc/61505/gid_map to '0 1354629120 65536
'
DEBU[0000] nsexec-1[61505]: request stage-0 to map user namespace
DEBU[0000] nsexec-1[61505]: unshare remaining namespaces (except cgroupns)
DEBU[0000] nsexec-1[61505]: request stage-0 to send mount sources
DEBU[0000] nsexec-0[61499]: stage-1 requested to open mount sources
DEBU[0000] nsexec-0[61499]: ~> sending fd for: /dev/shm
DEBU[0000] nsexec-0[61499]: ~> sending fd for: /etc/resolv.conf
DEBU[0000] nsexec-1[61505]: spawn stage-2
DEBU[0000] nsexec-1[61505]: request stage-0 to forward stage-2 pid (61506)
DEBU[0000] nsexec-0[61499]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[61499]: forward stage-1 (61505) and stage-2 (61506) pids to runc
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-1[61505]: signal completion to stage-0
DEBU[0000] nsexec-0[61499]: stage-1 complete
DEBU[0000] nsexec-0[61499]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[61499]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[61499]: signalling stage-2 to run
DEBU[0000] nsexec-1[61505]: <~ nsexec stage-1
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-0[61499]: stage-2 complete
DEBU[0000] nsexec-0[61499]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[61499]: <~ nsexec stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] child process in init()
ERRO[0000]utils.go:62 main.fatalWithCode() runc run failed: unable to start container process: exec: "/pause": permission denied
That PR does not seem to fix this.
Regarding running bats on COS
I am using bats-core 1.9.0 on COS and getting the following errors.
../bats-core-1.9.0/bin/bats -t tests/integration/userns.bats
1..4
not ok 1 userns with simple mount
# (in test file tests/integration/userns.bats, line 34)
# `[ "$status" -eq 0 ]' failed
# runc spec (status=0):
#
# runc run test_busybox (status=1):
# time="2023-03-28T01:19:24Z" level=error msg="runc run failed: unable to start container process: exec: \"sh\": executable file not found in $PATH"
not ok 2 userns with 2 inaccessible mounts
# (in test file tests/integration/userns.bats, line 52)
# `[ "$status" -eq 0 ]' failed
# runc spec (status=0):
#
# runc run test_busybox (status=1):
# time="2023-03-28T01:19:24Z" level=error msg="runc run failed: unable to start container process: exec: \"sh\": executable file not found in $PATH"
not ok 3 userns with inaccessible mount + exec
# (in test file tests/integration/userns.bats, line 62)
# `[ "$status" -eq 0 ]' failed
# runc spec (status=0):
#
# runc run -d --console-socket /tmp/bats-run-86Yrds/runc.9V71dI/tty/sock test_busybox (status=1):
# time="2023-03-28T01:19:25Z" level=error msg="runc run failed: unable to start container process: exec: \"sh\": executable file not found in $PATH"
ok 4 userns with bind mount before a cgroupfs mount # skip test requires cgroups_v1
Regarding cgroup v1 and cgroup v2
I am using cgroup v2. I set the SystemdCgroup = true option in the containerd config, and when I run the runc command directly I pass the --systemd-cgroup flag.
cat /etc/containerd/config.toml | grep Cgroup
SystemdCgroup = true
COS supports cgroup v2, which I checked by running the following command:
grep cgroup /proc/filesystems
nodev cgroup
nodev cgroup2
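(For what it's worth, /proc/filesystems only shows what the kernel supports; a more direct check, not in the original report, of which hierarchy is actually mounted:)
stat -fc %T /sys/fs/cgroup   # prints cgroup2fs on a unified (v2-only) setup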
I don't plan to test cgroup v1 as all GKE setup is configured to cgroup v2.
SupplementaryGroups=0
Since none of the host dirs in question are actually 750, I don't think setting this would change anything, so I am not planning to do it.
Strace output
I already shared the strace output with you before, but please let me know if you need strace again from any of the steps above.
Regarding the PR
Given that containerd sets these options for other mounts, it makes sense to add them to the /etc/resolv.conf mount as well. Thanks for the feedback, but I disagree that it is not the right behavior for containerd to set these options consistently.
Hopefully that answers your questions. I was going to answer them today anyway, but I first posted about the progress that came from your idea of adding the options to the mount. I posted here before answering the other questions because it was significant that the mount errors were fixed, and I had already told you in Slack that I was working on the other things you asked for (they seemed less important than letting you know that changing the mount options fixes the mounting error).
I hope now we are on the same page and hopefully we can get to the bottom of this issue.
@rata - I think I got this to work on COS.
Here is what I did:
- Created a GKE COS Cluster using
gcloud container clusters create host-user-vinaygo-cos --num-nodes=1 --cluster-version=1.26.1-gke.1500 --enable-kubernetes-alpha --no-enable-autorepair --no-enable-autoupgrade --release-channel=rapid --image-type=cos_containerd
- SSH into the node and build containerd with my PR https://github.com/containerd/containerd/pull/8309. I did not have https://github.com/opencontainers/runc/pull/3753.
toolbox
apt-get update
apt-get install git wget
wget -c https://go.dev/dl/go1.20.2.linux-amd64.tar.gz
tar -C /usr/local -xzf go1.20.2.linux-amd64.tar.gz
export PATH=$PATH:/usr/local/go/bin
git clone https://github.com/vinayakankugoyal/containerd.git -b fixresolv
cd containerd
make binaries
cp ./bin/containerd /media/root/home/kubernetes/bin/containerd.git
- Then I ran the following to run containerd that I just built
mount --bind /home/kubernetes/bin/containerd.git /usr/bin/containerd
systemctl restart containerd.service
- Then I updated the runc version to 1.1.4
wget -c https://github.com/opencontainers/runc/releases/download/v1.1.4/runc.amd64
chmod 777 /home/kubernetes/bin/runc.amd64
mount --bind /home/kubernetes/bin/runc.amd64 /usr/bin/runc
- Then I updated kubelet in GKE to not add somaxconn to all the pods
sed -i 's/net.core.somaxconn=1024,//g' /etc/default/kubelet
systemctl restart kubelet
- Then I created the following Pod to see if it comes up
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: namespace-user-vinaygo
spec:
  hostUsers: false
  containers:
  - name: namespace-user-vinaygo
    image: debian
    command:
    - sleep
    - infinity
EOF
- Now back on the node I execed into the Pod to make sure it was running in user ns
crictl ps -a --name namespace-user
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
55d0a000a2b3a f5b06fd900402 19 minutes ago Running namespace-user-vinaygo 0 f5f27e3f69294 namespace-user-vinaygo
crictl exec -it 55d0a000a2b3a /bin/bash
root@namespace-user-vinaygo:/# ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 2396 568 ? Ss 20:08 0:00 sleep infinity
root 348 0.0 0.0 4032 3404 pts/0 Ss 20:29 0:00 /bin/bash
root 355 0.0 0.0 6760 2972 pts/0 R+ 20:29 0:00 ps -aux
root@namespace-user-vinaygo:/# readlink /proc/self/ns/user
user:[4026532465]
root@namespace-user-vinaygo:/# cat /proc/self/uid_map
0 3306815488 65536
root@namespace-user-vinaygo:/#
- Now on the node I checked the UID of the process running the sleep infinity command:
ps ax o user:16,pid,command | grep "sleep infinity"
3306815488 34311 sleep infinity
root 40716 grep --colour=auto sleep infinity
As we can see the container process is running as UID 3306815488.
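(An additional cross-check, not in the original report: the pod's process should be in a different user namespace than the host's init user namespace.)
readlink /proc/1/ns/user                                        # host user namespace
readlink /proc/$(pgrep -f 'sleep infinity' | head -n1)/ns/user  # pod process's user namespace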
@vinayakankugoyal
I have mentioned both in chat and in the report now that Ubuntu seems to be working once I remove the somaxconn sysctl. I had posted that Ubuntu works if I remove the sysctl a week ago.
Sure, but that is manually running runc, not starting a k8s pod. My understanding was that runc was working when run manually, but that the k8s pod was still failing on Ubuntu for some reason other than the sysctl. Miscommunication, that is all :)
Let me try to give as much details as I can about my setup now:
Thanks, this report really helps A LOT.
COS
sed -i 's/net.core.somaxconn=1024,//g' /etc/default/kubelet
Ohh, great to know what you are doing. Then, can you paste the /etc/default/kubelet file? Or at least the section mentioning this sysctl and the other sysctl-relevant sections? I'd like to see whether, as I imagine, the kubelet is adding that unsafe sysctl to its allowed list, or what it is doing with it.
I'm curious to understand what GKE is doing here. My guess is that the kubelet allows that unsafe sysctl to be used, and that a mutating webhook adds those sysctls to the pod, or something like that. But if it is not safe on one node, I don't see how the hook would realize that... Maybe something completely different is happening?
To verify this, can you:
- Create a pod (without user namespaces, it doesn't matter now) before modifying the /etc/default/kubelet file and get the output of kubectl get pod -o yaml? I want to see if the sysctls are set in the pod security context or something.
- Get the same kubectl output, but after changing the kubelet config and for a pod with user namespaces enabled, to see that it is indeed not there as it was before (example commands sketched below).
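(For concreteness, not part of the original comment, the two checks could look like the following; the pod name in the first command is a placeholder.)
kubectl get pod <pod-without-userns> -o jsonpath='{.spec.securityContext.sysctls}'      # before editing /etc/default/kubelet
kubectl get pod namespace-user-vinaygo -o jsonpath='{.spec.securityContext.sysctls}'    # after the change, pod with hostUsers: false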
....
I also checked some mounts
mount | grep /etc
tmpfs on /etc/machine-id type tmpfs (ro,size=804600k,nr_inodes=819200,mode=755)
overlayfs on /etc type overlay (rw,relatime,lowerdir=/etc,upperdir=/tmp/etc_overlay/etc,workdir=/tmp/etc_overlay/.work)
Right, but as /etc/resolv.conf is a symlink to another path outside /etc, we need to see the mount options of that path too. Just for completeness, can you post the output of mount | grep /run?
But what we are really interested in is the output of mount | grep var, as that is where the original resolv.conf is mounted from. My guess is that it is mounted with those options, but let's verify to be sure.
....
DEBU[0000]libcontainer/cgroups/systemd/common.go:296 libcontainer/cgroups/systemd.generateDeviceProperties() skipping device /dev/char/10:200 for systemd: stat /dev/char/10:200: no such file or directory
ERRO[0000]utils.go:61 main.fatalWithCode() runc run failed: unable to start container process: exec /pause: permission denied
I'm still curious about why you hit this error when running manually. Is it something obvious, like the execute bit missing on the binary, maybe due to some cp option missing?
There must be some difference between when those flags are added by containerd (which seems to work) and when we add them manually here...
Regarding cgroupsv1/v2: agree. It was relevant to know if you were using cgroups v1 as you could trigger some bugs with that, but if you are not using it, no need to try it out. Regarding supplemental groups: Those are not the only relevant directories, though. But yeah, no need to try it now that it works :)
Regarding the PR: my point was that we didn't know whether this helps any real use case (we do know now). If we want to open the PR for consistency, we should say so. If we need these options to make a real-world OS work, we need to say that instead. Until we know which case it is, we can't really open the PR with the reasons stated so it can be properly reviewed (reviewing a change that is needed to fix COS is not the same as reviewing one the author merely thinks is nice to have).
@rata - I think I got this to work on COS.
Great! Then what caused the permission denied error before, have you figured it out?
Regarding somax sysctl
Do you want to investigate further what we can do and follow up on that? I'll check what crun does too, just in case.
Regarding possible remount on runc
Do you want to open an issue here in runc and ask about remounting with those flags, even if they are not specified? Crun (another OCI compatible runtime) is doing that: https://github.com/containers/crun/blob/main/src/libcrun/linux.c#L919-L946.
We might want to do this in runc to keep compatibility with crun, or maybe not. I think opening an issue to discuss with the maintainers makes sense. If there is agreement on going down that route and you want to implement it, that would be great! :)
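(A rough sketch, not crun's actual code, of the fallback being discussed: if the plain read-only bind remount is refused, retry while preserving the restrictive options already present on the source mount. SRC and DST are placeholders for the bind-mount source and the target path.)
src_opts=$(findmnt -no OPTIONS --target "$SRC" | tr ',' '\n' \
           | grep -E '^(nosuid|nodev|noexec|noatime|nodiratime|relatime)$' | paste -sd, -)
mount -o remount,bind,ro "$DST" \
  || mount -o "remount,bind,ro${src_opts:+,$src_opts}" "$DST"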
Hope I'm not adding too much entropy to this discussion; this issue piqued my interest and, following @vinayakankugoyal's steps, I managed to also repro it on COS.
With runc at 1.1.4 (fetched from GitHub the same way as above) and containerd at 1.6.2 (the current COS version), I then bisected the issue down to a7adeb69769395193a0278c4bda6068011d06cde; the symptoms are the same as posted originally:
$ k describe pod namespace-user-lrascao
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/1091069a744cd525029de4a7b59d4f2ac3a9784bf2d64dbddb78b070e4f0481f/resolv.conf" to rootfs at "/etc/resolv.conf": mount /proc/self/fd/7:/etc/resolv.conf (via /proc/self/fd/9), flags: 0x5021: operation not permitted: unknown
With runc at 1.1.4 (fetched from GitHub the same way as above) and containerd at 1.6.2 (the current COS version), I then bisected the issue down to a7adeb69769395193a0278c4bda6068011d06cde; the symptoms are the same as posted originally:
@lrascao That is a commit on containerd, right? It seems to be the one I wrote "cri: Support pods with user namespaces".
Thanks for the effort, but it doesn't really add new information: before that commit user namespaces are not used, so containerd ignores all the user namespace settings and a regular pod is created. With that commit, the container with userns is created and, due to the special mount options of COS, it fails in that environment.
Thanks anyways :)
I opened a discussion thread in runc regarding remounting bind mounts if they fail with the right options. https://github.com/opencontainers/runc/discussions/3801.
@vinayakankugoyal friendly ping? I'll be busy with Kubecon next week, but wanted to re-bump this