sysbox `chattr +i` receives `Operation not permitted while setting flags` in pod with sysbox runtime.

Hi @ctalledo,

I am trying to get chattr +i filename working in a sysbox pod with the following definition:

apiVersion: v1
kind: Pod
metadata:
  name: aidan-test-some-things
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  runtimeClassName: sysbox-runc
  containers:
  - name: ubu-bio-systemd-docker
    image: registry.nestybox.com/nestybox/ubuntu-bionic-systemd-docker
    command: ["/sbin/init"]
  restartPolicy: Never

Inside the pod I see the capability cap_linux_immutable (I configured crio to add this capability to the default set), and a full capability set as expected. However, when running chattr +i on a file, I get the following error:

root@aidan-test-some-things:~# touch test.txt
root@aidan-test-some-things:~# ls -la
total 16
drwx------ 1 root root 4096 Aug 19 21:19 .
dr-xr-xr-x 1 root root 4096 Aug 19 21:18 ..
-rw-r--r-- 1 root root 3106 Apr  9  2018 .bashrc
-rw-r--r-- 1 root root  148 Aug 17  2015 .profile
-rw-r--r-- 1 root root    0 Aug 19 21:19 test.txt
root@aidan-test-some-things:~# chattr +i test.txt
chattr: Operation not permitted while setting flags on test.txt

I see the same behavior on Linux kernel 6.5.0 + k8s 1.29 + sysbox 0.6.4 and on Linux kernel 5.15.0 + k8s 1.28 + sysbox 0.6.4. The outputs here are from the latter, but they are nearly identical.

I have attached the strace output and the crio config to this issue. Let me know any other information that would be helpful or if I am missing something.

strace.txt crio-config.txt

Aug 19 '24 21:08 AidanAbd

Hey @ctalledo @rodnymolina, could I get a status update here? If y'all do not have time to look into it, I can start building from source and testing but this has become high priority for us.

Aug 28 '24 17:08 AidanAbd

Hi @AidanAbd, sorry for the delay in getting back to you.

I don't think you need to add any capabilities to CRI-O's config, since Sysbox enables all of them for the init process of every container.

Now, the issue that you are reporting doesn't seem trivial to me since we are getting an EPERM from the kernel while trying to execute that IOCTL that we see in the strace.

I went ahead and reproduced this issue in my own setup (regular docker+sysbox env), so there's nothing k8s-specific here. We'll need to look at this one in more detail since I'm not sure why the kernel is complaining.

Sep 17 '24 05:09 rodnymolina

Sounds good. Excited for any updates but understand this one might take a bit longer.

Sep 17 '24 18:09 AidanAbd

> Now, the issue that you are reporting doesn't seem trivial to me since we are getting an EPERM from the kernel while trying to execute that IOCTL that we see in the strace.

Correct; the specific IOCTL that the kernel is blocking for the process running inside the Sysbox container is FS_IOC_SETFLAGS with the FS_IMMUTABLE_FL flag:

ioctl(3, FS_IOC_SETFLAGS, [FS_IMMUTABLE_FL|FS_EXTENT_FL]) = -1 EPERM (Operation not permitted)

That's unexpected: per the ioctl_iflags man page, a process with CAP_LINUX_IMMUTABLE should be able to set this flag:

FS_IMMUTABLE_FL 'i'
The file is immutable: no changes are permitted to the file contents or metadata 
(permissions, timestamps, ownership, link count, and so on).  (This restriction 
applies even to the superuser.)  Only a privileged process (CAP_LINUX_IMMUTABLE) 
can set or clear this attribute.

The process inside the Sysbox container has that capability set within its user namespace, so it's the kernel unexpectedly blocking the process from setting the immutability flag on the file. It seems the kernel is only allowing this from within the init user-namespace, which does not appear to be conceptually correct.
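
For anyone who wants to reproduce this without chattr, here is a minimal Python sketch of the same ioctl sequence seen in the strace (read the flags, OR in the immutable bit, write them back). The helper names `add_immutable` and `try_set_immutable` are made up for illustration, and the ioctl numbers are the ones I'd expect on a 64-bit Linux ABI, so treat them as assumptions on other platforms.

```python
import errno
import fcntl
import struct
import sys

# Values from <linux/fs.h>. The two ioctl numbers assume a 64-bit Linux ABI
# (the macros encode sizeof(long) == 8).
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_IMMUTABLE_FL = 0x00000010


def add_immutable(flags: int) -> int:
    """Return the flag word with the immutable bit set (hypothetical helper)."""
    return flags | FS_IMMUTABLE_FL


def try_set_immutable(path: str) -> str:
    """Attempt what `chattr +i` does; return "OK" or the errno name."""
    try:
        with open(path, "rb") as f:
            # The kernel reads and writes a 32-bit value here, even though
            # the ioctl macros are declared with `long`.
            buf = fcntl.ioctl(f.fileno(), FS_IOC_GETFLAGS, struct.pack("I", 0))
            current = struct.unpack("I", buf)[0]
            fcntl.ioctl(f.fileno(), FS_IOC_SETFLAGS,
                        struct.pack("I", add_immutable(current)))
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
    return "OK"


if __name__ == "__main__" and len(sys.argv) > 1:
    print(try_set_immutable(sys.argv[1]))
```

Running this against test.txt inside the Sysbox container should report the same EPERM that chattr shows, assuming the failure really is in the FS_IOC_SETFLAGS call. Note that if it succeeds (e.g. as root on the host), the file stays immutable until the flag is cleared with `chattr -i`.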

Jan 10 '25 02:01 ctalledo

> It seems the kernel is only allowing this from within the init user-namespace, which does not appear to be conceptually correct.

This kernel behavior may be on purpose though: say the root user on a Linux host sets the immutability flag on a file, and then that file is mounted into a Sysbox container (i.e., a user-namespace). It would then be insecure to allow the root user in the Sysbox container (which maps to an unprivileged user on the host) to remove the immutability flag on that file.

But on the other hand, for a file that exists purely within a Sysbox container (i.e., not mounted from the host), the root user in the Sysbox container should ideally be able to set or clear the immutability flag on that file.

However, since I don't believe the kernel has a mechanism to differentiate between these scenarios, it's taking the conservative route and only allowing the flag to be set by processes that have CAP_LINUX_IMMUTABLE at host level (i.e., in the init user-namespace).
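
To see the two conditions side by side from inside a container, a rough Python sketch like the following can help. It checks whether the process has CAP_LINUX_IMMUTABLE in its effective set and whether it appears to be in the init user-namespace. The function names are hypothetical, and the uid_map test is only a heuristic: it assumes the init user-namespace shows the identity mapping over the full 32-bit range, which a child namespace could in principle also be given.

```python
CAP_LINUX_IMMUTABLE = 9  # bit index from <linux/capability.h>


def has_cap(status_text: str, cap: int = CAP_LINUX_IMMUTABLE) -> bool:
    """True if the CapEff line of /proc/<pid>/status has the given bit set."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return bool((int(line.split()[1], 16) >> cap) & 1)
    return False


def looks_like_init_userns(uid_map_text: str) -> bool:
    """Heuristic: identity map over the full 32-bit UID range."""
    return uid_map_text.split() == ["0", "0", "4294967295"]


if __name__ == "__main__":
    with open("/proc/self/status") as f:
        print("CAP_LINUX_IMMUTABLE effective:", has_cap(f.read()))
    with open("/proc/self/uid_map") as f:
        print("looks like init user-namespace:", looks_like_init_userns(f.read()))
```

Inside a Sysbox container I would expect the first line to print True and the second False, which is the combination the kernel appears to reject for FS_IMMUTABLE_FL.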

Jan 10 '25 02:01 ctalledo