
--storage-opt size=1mb no longer honored

Open SkyperTHC opened this issue 2 years ago • 8 comments

It is no longer possible to limit the size of a container's root filesystem with --storage-opt. An attacker can therefore exhaust the host's inodes or disk space, which can cause the docker daemon to exit ungracefully.

expected behaviour

# docker run --rm  --storage-opt size=1mb  alpine dd bs=1M count=2 if=/dev/zero of=/dump.zero
dd: error writing '/dump.zero': No space left on device
1+0 records in
0+0 records out

=> Correct to fail after 1mb

experienced behaviour:

# docker run --rm --runtime=sysbox-runc  --storage-opt size=1mb  alpine dd bs=1M count=2 if=/dev/zero of=/dump.zero
2+0 records in
2+0 records out

=> This should have failed after 1mb

Checking the underlying FS shows that the prj-id is 0 ('no limit set'):

# docker run --rm --runtime=sysbox-runc --name foobar --storage-opt size=1mb -d alpine sleep 100
# lsattr -dp $(docker inspect foobar --format '{{.GraphDriver.Data.UpperDir }}')
    0 --------------e----- /sf/docker/overlay2/ed8815741a4e829f3ef3f730a8f521eb4cd7e9dc859b129ec377d48111d5d8d0/diff

=> The first '0' should be the XFS project ID. I believe 0 means that the storage-limit has not been set.

Running the same without sysbox-runc shows that docker correctly sets the limit:

# docker run --rm --name foobar --storage-opt size=1mb -d alpine sleep 100
# lsattr -dp $(docker inspect foobar --format '{{.GraphDriver.Data.UpperDir }}')
   63 -------------------- /sf/docker/overlay2/7edf9eef2393201602370ed59e697bccd0962e0a7e2f4299f9079981916e50db/diff
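The comparison above can be scripted. The following is a minimal sketch, not part of Sysbox or Docker: `projid_status` is a hypothetical helper name, and it assumes the first field of a `lsattr -dp` output line is the project ID, as in the two outputs shown.

```shell
# Hypothetical helper (not part of Sysbox or Docker).
# Reads one line of `lsattr -dp` output on stdin and reports whether
# the directory has a non-zero project ID assigned.
projid_status() {
    # `read` strips the leading padding; the first field is the project ID.
    read -r projid _rest
    if [ "$projid" = "0" ]; then
        echo "no quota project (ID 0)"
    else
        echo "quota project $projid"
    fi
}
```

For example, the output of the `lsattr -dp` command above could be piped into `projid_status` for the container's UpperDir.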

SkyperTHC avatar May 10 '23 06:05 SkyperTHC

Hi @SkyperTHC, thanks for filing the issue.

In general, Sysbox is mostly agnostic to the --storage-opt option, meaning that the higher-layer runtime (Docker / containerd) would set it on the container's filesystem and Sysbox need not be aware of it.

But since container tech is pretty complex under the covers, maybe Sysbox is doing something that breaks things.

Can you please provide info on your host? distro, kernel version, and underlying filesystem for /var/lib/docker.

Thanks!

ctalledo avatar May 10 '23 17:05 ctalledo

I created a brand new t2.small AWS instance.

# grep VERSION= /etc/os-release
VERSION="22.04.2 LTS (Jammy Jellyfish)"
# uname -r
5.15.0-1031-aws
# docker --version
Docker version 23.0.6, build ef23cbc
# mount | grep -F /var/lib/docker
/dev/loop5 on /var/lib/docker type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,prjquota)
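With the overlay2 driver, `--storage-opt size=` only works when the backing filesystem is xfs mounted with project quotas, which is what the `prjquota` flag in the mount line above indicates. A minimal sketch of that check follows; `has_prjquota` is a hypothetical helper name, and it assumes the `mount` output format shown in this thread.

```shell
# Hypothetical helper: succeeds if a line of `mount` output on stdin
# describes an xfs filesystem mounted with project quotas enabled.
has_prjquota() {
    # Match prjquota (or its alias pquota) as a whole option in the
    # parenthesised option list of an xfs mount.
    grep -Eq 'type xfs \(([^)]*,)?(prjquota|pquota)(,[^)]*)?\)'
}
```

It could be used as `mount | grep -F /var/lib/docker | has_prjquota && echo "project quotas enabled"`.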

Problem (should fail, but does not fail after 1mb)

# docker run --rm --runtime=sysbox-runc  --storage-opt size=1mb  alpine dd bs=1M count=2 if=/dev/zero of=/dump.zero
2+0 records in
2+0 records out

I've also tried making /var/lib an xfs filesystem (with prjquota) after noticing that sysbox creates some fuse mount points anchored in /var/lib/sysboxfs. It did not solve the problem.

My gut feeling is that it has something to do with how sysbox mounts to overlay2/*/diff as an intermediate layer (but as type xfs rather than overlay (?)).

SkyperTHC avatar May 11 '23 08:05 SkyperTHC

Thanks for the info @SkyperTHC.

Can you also provide the output of:

docker run --rm --runtime=sysbox-runc  alpine sh -c "mount | grep overlay"

I think we may have a bug that shows up in kernels < 5.19: the container's root filesystem is on overlayfs, but ID-mapping of overlayfs is not supported until kernel 5.19 (needed for files to show up with proper ownership inside the container).

Sysbox knows this, and to make things work on kernels < 5.19 it clones the container root filesystem from /var/lib/docker/overlay2/<uuid> -> /var/lib/sysbox/rootfs/<uuid>, and then performs a very fast chown of the files in there. This way they show up properly inside the container.

The bug I suspect occurs because the cloned container's root filesystem is not under the effect of the storage-opt limits. Your output from the command above will help me confirm this. If that's the case we need to fix it.

As a work-around, you would need to upgrade your host to kernel 5.19 or later. In kernel 5.19, ID-mapping works on top of overlayfs, which means Sysbox no longer needs to clone anything. Sysbox will detect this automatically and do the right thing.
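A rough way to tell whether a host falls into the affected range is to compare its kernel version against 5.19. This is only a sketch based on the version threshold stated above: `needs_rootfs_clone` is a hypothetical name, it is not how Sysbox itself makes the decision, and it cannot account for distro kernels with backported features.

```shell
# Hypothetical helper: succeeds (returns 0) if the given kernel release
# string is older than 5.19, i.e. the range where the rootfs is cloned
# according to the explanation above. Defaults to the running kernel.
needs_rootfs_clone() {
    ver="${1:-$(uname -r)}"
    major="${ver%%.*}"          # text before the first dot
    rest="${ver#*.}"            # text after the first dot
    minor="${rest%%[!0-9]*}"    # leading digits of the remainder
    if [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 19 ]; }; then
        return 1
    fi
    return 0
}
```

Running `needs_rootfs_clone && echo "kernel older than 5.19"` on the host gives a quick indication before testing `--storage-opt` with sysbox-runc.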

Hope that helps.

ctalledo avatar May 11 '23 17:05 ctalledo

# docker run --rm --runtime=sysbox-runc  alpine sh -c "mount | grep overlay"
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/docker/overlay2/l/HGUDI5PFBMUX3O4APPGLCSC57R:/var/lib/docker/overlay2/l/QZUK4YM5VAVLOAYE5OAKAO6NTV,upperdir=/var/lib/sysbox/rootfs/b0e9a7049fddcd2076dd2a88a1999c5176ed750e368ae4a0024d4499e744b5f6/overlay2/diff,workdir=/var/lib/sysbox/rootfs/b0e9a7049fddcd2076dd2a88a1999c5176ed750e368ae4a0024d4499e744b5f6/overlay2/work,metacopy=on)
/dev/loop5 on /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,prjquota)

I'll upgrade to 5.19 shortly.

SkyperTHC avatar May 11 '23 20:05 SkyperTHC

Thanks @SkyperTHC; that confirms Sysbox is cloning the rootfs (e.g., upperdir=/var/lib/sysbox/rootfs/...), so it means there is a bug where the cloning is dropping the storage-opt restrictions.
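The same check can be done mechanically by extracting `upperdir=` from the overlay mount line. The sketch below assumes the mount output format and the `/var/lib/sysbox/rootfs/` prefix seen in this thread; `overlay_upperdir` and `rootfs_is_cloned` are hypothetical helper names.

```shell
# Hypothetical helpers, assuming the `mount` output format shown above.

# Print the upperdir of the first overlay mount line read from stdin.
overlay_upperdir() {
    sed -n 's/.*[(,]upperdir=\([^,)]*\).*/\1/p' | head -n 1
}

# Succeed if the upperdir lives under /var/lib/sysbox/rootfs/, which in
# this thread indicates that the rootfs was cloned.
rootfs_is_cloned() {
    case "$(overlay_upperdir)" in
        /var/lib/sysbox/rootfs/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Inside a container, `mount | grep overlay | rootfs_is_cloned && echo cloned` would then flag the affected case.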

> I'll upgrade to 5.19 shortly.

Great, that should work around it until we fix it.

ctalledo avatar May 11 '23 20:05 ctalledo

Just to confirm that your suggestion to upgrade to 5.19 works. Thanks for the great work.

# uname -mrs
Linux 5.19.17-051917-generic x86_64
# docker run --rm --runtime=sysbox-runc  alpine sh -c "mount | grep overlay"
overlay on / type overlay (rw,relatime,lowerdir=/sf/docker/overlay2/l/5GWTCEJ6LCE4C64X5TVYJAIFOJ:/sf/docker/overlay2/l/BRFKV5OR6DEE3M7WLCI6HVGL3E,upperdir=/sf/docker/overlay2/d211e013d6f80fc01121e3df5d43e98e99f6ab226dd7f6d8ba3437f9e4106050/diff,workdir=/sf/docker/overlay2/d211e013d6f80fc01121e3df5d43e98e99f6ab226dd7f6d8ba3437f9e4106050/work)
/dev/sda2 on /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs type ext4 (rw,relatime,idmapped)

The write test then fails after 1mb, as it should:

# docker run --rm --runtime=sysbox-runc  --storage-opt size=1mb  alpine dd bs=1M count=2 if=/dev/zero of=/dump.zero
dd: error writing '/dump.zero': No space left on device
1+0 records in
0+0 records out

SkyperTHC avatar May 11 '23 21:05 SkyperTHC

> Just to confirm that your suggestion to upgrade to 5.19 works. Thanks for the great work.

Awesome, thanks for confirming. Let's leave this issue open to track the fix for kernels < 5.19.

Thanks again for giving Sysbox a shot and reporting the issues!

ctalledo avatar May 11 '23 21:05 ctalledo

Same issue here. I'm using Ubuntu Server 22.04 LTS which comes with kernel version 5.15.

mhemrg avatar Dec 10 '23 07:12 mhemrg