support v3 security.capability

Open cyphar opened this issue 7 years ago • 1 comments

Currently we basically ignore the semantics of different xattrs. This needs to change because of the "new" v3 security.capability changes. There are a few things we need to handle now that we didn't before:

1. On extraction, we can now in theory write security.capability as an unprivileged user, and for a user namespace mapping. However, this requires being inside a user namespace (so we'll have to fork+unshare) and writing v3 capabilities to disk, which currently means a libcap dependency and plenty of cgo code.
2. On extraction, we also have to deal with the fact that v3 capabilities encode the rootuid of the capability. If we see a v3 capability in an image, we have to remap its rootuid to whatever user namespace mapping we are using before we write it to disk (see the previous point for how we'd have to write it). LXD already has code for this (github.com/lxc/lxd/shared/idmap), but we might need to fork it.
3. On repacking, we have to make sure we don't embed v3 capabilities in our images. This avoids problems with bad image unpackers (cough Docker) and with older distributions that cannot use v3 capabilities. Luckily, on unpacking, the kernel translates v2 capabilities to v3 capabilities (when they are written from inside a user namespace), so we can repack v3 capabilities as v2 capabilities (just stripping the rootuid) and sidestep a whole host of issues. There is an argument that we shouldn't do this if the rootuid of the filesystem differs from the rootuid of the current mapping (because the user might explicitly want v3 capabilities in that case), but we can detect that case pretty easily.
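
To make the remap and repack steps in the list above concrete, here is a rough Go sketch that works directly on the raw xattr bytes. It is only an illustration under stated assumptions: the function names (`withRootID`, `v3ToV2`, `repack`) are hypothetical and not from any existing codebase, the sizes and magic values are taken from the kernel uAPI header, and `mappingRoot` is assumed to be the host uid that container root maps to.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Values mirror the kernel's VFS_CAP_* uAPI constants.
const (
	revisionMask = 0xFF000000
	revision2    = 0x02000000
	revision3    = 0x03000000
	sizeV2       = 20 // XATTR_CAPS_SZ_2
	sizeV3       = 24 // XATTR_CAPS_SZ_3: v2 layout plus a trailing rootid
)

var errNotV3 = errors.New("not a v3 security.capability value")

// isV3 reports whether b has the size and revision of a v3 value.
func isV3(b []byte) bool {
	return len(b) == sizeV3 &&
		binary.LittleEndian.Uint32(b)&revisionMask == revision3
}

// withRootID returns a copy of a v3 value with its rootid replaced.
// This is the remap step on extraction (hypothetical helper).
func withRootID(v3 []byte, newRoot uint32) ([]byte, error) {
	if !isV3(v3) {
		return nil, errNotV3
	}
	out := append([]byte(nil), v3...)
	binary.LittleEndian.PutUint32(out[sizeV2:], newRoot)
	return out, nil
}

// v3ToV2 drops the trailing rootid and rewrites the revision to 2,
// keeping the flag bits. It also returns the rootid that was dropped.
func v3ToV2(v3 []byte) ([]byte, uint32, error) {
	if !isV3(v3) {
		return nil, 0, errNotV3
	}
	magic := binary.LittleEndian.Uint32(v3)
	rootid := binary.LittleEndian.Uint32(v3[sizeV2:])
	v2 := append([]byte(nil), v3[:sizeV2]...)
	binary.LittleEndian.PutUint32(v2, magic&^revisionMask|revision2)
	return v2, rootid, nil
}

// repack downgrades a v3 value to v2 only when its rootid matches the
// root of the current mapping; anything else is returned unchanged.
func repack(value []byte, mappingRoot uint32) []byte {
	v2, rootid, err := v3ToV2(value)
	if err != nil || rootid != mappingRoot {
		return value
	}
	return v2
}

func main() {
	v3 := make([]byte, sizeV3)
	binary.LittleEndian.PutUint32(v3[0:], revision3|0x1) // effective bit set
	binary.LittleEndian.PutUint32(v3[4:], 1<<10)         // CAP_NET_BIND_SERVICE
	binary.LittleEndian.PutUint32(v3[sizeV2:], 100000)   // example rootid
	fmt.Printf("repacked: %x\n", repack(v3, 100000))
	fmt.Printf("kept v3:  %x\n", repack(v3, 0))
}
```

Leaving mismatched values untouched is one way to handle the "user explicitly wants v3" case; whether to keep them, warn, or fail is a policy decision this sketch does not settle.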

Sep 11 '18 05:09 cyphar

For privileged unpacking, this could be quite trivial because the uAPI for v2 and v3 caps is quite straightforward:

#define VFS_CAP_REVISION_MASK	0xFF000000
#define VFS_CAP_REVISION_SHIFT	24
#define VFS_CAP_FLAGS_MASK	~VFS_CAP_REVISION_MASK
#define VFS_CAP_FLAGS_EFFECTIVE	0x000001

#define VFS_CAP_REVISION_1	0x01000000
#define VFS_CAP_U32_1           1
#define XATTR_CAPS_SZ_1         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_1))

#define VFS_CAP_REVISION_2	0x02000000
#define VFS_CAP_U32_2           2
#define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))

#define VFS_CAP_REVISION_3	0x03000000
#define VFS_CAP_U32_3           2
#define XATTR_CAPS_SZ_3         (sizeof(__le32)*(2 + 2*VFS_CAP_U32_3))

#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_3
#define VFS_CAP_U32             VFS_CAP_U32_3
#define VFS_CAP_REVISION	VFS_CAP_REVISION_3

struct vfs_cap_data {
	__le32 magic_etc;            /* Little endian */
	struct {
		__le32 permitted;    /* Little endian */
		__le32 inheritable;  /* Little endian */
	} data[VFS_CAP_U32];
};

/*
 * same as vfs_cap_data but with a rootid at the end
 */
struct vfs_ns_cap_data {
	__le32 magic_etc;
	struct {
		__le32 permitted;    /* Little endian */
		__le32 inheritable;  /* Little endian */
	} data[VFS_CAP_U32];
	__le32 rootid;
};

static __u32 sansflags(__u32 m)
{
	return m & ~VFS_CAP_FLAGS_EFFECTIVE;
}

static bool is_v2header(int size, const struct vfs_cap_data *cap)
{
	if (size != XATTR_CAPS_SZ_2)
		return false;
	return sansflags(le32_to_cpu(cap->magic_etc)) == VFS_CAP_REVISION_2;
}

static bool is_v3header(int size, const struct vfs_cap_data *cap)
{
	if (size != XATTR_CAPS_SZ_3)
		return false;
	return sansflags(le32_to_cpu(cap->magic_etc)) == VFS_CAP_REVISION_3;
}
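
For reference, the same header checks could be written in pure Go without libcap. This is a minimal sketch, assuming the xattr value has already been read into a byte slice; the `fileCaps` type and `parseFileCaps` function are hypothetical names for illustration, and the layout follows the structs quoted above (little-endian, two 32-bit words per capability set, rootid last for v3).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Values mirror the kernel's VFS_CAP_* uAPI constants.
const (
	flagEffective = 0x000001
	revision2     = 0x02000000
	revision3     = 0x03000000
	sizeV2        = 20 // XATTR_CAPS_SZ_2
	sizeV3        = 24 // XATTR_CAPS_SZ_3
)

// fileCaps is a decoded security.capability value (hypothetical type).
type fileCaps struct {
	Revision    int
	Effective   bool
	Permitted   uint64
	Inheritable uint64
	RootID      uint32 // only meaningful when Revision == 3
}

// parseFileCaps decodes a v2 or v3 value, applying the same size and
// magic checks as is_v2header and is_v3header.
func parseFileCaps(b []byte) (fileCaps, error) {
	if len(b) != sizeV2 && len(b) != sizeV3 {
		return fileCaps{}, fmt.Errorf("unexpected xattr size %d", len(b))
	}
	le := binary.LittleEndian
	magic := le.Uint32(b[0:4])
	c := fileCaps{Effective: magic&flagEffective != 0}

	// Like sansflags(): ignore only the effective bit when comparing.
	switch rev := magic &^ flagEffective; {
	case len(b) == sizeV2 && rev == revision2:
		c.Revision = 2
	case len(b) == sizeV3 && rev == revision3:
		c.Revision = 3
		c.RootID = le.Uint32(b[20:24])
	default:
		return fileCaps{}, fmt.Errorf("size %d does not match magic %#x", len(b), magic)
	}

	// data[0] holds the low 32 bits of each set, data[1] the high 32 bits.
	c.Permitted = uint64(le.Uint32(b[4:8])) | uint64(le.Uint32(b[12:16]))<<32
	c.Inheritable = uint64(le.Uint32(b[8:12])) | uint64(le.Uint32(b[16:20]))<<32
	return c, nil
}

func main() {
	raw := make([]byte, sizeV3)
	binary.LittleEndian.PutUint32(raw[0:], revision3|flagEffective)
	binary.LittleEndian.PutUint32(raw[4:], 1<<10) // CAP_NET_BIND_SERVICE
	binary.LittleEndian.PutUint32(raw[20:], 100000)
	c, err := parseFileCaps(raw)
	fmt.Printf("%+v %v\n", c, err)
}
```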

But for rootless mode we would need to call lsetxattr inside a rootless userns. Unfortunately, doing this in Go would require cgo (even with re-exec) because Go doesn't support newuidmap/newgidmap...
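
To illustrate the limitation, here is a minimal sketch of what Go's standard library does offer: re-executing in a new user namespace via `syscall.SysProcAttr`. As far as I can tell, the mappings are written to `uid_map`/`gid_map` directly rather than through the setuid `newuidmap`/`newgidmap` helpers, so an unprivileged caller is limited to mapping its own uid and gid. The `usernsCommand` function and the helper argument are hypothetical, and the command is only constructed here, not run.

```go
//go:build linux

package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
)

// usernsCommand builds a command that would start in a new user
// namespace with the caller's uid and gid mapped to root inside it.
// Only single-ID mappings are used, since that is all an unprivileged
// process can set up without the newuidmap/newgidmap helpers.
func usernsCommand(path string, args ...string) *exec.Cmd {
	cmd := exec.Command(path, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER,
		UidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: os.Getuid(), Size: 1},
		},
		GidMappings: []syscall.SysProcIDMap{
			{ContainerID: 0, HostID: os.Getgid(), Size: 1},
		},
		// Unprivileged gid mappings require setgroups to be denied.
		GidMappingsEnableSetgroups: false,
	}
	return cmd
}

func main() {
	// Only build the command; running it would need a helper
	// subcommand that this sketch does not define.
	cmd := usernsCommand("/proc/self/exe", "hypothetical-setxattr-helper")
	m := cmd.SysProcAttr.UidMappings[0]
	fmt.Printf("uid map: %d -> %d (size %d)\n", m.ContainerID, m.HostID, m.Size)
}
```

A mapping like this is enough to write a capability for a single-uid rootless setup, but not for the multi-ID subuid/subgid ranges that the helpers grant, which is the case the comment above is concerned with.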

May 24 '25 08:05 cyphar