support v3 security.capability
Currently we basically ignore the semantics of different xattrs. This needs to change because of the
"new" v3 security.capability changes. There are a few things we need to handle now that we didn't before:
- On extraction we can now theoretically extract security.capability as an unprivileged user, and for a user namespace mapping. However, this requires being inside a user namespace (so we'll have to fork+unshare) as well as requiring us to write v3 capabilities to disk, which currently will require a libcap dependency as well as plenty of cgo code.
- On extraction we also have to deal with the fact that v3 capabilities now encode the rootuid of the capability. If we see a v3 capability in an image we will have to remap it to whatever user namespace mapping we are using before we write it to disk (see previous point about how we'd have to go about writing it to disk). LXD has code for this already (github.com/lxc/lxd/shared/idmap) but we might need to fork it.
- On repacking we have to make sure we don't embed v3 capabilities inside our images. This is to avoid bad image unpackers (cough Docker) as well as older distributions being unable to use our images. Luckily, on unpacking, v2 capabilities are translated in the kernel to v3 capabilities -- so we can just repack v3 capabilities as v2 capabilities (just stripping the rootuid) in order to get around a whole host of issues. There is an argument that we shouldn't do this if the rootuid of the filesystem is different to the rootuid of the current mapping we are using (because the user might explicitly want to have v3 capabilities in that case) but we can detect that case pretty easily.
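To make the repacking point concrete, here is a rough Go sketch of the v3-to-v2 downgrade, assuming the xattr value is already available as a byte slice. The helper name downgradeV3ToV2 and its mappedRoot parameter are hypothetical (nothing like this exists in the codebase yet), and how mappedRoot is derived from the current mapping is left out. The constants mirror the kernel uAPI values.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Values mirror the kernel uAPI (linux/capability.h).
const (
	vfsCapRevisionMask = 0xFF000000
	vfsCapRevision2    = 0x02000000
	vfsCapRevision3    = 0x03000000
	xattrCapsSzV2      = 4 * (1 + 2*2) // 20 bytes: magic_etc + 2 data pairs
	xattrCapsSzV3      = 4 * (2 + 2*2) // 24 bytes: the v2 layout + rootid
)

// errRootMismatch signals the case where the v3 capability should
// arguably be kept as-is rather than downgraded.
var errRootMismatch = errors.New("rootid differs from the mapped root")

// downgradeV3ToV2 is a hypothetical helper: it rewrites a v3
// security.capability value as v2 by dropping the trailing rootid.
// mappedRoot is an assumption: the root ID the current mapping expects.
func downgradeV3ToV2(xattr []byte, mappedRoot uint32) ([]byte, error) {
	if len(xattr) != xattrCapsSzV3 {
		return nil, errors.New("not a v3 capability: unexpected size")
	}
	magic := binary.LittleEndian.Uint32(xattr[0:4])
	if magic&vfsCapRevisionMask != vfsCapRevision3 {
		return nil, errors.New("not a v3 capability: unexpected revision")
	}
	rootid := binary.LittleEndian.Uint32(xattr[xattrCapsSzV2:])
	if rootid != mappedRoot {
		return nil, errRootMismatch
	}
	// Copy everything except the rootid, then swap the revision while
	// keeping the flag bits (e.g. VFS_CAP_FLAGS_EFFECTIVE).
	v2 := make([]byte, xattrCapsSzV2)
	copy(v2, xattr[:xattrCapsSzV2])
	flags := magic &^ vfsCapRevisionMask
	binary.LittleEndian.PutUint32(v2[0:4], vfsCapRevision2|flags)
	return v2, nil
}

func main() {
	// A made-up v3 value: effective flag set, one permitted bit, rootid 1000.
	v3 := make([]byte, xattrCapsSzV3)
	binary.LittleEndian.PutUint32(v3[0:], vfsCapRevision3|0x1)
	binary.LittleEndian.PutUint32(v3[4:], 0x2000)
	binary.LittleEndian.PutUint32(v3[20:], 1000)

	v2, err := downgradeV3ToV2(v3, 1000)
	if err != nil {
		fmt.Println("keeping v3:", err)
		return
	}
	fmt.Printf("%x\n", v2)
}
```

Returning a distinct error for the mismatch case lets the caller decide whether to keep the v3 capability or refuse to repack, which is the "we can detect that case" part of the last point.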
For privileged unpacking, this could be quite trivial because the uAPI for v2 and v3 caps is quite straightforward:
#define VFS_CAP_REVISION_MASK 0xFF000000
#define VFS_CAP_REVISION_SHIFT 24
#define VFS_CAP_FLAGS_MASK ~VFS_CAP_REVISION_MASK
#define VFS_CAP_FLAGS_EFFECTIVE 0x000001

#define VFS_CAP_REVISION_1 0x01000000
#define VFS_CAP_U32_1 1
#define XATTR_CAPS_SZ_1 (sizeof(__le32)*(1 + 2*VFS_CAP_U32_1))

#define VFS_CAP_REVISION_2 0x02000000
#define VFS_CAP_U32_2 2
#define XATTR_CAPS_SZ_2 (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))

#define VFS_CAP_REVISION_3 0x03000000
#define VFS_CAP_U32_3 2
#define XATTR_CAPS_SZ_3 (sizeof(__le32)*(2 + 2*VFS_CAP_U32_3))

#define XATTR_CAPS_SZ XATTR_CAPS_SZ_3
#define VFS_CAP_U32 VFS_CAP_U32_3
#define VFS_CAP_REVISION VFS_CAP_REVISION_3

struct vfs_cap_data {
    __le32 magic_etc; /* Little endian */
    struct {
        __le32 permitted; /* Little endian */
        __le32 inheritable; /* Little endian */
    } data[VFS_CAP_U32];
};

/*
 * same as vfs_cap_data but with a rootid at the end
 */
struct vfs_ns_cap_data {
    __le32 magic_etc;
    struct {
        __le32 permitted; /* Little endian */
        __le32 inheritable; /* Little endian */
    } data[VFS_CAP_U32];
    __le32 rootid;
};

static __u32 sansflags(__u32 m)
{
    return m & ~VFS_CAP_FLAGS_EFFECTIVE;
}

static bool is_v2header(int size, const struct vfs_cap_data *cap)
{
    if (size != XATTR_CAPS_SZ_2)
        return false;
    return sansflags(le32_to_cpu(cap->magic_etc)) == VFS_CAP_REVISION_2;
}

static bool is_v3header(int size, const struct vfs_cap_data *cap)
{
    if (size != XATTR_CAPS_SZ_3)
        return false;
    return sansflags(le32_to_cpu(cap->magic_etc)) == VFS_CAP_REVISION_3;
}
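For the privileged case, the header checks above translate fairly directly to pure Go. The following is only a sketch: capVersion, idMap and remapRootID are hypothetical names, and idMap is a simplified single-range stand-in rather than the LXD idmap API. It shows how a v3 rootid from an image could be translated to a host ID before the value is handed to lsetxattr.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Values mirror the kernel uAPI (linux/capability.h).
const (
	vfsCapFlagsEffective = 0x000001
	vfsCapRevision2      = 0x02000000
	vfsCapRevision3      = 0x03000000
	xattrCapsSzV2        = 4 * (1 + 2*2) // 20 bytes
	xattrCapsSzV3        = 4 * (2 + 2*2) // 24 bytes
)

// capVersion mirrors is_v2header/is_v3header: it returns 2 or 3 for a
// recognised header and 0 for anything else.
func capVersion(xattr []byte) int {
	if len(xattr) < 4 {
		return 0
	}
	magic := binary.LittleEndian.Uint32(xattr) &^ vfsCapFlagsEffective
	switch {
	case len(xattr) == xattrCapsSzV2 && magic == vfsCapRevision2:
		return 2
	case len(xattr) == xattrCapsSzV3 && magic == vfsCapRevision3:
		return 3
	}
	return 0
}

// idMap is a hypothetical single-range ID mapping (container -> host).
type idMap struct {
	containerID, hostID, size uint32
}

// toHost translates a container ID, reporting whether it is covered.
func (m idMap) toHost(id uint32) (uint32, bool) {
	if id < m.containerID || id-m.containerID >= m.size {
		return 0, false
	}
	return m.hostID + (id - m.containerID), true
}

// remapRootID returns a copy of a v3 xattr value with its rootid
// translated through m. The input slice is not modified.
func remapRootID(xattr []byte, m idMap) ([]byte, error) {
	if capVersion(xattr) != 3 {
		return nil, fmt.Errorf("not a v3 capability xattr")
	}
	rootid := binary.LittleEndian.Uint32(xattr[xattrCapsSzV2:])
	hostID, ok := m.toHost(rootid)
	if !ok {
		return nil, fmt.Errorf("rootid %d is not covered by the mapping", rootid)
	}
	out := append([]byte(nil), xattr...)
	binary.LittleEndian.PutUint32(out[xattrCapsSzV2:], hostID)
	return out, nil
}

func main() {
	// A made-up v3 value with rootid 0 and an example 65536-wide mapping.
	v3 := make([]byte, xattrCapsSzV3)
	binary.LittleEndian.PutUint32(v3[0:], vfsCapRevision3|vfsCapFlagsEffective)
	out, err := remapRootID(v3, idMap{containerID: 0, hostID: 100000, size: 65536})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(capVersion(out), binary.LittleEndian.Uint32(out[xattrCapsSzV2:]))
}
```

Nothing here needs libcap or cgo, since the on-disk format is a fixed little-endian layout; the remaining work is the lsetxattr call itself.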
But for rootless mode we would need to do lsetxattr inside a rootless userns. Unfortunately, doing this in Go would require CGo (even with re-exec) because Go doesn't support newuidmap/newgidmap...