janitor
janitor copied to clipboard
Enable debugger support for containers
(untested)
Thanks a lot for working on enabling rr
in Janitor containers!
@jld could you please confirm that adding CAP_SYS_PTRACE
and using seccomp=unconfined
on all our Docker containers isn't too risky from a security standpoint? (Taking into account that our users are sudo
in their containers, but we don't want them to be root
in our Docker servers.)
@rocallahan does adding CAP_SYS_PTRACE
and using seccomp=unconfined
seem correct for allowing rr
to work inside Docker containers? (Assuming that /proc/sys/kernel/perf_event_paranoid
is already 1
on the server.)
does adding CAP_SYS_PTRACE and using seccomp=unconfined seem correct for allowing rr to work inside Docker containers?
Yes.
Before we move forward with this, I'd like to be certain that adding CAP_SYS_PTRACE
and using seccomp=unconfined
on all Janitor's Docker containers isn't too risky from a security standpoint.
Ping @jld ?
I'd really like to find a solution where we only enable the syscalls actually needed for rr
, instead of all 44 potentially dangerous syscalls that are blocked by default.
The default seccomp
profile includes these lines:
{
"names": [
"kcmp",
"process_vm_readv",
"process_vm_writev",
"ptrace"
],
"action": "SCMP_ACT_ALLOW",
"args": [],
"comment": "",
"includes": {
"caps": [
"CAP_SYS_PTRACE"
]
},
"excludes": {}
}
Am I interpreting these correctly if I conclude that adding CAP_SYS_PTRACE
will also make seccomp
allow the syscalls ptrace
, kcmp
, process_vm_readv
and process_vm_writev
? EDIT: No, these syscalls are blocked by both seccomp and the dropped SYS_PTRACE
.
@Martiusweb speculates that rr
only needs ptrace
, perf_event_open
and maybe also userfaultfd
, and he suggests that we try enabling these first, and iterate on any missing syscalls afterwards.
My preferred outcome would be to fork the default seccomp
profile to make rr
work, and propose our changes upstream or rebase our fork from time to time.
@Martiusweb speculates that rr only needs ptrace, perf_event_open and maybe also userfaultfd, and he suggests that we try enabling these first, and iterate on any missing syscalls afterwards.
That should work.
We don't need userfaultfd
.
So here are the syscalls used by rr
during a very simple record & replay session, and their seccomp
specificities if any:
access: (allowed)
arch_prctl: "arches": [ "amd64", "x32" ]
bind: (allowed)
brk: (allowed)
chmod: (allowed)
clone: "caps": [ "CAP_SYS_ADMIN" ]
close: (allowed)
execve: (allowed)
exit_group: (allowed)
fcntl: (allowed)
fstat: (allowed)
ftruncate: (allowed)
futex: (allowed)
getrlimit: (allowed)
getsockopt: (allowed)
getxattr: (allowed)
ioctl: (allowed)
kill: (allowed)
link: (allowed)
listen: (allowed)
lseek: (allowed)
lstat: (allowed)
mkdir: (allowed)
mmap: (allowed)
mprotect: (allowed)
munmap: (allowed)
open: (allowed)
perf_event_open: "caps": [ "CAP_SYS_ADMIN" ]
pipe2: (allowed)
poll: (allowed)
prctl: (allowed)
pread64: (allowed)
ptrace: "caps": [ "CAP_SYS_PTRACE" ]
pwrite64: (allowed)
read: (allowed)
readlink: (allowed)
recvmsg: (allowed)
rename: (allowed)
rt_sigaction: (allowed)
rt_sigprocmask: (allowed)
rt_sigreturn: (allowed)
sched_setaffinity: (allowed)
set_robust_list: (allowed)
set_tid_address: (allowed)
setitimer: (allowed)
setrlimit: (allowed)
socket: (allowed)
stat: (allowed)
statfs: (allowed)
symlink: (allowed)
syscall_446: ???
uname: (allowed)
unlink: (allowed)
wait4: (allowed)
write: (allowed)
So I conclude that to make rr
work inside Docker containers, we need to enable the following 3 syscalls (annotated by the reason Docker blocks it by default):
-
clone
: Deny cloning new namespaces. Also gated byCAP_SYS_ADMIN
forCLONE_*
flags, exceptCLONE_USERNS
. -
perf_event_open
: Tracing/profiling syscall, which could leak a lot of information on the host. -
ptrace
: Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by droppingCAP_PTRACE
.
(I have no idea what syscall_446
is all about.)
I don't think you need to worry about clone
. rr does not create new namespaces (though it supports recording children that do). It uses clone
to create processes and threads, but I'm sure your seccomp policy would allow that.
Don't worry about 446 (and rr's other internal syscalls, 442 to 447). rr uses them internally and I think blocking them via seccomp will not affect rr's usage.
Though you might need to whitelist them.
Awesome, thanks @rocallahan for clarifying!
I think we can reasonably tweak Docker's default seccomp
profile to authorize perf_event_open
and ptrace
, in addition to adding CAP_SYS_PTRACE
.
There is a large obstacle here: passing custom seccomp profile requires a file on the Docker host, but currently we operate remotely without an agent.
There is a large obstacle here: passing custom seccomp profile requires a file on the Docker host, but currently we operate remotely without an agent.
Well, we do have a Janitor clone on each Docker host, which we use to properly configure Docker (and also run an authenticated Docker proxy). Maybe we can commit a custom seccomp
profile somewhere in the Janitor repository, update our host's checkouts, and point our Docker daemons to that file?
@jankeromnes has there been any movement on this? Having rr support would be fantastic, but what I currently really need is the ability to set breakpoints in gdb, which unfortunately also requires ptrace
and at least some seccomp
changes.
@tschneidereit Hello!
We can already add CAP_SYS_PTRACE
as a first step.
Unfortunately, we're still blocked on creating a fork of Docker's default seccomp profile which just enables these two additional syscalls:
-
perf_event_open
-
ptrace
Note: This pull request would indeed allow rr
to work (manually tested with @whimboo), but I believe that using seccomp=unconfined
(i.e. completely disabling seccomp) is too risky from a security standpoint. (But maybe I'm wrong, e.g. I believe that Kubernetes disables Docker's seccomp altogether, or did at some point.)
hi @jankeromnes , this is the most relevant resource on the nets on how safe SYS_PTRACE is for containers, or so it seems.
Kubernetes indeed allows to manually allow / apply any capabilities to either profiles or directly to pods/containers, bypassing naughty Docker.
But, is there a definitive answer on the safety of allowing SYS_PTRACE for all containers of a cluster? Are there any security risks? Also, is there any difference on CAP_SYS_PTRACE and SYS_PTRACE?
Hi @nikopen,
this is the most relevant resource on the nets on how safe SYS_PTRACE is for containers, or so it seems.
Hm, that's quite unfortunate, as it seems nobody here really understands the security implications of enabling these capabilities and syscalls in third-party controlled containers on shared infrastructure.
is there any difference on CAP_SYS_PTRACE and SYS_PTRACE?
No, I believe these are the same thing (i.e. the capability or unit of Linux superuser privileges called "PTRACE").
Kubernetes indeed allows to manually allow / apply any capabilities to either profiles or directly to pods/containers, bypassing naughty Docker.
Thank you for this information. I guess this means that Kubernetes disables Docker's seccomp because it already has its own permission checking mechanism (presumably with similar "safe defaults" regarding allowed or forbidden capabilities, in which case I'd love to know if they allow the CAP_SYS_PTRACE
capability or the perf_event_open
and ptrace
syscalls by default).
But, is there a definitive answer on the safety of allowing SYS_PTRACE for all containers of a cluster? Are there any security risks?
Unfortunately, we don't have a definitive answer for this. From my (limited) understanding, it seems that in the past, someone made a proof of concept using the PTRACE capability to "escape" a Docker container. I also believe (without proof) that this vulnerability has since been fixed, because the security of Docker's unprivileged containers has been improved, and I'm not aware of any recent PTRACE-related container escapes. However, I could be awfully wrong here, and I'd love to hear from someone more knowledgeable about this.
Thank you for this information. I guess this means that Kubernetes disables Docker's seccomp because it already has its own permission checking mechanism (presumably with similar "safe defaults" regarding allowed or forbidden capabilities, in which case I'd love to know if they allow the
CAP_SYS_PTRACE
capability or theperf_event_open
andptrace
syscalls by default).
Indeed, k8s disables Docker's seccomp, and it can be re-enabled via a specific annotation. The default ServiceAccount on an unconfigured 'vanilla' k8s installation is very open - allows pretty much everything, and then granular capabilities can be defined either via securityContext in a pod/deployment or via PodSecurityPolicies, to have unprivileged pods but allowing SYS_PTRACE et cetera.
Unfortunately, we don't have a definitive answer for this. From my (limited) understanding, it seems that in the past, someone made a proof of concept using the PTRACE capability to "escape" a Docker container. I also believe (without proof) that this vulnerability has since been fixed, because the security of Docker's unprivileged containers has been improved, and I'm not aware of any recent PTRACE-related container escapes. However, I could be awfully wrong here, and I'd love to hear from someone more knowledgeable about this.
That's interesting, is it related with this commit? https://github.com/torvalds/linux/commit/58d0a862f573c3354fa912603ef5a4db188774e7 I can see it entered the 4.8 kernel. I could mail the k8s security team to learn more about this domain, maybe they have something easily reachable to share.
That's interesting, is it related with this commit? torvalds/linux@58d0a86
As the commit message says, in the past a process could use ptrace() to negate any seccomp() policy imposed on it, by changing the syscall number after the seccomp policy had run. Therefore if you imposed any seccomp policy at all, you had to block ptrace() as well. That particular issue is fixed since 4.8 so that is no longer a reason to block ptrace() on those kernels.
Another reason to block ptrace() is that it's just a really complicated API which provides a considerable amount of attack surface through which an attacker might be able to exploit zero-day kernel bugs. It makes sense that if your containers don't need ptrace(), you should block it. This will always be true I guess.
I don't know of any bugs or features of ptrace() that enable container escapes or other security policy violations. I think if any were known, they would be treated as security bugs and fixed. At this point I think it should be treated like any other API that is thought to be safe but adds significant attack surface.
Update:
Docker's default seccomp profile now seems to authorize the ptrace
syscall if kernel >= 4.8
:
{
"names": [
"ptrace"
],
"action": "SCMP_ACT_ALLOW",
"args": null,
"comment": "",
"includes": {
"minKernel": "4.8"
},
"excludes": {}
},
(Source.)
If my interpretation is correct, this means that in order to allow the ptrace
syscall in Docker containers, you no longer need a custom seccomp
profile -- just a recent Docker (that has https://github.com/moby/moby/commit/1124543ca8071074a537a15db251af46a5189907) and Linux kernel (>= 4.8
).
However, as far as I know, you still need CAP_SYS_PTRACE
for ptrace
to actually work in containers.
Also, a custom seccomp
profile is still required here, because rr
also needs perf_event_open
(which is still blocked by Docker's default seccomp
profile).