janitor icon indicating copy to clipboard operation
janitor copied to clipboard

Enable debugger support for containers

Open ishitatsuyuki opened this issue 6 years ago • 18 comments

(untested)

ishitatsuyuki avatar Apr 03 '18 13:04 ishitatsuyuki

Thanks a lot for working on enabling rr in Janitor containers!

@jld could you please confirm that adding CAP_SYS_PTRACE and using seccomp=unconfined on all our Docker containers isn't too risky from a security standpoint? (Taking into account that our users are sudo in their containers, but we don't want them to be root in our Docker servers.)

@rocallahan does adding CAP_SYS_PTRACE and using seccomp=unconfined seem correct for allowing rr to work inside Docker containers? (Assuming that /proc/sys/kernel/perf_event_paranoid is already 1 on the server.)

jankeromnes avatar Apr 03 '18 15:04 jankeromnes

does adding CAP_SYS_PTRACE and using seccomp=unconfined seem correct for allowing rr to work inside Docker containers?

Yes.

rocallahan avatar Apr 03 '18 19:04 rocallahan

Before we move forward with this, I'd like to be certain that adding CAP_SYS_PTRACE and using seccomp=unconfined on all Janitor's Docker containers isn't too risky from a security standpoint.

Ping @jld ?

jankeromnes avatar Apr 06 '18 17:04 jankeromnes

I'd really like to find a solution where we only enable the syscalls actually needed for rr, instead of all 44 potentially dangerous syscalls that are blocked by default.

The default seccomp profile includes these lines:

		{
			"names": [
				"kcmp",
				"process_vm_readv",
				"process_vm_writev",
				"ptrace"
			],
			"action": "SCMP_ACT_ALLOW",
			"args": [],
			"comment": "",
			"includes": {
				"caps": [
					"CAP_SYS_PTRACE"
				]
			},
			"excludes": {}
		}

Am I interpreting these correctly if I conclude that adding CAP_SYS_PTRACE will also make seccomp allow the syscalls ptrace, kcmp, process_vm_readv and process_vm_writev? EDIT: No, these syscalls are blocked by both seccomp and the dropped SYS_PTRACE.

@Martiusweb speculates that rr only needs ptrace, perf_event_open and maybe also userfaultfd, and he suggests that we try enabling these first, and iterate on any missing syscalls afterwards.

My preferred outcome would be to fork the default seccomp profile to make rr work, and propose our changes upstream or rebase our fork from time to time.

jankeromnes avatar Apr 27 '18 06:04 jankeromnes

@Martiusweb speculates that rr only needs ptrace, perf_event_open and maybe also userfaultfd, and he suggests that we try enabling these first, and iterate on any missing syscalls afterwards.

That should work.

We don't need userfaultfd.

rocallahan avatar Apr 27 '18 06:04 rocallahan

So here are the syscalls used by rr during a very simple record & replay session, and their seccomp specificities if any:

access: (allowed)
arch_prctl: "arches": [ "amd64", "x32" ]
bind: (allowed)
brk: (allowed)
chmod: (allowed)
clone: "caps": [ "CAP_SYS_ADMIN" ]
close: (allowed)
execve: (allowed)
exit_group: (allowed)
fcntl: (allowed)
fstat: (allowed)
ftruncate: (allowed)
futex: (allowed)
getrlimit: (allowed)
getsockopt: (allowed)
getxattr: (allowed)
ioctl: (allowed)
kill: (allowed)
link: (allowed)
listen: (allowed)
lseek: (allowed)
lstat: (allowed)
mkdir: (allowed)
mmap: (allowed)
mprotect: (allowed)
munmap: (allowed)
open: (allowed)
perf_event_open: "caps": [ "CAP_SYS_ADMIN" ]
pipe2: (allowed)
poll: (allowed)
prctl: (allowed)
pread64: (allowed)
ptrace: "caps": [ "CAP_SYS_PTRACE" ]
pwrite64: (allowed)
read: (allowed)
readlink: (allowed)
recvmsg: (allowed)
rename: (allowed)
rt_sigaction: (allowed)
rt_sigprocmask: (allowed)
rt_sigreturn: (allowed)
sched_setaffinity: (allowed)
set_robust_list: (allowed)
set_tid_address: (allowed)
setitimer: (allowed)
setrlimit: (allowed)
socket: (allowed)
stat: (allowed)
statfs: (allowed)
symlink: (allowed)
syscall_446: ???
uname: (allowed)
unlink: (allowed)
wait4: (allowed)
write: (allowed)

So I conclude that to make rr work inside Docker containers, we need to enable the following 3 syscalls (annotated by the reason Docker blocks it by default):

  • clone: Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_USERNS.
  • perf_event_open: Tracing/profiling syscall, which could leak a lot of information on the host.
  • ptrace: Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping CAP_PTRACE.

(I have no idea what syscall_446 is all about.)

jankeromnes avatar Apr 27 '18 06:04 jankeromnes

I don't think you need to worry about clone. rr does not create new namespaces (though it supports recording children that do). It uses clone to create processes and threads, but I'm sure your seccomp policy would allow that.

Don't worry about 446 (and rr's other internal syscalls, 442 to 447). rr uses them internally and I think blocking them via seccomp will not affect rr's usage.

rocallahan avatar Apr 27 '18 07:04 rocallahan

Though you might need to whitelist them.

rocallahan avatar Apr 27 '18 07:04 rocallahan

Awesome, thanks @rocallahan for clarifying!

I think we can reasonably tweak Docker's default seccomp profile to authorize perf_event_open and ptrace, in addition to adding CAP_SYS_PTRACE.

jankeromnes avatar Apr 27 '18 07:04 jankeromnes

There is a large obstacle here: passing custom seccomp profile requires a file on the Docker host, but currently we operate remotely without an agent.

ishitatsuyuki avatar Apr 27 '18 08:04 ishitatsuyuki

There is a large obstacle here: passing custom seccomp profile requires a file on the Docker host, but currently we operate remotely without an agent.

Well, we do have a Janitor clone on each Docker host, which we use to properly configure Docker (and also run an authenticated Docker proxy). Maybe we can commit a custom seccomp profile somewhere in the Janitor repository, update our host's checkouts, and point our Docker daemons to that file?

jankeromnes avatar Apr 27 '18 08:04 jankeromnes

@jankeromnes has there been any movement on this? Having rr support would be fantastic, but what I currently really need is the ability to set breakpoints in gdb, which unfortunately also requires ptrace and at least some seccomp changes.

tschneidereit avatar Aug 26 '18 14:08 tschneidereit

@tschneidereit Hello!

We can already add CAP_SYS_PTRACE as a first step.

Unfortunately, we're still blocked on creating a fork of Docker's default seccomp profile which just enables these two additional syscalls:

  • perf_event_open
  • ptrace

Note: This pull request would indeed allow rr to work (manually tested with @whimboo), but I believe that using seccomp=unconfined (i.e. completely disabling seccomp) is too risky from a security standpoint. (But maybe I'm wrong, e.g. I believe that Kubernetes disables Docker's seccomp altogether, or did at some point.)

jankeromnes avatar Aug 27 '18 13:08 jankeromnes

hi @jankeromnes , this is the most relevant resource on the nets on how safe SYS_PTRACE is for containers, or so it seems.

Kubernetes indeed allows to manually allow / apply any capabilities to either profiles or directly to pods/containers, bypassing naughty Docker.

But, is there a definitive answer on the safety of allowing SYS_PTRACE for all containers of a cluster? Are there any security risks? Also, is there any difference on CAP_SYS_PTRACE and SYS_PTRACE?

nikopen avatar Nov 14 '18 13:11 nikopen

Hi @nikopen,

this is the most relevant resource on the nets on how safe SYS_PTRACE is for containers, or so it seems.

Hm, that's quite unfortunate, as it seems nobody here really understands the security implications of enabling these capabilities and syscalls in third-party controlled containers on shared infrastructure.

is there any difference on CAP_SYS_PTRACE and SYS_PTRACE?

No, I believe these are the same thing (i.e. the capability or unit of Linux superuser privileges called "PTRACE").

Kubernetes indeed allows to manually allow / apply any capabilities to either profiles or directly to pods/containers, bypassing naughty Docker.

Thank you for this information. I guess this means that Kubernetes disables Docker's seccomp because it already has its own permission checking mechanism (presumably with similar "safe defaults" regarding allowed or forbidden capabilities, in which case I'd love to know if they allow the CAP_SYS_PTRACE capability or the perf_event_open and ptrace syscalls by default).

But, is there a definitive answer on the safety of allowing SYS_PTRACE for all containers of a cluster? Are there any security risks?

Unfortunately, we don't have a definitive answer for this. From my (limited) understanding, it seems that in the past, someone made a proof of concept using the PTRACE capability to "escape" a Docker container. I also believe (without proof) that this vulnerability has since been fixed, because the security of Docker's unprivileged containers has been improved, and I'm not aware of any recent PTRACE-related container escapes. However, I could be awfully wrong here, and I'd love to hear from someone more knowledgeable about this.

jankeromnes avatar Nov 14 '18 15:11 jankeromnes

Thank you for this information. I guess this means that Kubernetes disables Docker's seccomp because it already has its own permission checking mechanism (presumably with similar "safe defaults" regarding allowed or forbidden capabilities, in which case I'd love to know if they allow the CAP_SYS_PTRACE capability or the perf_event_open and ptrace syscalls by default).

Indeed, k8s disables Docker's seccomp, and it can be re-enabled via a specific annotation. The default ServiceAccount on an unconfigured 'vanilla' k8s installation is very open - allows pretty much everything, and then granular capabilities can be defined either via securityContext in a pod/deployment or via PodSecurityPolicies, to have unprivileged pods but allowing SYS_PTRACE et cetera.

Unfortunately, we don't have a definitive answer for this. From my (limited) understanding, it seems that in the past, someone made a proof of concept using the PTRACE capability to "escape" a Docker container. I also believe (without proof) that this vulnerability has since been fixed, because the security of Docker's unprivileged containers has been improved, and I'm not aware of any recent PTRACE-related container escapes. However, I could be awfully wrong here, and I'd love to hear from someone more knowledgeable about this.

That's interesting, is it related with this commit? https://github.com/torvalds/linux/commit/58d0a862f573c3354fa912603ef5a4db188774e7 I can see it entered the 4.8 kernel. I could mail the k8s security team to learn more about this domain, maybe they have something easily reachable to share.

nikopen avatar Nov 14 '18 16:11 nikopen

That's interesting, is it related with this commit? torvalds/linux@58d0a86

As the commit message says, in the past a process could use ptrace() to negate any seccomp() policy imposed on it, by changing the syscall number after the seccomp policy had run. Therefore if you imposed any seccomp policy at all, you had to block ptrace() as well. That particular issue is fixed since 4.8 so that is no longer a reason to block ptrace() on those kernels.

Another reason to block ptrace() is that it's just a really complicated API which provides a considerable amount of attack surface through which an attacker might be able to exploit zero-day kernel bugs. It makes sense that if your containers don't need ptrace(), you should block it. This will always be true I guess.

I don't know of any bugs or features of ptrace() that enable container escapes or other security policy violations. I think if any were known, they would be treated as security bugs and fixed. At this point I think it should be treated like any other API that is thought to be safe but adds significant attack surface.

rocallahan avatar Nov 14 '18 21:11 rocallahan

Update:

Docker's default seccomp profile now seems to authorize the ptrace syscall if kernel >= 4.8:

{
	"names": [
		"ptrace"
	],
	"action": "SCMP_ACT_ALLOW",
	"args": null,
	"comment": "",
	"includes": {
		"minKernel": "4.8"
	},
	"excludes": {}
},

(Source.)

If my interpretation is correct, this means that in order to allow the ptrace syscall in Docker containers, you no longer need a custom seccomp profile -- just a recent Docker (that has https://github.com/moby/moby/commit/1124543ca8071074a537a15db251af46a5189907) and Linux kernel (>= 4.8).

However, as far as I know, you still need CAP_SYS_PTRACE for ptrace to actually work in containers.

Also, a custom seccomp profile is still required here, because rr also needs perf_event_open (which is still blocked by Docker's default seccomp profile).

jankeromnes avatar May 06 '19 15:05 jankeromnes