runtime-spec icon indicating copy to clipboard operation
runtime-spec copied to clipboard

seccomp: Allow specification of syscalls as numbers

Open Keno opened this issue 3 years ago • 4 comments

The motivation here is the same as https://github.com/moby/moby/issues/41671, which was closed as requiring a spec change (hence this PR). In short, certain applications use high syscall numbers (e.g. 1000+) for private communication with an associated ptracer. Since these are not real syscalls, there is no corresponding string mapping for them in the runtime. Currently users simply run such applications in privileged containers, which is of course the absolute worst option. I would like to be able to provide these users a seccomp profile that works, but without being able to specify these pseudo-syscalls by number, this is not possible.

cc @thaJeztah @staticfloat

Keno avatar Apr 10 '21 00:04 Keno

@Keno looks like your commit message is missing a DCO sign-off; https://github.com/opencontainers/runtime-spec/pull/1102/checks?check_run_id=2310376370

thaJeztah avatar Apr 15 '21 08:04 thaJeztah

So, generally, I understand and am "positive" on the idea but the devil may be in the details. Let me write down some thoughts

1. architecture-dependent

One reason for (currently) using named syscalls is that conversion of a syscall to its numeric value must take into account what platform and architecture the container will be deployed on, because numeric values will differ:

Picking a random syscall (epoll_create), and looking at amd64, arm, s390x, and mips64le

SYS_EPOLL_CREATE = 213  // amd64
SYS_EPOLL_CREATE = 250  // arm
SYS_EPOLL_CREATE = 249  // s390x
SYS_EPOLL_CREATE = 5207 // mips64le

This conversion is (I think) currently handled by libseccomp, and as such will always match the actual architecture; https://github.com/seccomp/libseccomp-golang/blob/febb5ff482c15f8cafd1f7269f3a5eba047f9a5f/seccomp.go#L452

result := C.seccomp_syscall_resolve_name_arch(arch.toNative(), cString)

But pushing this "higher up the stack" to higher level runtimes where the profile is generated, it means that those runtimes must be aware of this conversion and might complicate things if (e.g.) the profile is generated on an orchestrator in a mixed-architecture cluster.

So taking the above into account, named values would be more "portable".

2. description should include accepted format for numeric values

Given that the numeric values should be passed as a string, it's important to describe what formats are accepted. The value will have to be converted from a string to an integer, and things can go wrong in this translation if it's not explicitly specified (marshalling/unmarshalling JSON can differ between implementations).

Thinking of strings that could (either intentionally or unintentionally) be regarded numeric values (hexadecimal, octal, etc.): "OxFF", "020", "DEADBEEF"

3. description should be non-ambiguous

This repository is a specification we should make absolutely sure that there's no ambiguity in how runtimes should handle these. For a specification, the goal is to describe exactly how things should be handled, and "explicit > implicit" is important.

Given that the existing ("named") syscalls are already widely in use, we'll have to deal with a transition (some runtimes may, and some runtimes may not support numeric). The spec must describe how runtimes should handle this.

Some options:

A. make support for "numeric" values OPTIONAL?

In this case, runtimes MUST / SHOULD log "unknown" syscalls as a warning, and proceed without them

This approach makes the transition easier: "higher level" runtimes can pass both named and numeric values, and depending on what "lower level" runtime is used, those numeric values would be ignored or not.

But this needs further detailing:

  • Seccomp profiles can be an "allow-list" or a "deny-list": ignoring unknown syscalls can therefore reduce security of the container (a syscall that should be blocked is now "allowed")
  • Are "mixed" (named and numeric) values allowed?
  • How should duplicates be handled? (["epoll_create", "213"])
    • pick only the numeric ones (if supported), and pick "named" ones if not supported?
    • pick both (later overrides former)
    • what if there's conflicting options (different options set for "named" and "numeric")?

B. make support for "numeric" values REQUIRED?

In this case, runtimes MUST produce an error, and reject the spec if they don't support numeric values

I think this would be the current behavior; any current runtime would not have the numeric values in their list, and would currently produce an error.

  • This is a safer option; unknown syscalls are not silently ignored, reducing the risk of "not" blocking syscalls that should be blocked
  • But it makes specifying profiles more complicated: higher level runtimes now need to know if the "lower" runtime does, or does not support numeric values

in a "mixed kernel version" cluster, syscalls that are not supported by the kernel on a node would result in a failure. To an extend, this situation can be (edit: apparently I was writing something here, but didn't include it; ignore for now (but perhaps I find it back 😅)

C. Separate fields for "numeric" and "named" values

Separate fields may reduce ambiguity, but I'd have to give this more thinking:

  • (Possibly) would allow (require?) numeric values to be passed per-architecture
  • Can remove the "duplicate entries" ambiguity, depending on how the new field should be treated, which could be:
    • "runtimes that support the new (numeric) field probably MUST pick (one of the fields), and ignore the other one"

However transition could be complicated:

  • runtimes that do not support the new (numeric) fields quite likely would "ignore" it, and as such would fully skip the seccomp profile, which is (again) likely undesirable.
  • during the transition, "higher level runtimes" MUST/SHOULD include both "named" and "numeric" variants in the profile to make sure the profile can be used.

thaJeztah avatar Apr 15 '21 11:04 thaJeztah

@thaJeztah

I agree with your point (1) though the usecase mentioned in this issue is specific to "synthetic" syscalls which have syscall numbers the kernel doesn't use at all (and thus kernel weirdness with syscall numbers don't apply).

I completely agree with point (2).

However with point (3) I think you're missing one of the key behaviours that runtimes have today -- we ignore unknown syscalls. This means that current behaviour is (3A) and in fact (3B) would be both a backwards compatibility breakage (we cannot add REQUIRED to existing fields in a non-major version) and it would make runc not spec-compliant.

(As an aside, when it comes to the fact this makes deny-lists insecure -- deny-lists are fundamentally insecure, so it's more important to make container configurations work on different kernel and libseccomp versions than it is to make deny-lists theoretically more secure.)

cyphar avatar Apr 20 '21 11:04 cyphar

Are we fine to accept this with (3A)?

cc @kolyshkin

AkihiroSuda avatar Sep 01 '22 16:09 AkihiroSuda