Proposal: Add Security Context
OCI Security Context
Summary
- The existing high-level container runtimes (e.g., containerd) offer their default Seccomp profiles that are allowlists of system calls to make containers secure.
  - However, the runtime default profiles still include many system calls that actually are not used by containers because the profiles drop only potentially dangerous system calls.
- Recently, new system call analysis techniques have been proposed in research papers.
- By using the techniques, container image developers can generate more accurate default profiles for the container image than the runtime default profiles.
- In this issue, we propose defining the new securitycontext media type in the Image Media Types and adding SecurityContext as a new field to the config section of the Image Configuration.
  - The goal of this proposal is to allow the users to choose the image default security context including the default Seccomp profiles from the container orchestration software such as Kubernetes.
  - To use this feature from the existing orchestration software, we need to add a new setting like ImageDefault to the orchestration's configuration as extra work.
- There is no formal definition for backward-compatible changes in this new feature.
Background
Containers offer weaker isolation than Virtual Machines because all containers running on the same host share the same OS kernel. Therefore, it is important to reduce the attack surface of the kernel used by containers. The attack surface can be reduced with Secure computing (Seccomp), which restricts the system calls available to each container. Additionally, Linux capabilities and Mandatory Access Control (MAC) mechanisms like SELinux and AppArmor provide defense in depth.
The existing high-level container runtimes such as containerd and CRI-O offer their default Seccomp profiles if the user sets them in a configuration of Kubernetes as follows.
securityContext:
  seccompProfile:
    type: RuntimeDefault
The default Seccomp profiles are allowlists that drop potentially dangerous system calls such as pivot_root and ptrace.
Thanks to the default profiles, users can easily apply Seccomp to containers without analyzing the system calls the containers use.
However, the profiles still include many system calls that the containers do not actually use. If users want to deny those system calls, they need to inspect the containers and identify the system calls the containers require using DockerSlim [1] or other dynamic analysis tools [2] [3]. Unfortunately, dynamic analysis tools are not perfect because they cannot observe code paths that are executed rarely, such as error handling routines. To identify system calls correctly, a static analysis strategy is necessary, but correctly inspecting system calls inside containerized applications poses many challenges.
Motivation
Recently, various state-of-the-art system call analysis techniques have been proposed in research papers to tackle the above issues. Typical examples include Confine [4] and Sysfilter [5].
Confine is a new static analysis-based system for automatically extracting and enforcing system call policies on containers. Confine inspects containerized applications and all their dependencies, identifies the superset of system calls required for the correct operation of the containers, and generates corresponding Seccomp system call policies that can be readily enforced while loading the containers. Compared to the existing system call analysis tools, Confine can extract system calls more correctly by analyzing containers statically. The results of Confine's evaluation by the authors with 150 publicly available Docker images show that Confine can successfully reduce their attack surface by disabling 145 or more system calls for more than half of the containers, neutralizing 51 disclosed kernel vulnerabilities.
If container image developers can use Confine or other new static analysis-based systems to extract system calls that are used by container images, they can generate more accurate default profiles for the container image than runtime default profiles. The image default profiles can drop more system calls in the containers, with other services and functionality disabled. As a result, attack surfaces are typically much smaller than they would be with general-purpose containers, so there are fewer opportunities to attack and compromise the containers.
Proposal
The goal of this proposal is to allow users to choose the image default security context, including the default Seccomp profiles and Capability settings, from container orchestration software such as Kubernetes. This proposal can make containers more secure, and users can save the time and effort spent on the security configuration of their containers. To achieve this, we propose defining a security context media type in the OCI Image Media Types and adding a security context field to the OCI Image Configuration.
The reason for naming the media type securitycontext is to allow security information such as Capability to be added in the future. Recently, various techniques that measure Linux container security have been proposed in research papers [6] [7]. If image developers can accurately measure the Capabilities used by applications in container images by leveraging those tools, they can set the default Capabilities in the image config. Considering this, we think it is better to add general security settings to the Image Configuration, not limited to Seccomp.
Each change is described below.
Image Media Type
We propose defining the new securitycontext media type in the Image Media Types.
- application/vnd.oci.image.securitycontext.v1+json
This contains information about the security context, which includes Seccomp and Linux Capability settings. We expect that the information is created by container image developers. For example, the image developer analyzes a container image in advance using system call analysis tools such as Confine and writes the seccomp profiles into this securitycontext JSON file.
The information is passed to each section in the OCI runtime specification by the high-level container runtimes. Hence, all the contents in the securitycontext follow the runtime specification configurations.
Here is an example:
application/vnd.oci.image.securitycontext.v1+json
{
  "seccomp": {
    "defaultAction": "SCMP_ACT_ALLOW",
    "architectures": [
      "SCMP_ARCH_X86",
      "SCMP_ARCH_X32"
    ],
    "syscalls": [
      {
        "names": [
          "swapoff",
          "pivot_root",
          ...
        ],
        "action": "SCMP_ACT_ERRNO"
      }
    ]
  },
  "capabilities": {
    "bounding": [
      "CAP_AUDIT_WRITE",
      "CAP_KILL",
      "CAP_NET_BIND_SERVICE"
    ],
    ...
  }
}
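As a rough illustration of how a high-level runtime might pass this information into the runtime specification, the sketch below copies the seccomp section into linux.seccomp and the capabilities section into process.capabilities of a runtime config. The function name apply_security_context and the sample runtime_config are hypothetical and not part of any existing runtime; only the two target locations come from the OCI runtime specification.

```python
import copy


def apply_security_context(runtime_config: dict, security_context: dict) -> dict:
    """Return a copy of runtime_config with the image security context merged in.

    Hypothetical helper: it only illustrates where each section would land.
    """
    merged = copy.deepcopy(runtime_config)
    if "seccomp" in security_context:
        # The runtime specification keeps Seccomp settings under linux.seccomp.
        merged.setdefault("linux", {})["seccomp"] = copy.deepcopy(
            security_context["seccomp"]
        )
    if "capabilities" in security_context:
        # The runtime specification keeps capability sets under process.capabilities.
        merged.setdefault("process", {})["capabilities"] = copy.deepcopy(
            security_context["capabilities"]
        )
    return merged


# Sample data: the elided entries ("...") of the example above are simply left out.
security_context = {
    "seccomp": {
        "defaultAction": "SCMP_ACT_ALLOW",
        "architectures": ["SCMP_ARCH_X86", "SCMP_ARCH_X32"],
        "syscalls": [
            {"names": ["swapoff", "pivot_root"], "action": "SCMP_ACT_ERRNO"}
        ],
    },
    "capabilities": {
        "bounding": ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"]
    },
}
runtime_config = {"ociVersion": "1.0.2", "process": {"args": ["sh"]}}

merged = apply_security_context(runtime_config, security_context)
print(merged["linux"]["seccomp"]["defaultAction"])  # SCMP_ACT_ALLOW
```

The helper returns a new dictionary instead of mutating its input, so the same base config can be reused for containers that do not opt in to the image default.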
Image Configuration
We propose adding SecurityContext as a new field to the config section of the Image Configuration. This field points to a specific security context that includes information about security configurations. SecurityContext includes a set of descriptor properties.
Here is an example:
application/vnd.oci.image.config.v1+json
"config": {
  "User": "alice",
  ...
  "SecurityContext": {
    "mediaType": "application/vnd.oci.image.securitycontext.v1+json",
    "size": 200,
    "digest": "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270"
  }
},
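The descriptor properties can be derived from the security context file itself. The following is a minimal sketch, assuming the file is stored as a content-addressed blob hashed with SHA-256 like other OCI blobs; make_descriptor, verify_blob, and the sample blob contents are hypothetical names and data chosen for illustration.

```python
import hashlib
import json

MEDIA_TYPE = "application/vnd.oci.image.securitycontext.v1+json"


def make_descriptor(blob: bytes) -> dict:
    """Build the descriptor (mediaType, size, digest) for a security context blob."""
    return {
        "mediaType": MEDIA_TYPE,
        "size": len(blob),
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
    }


def verify_blob(descriptor: dict, blob: bytes) -> bool:
    """Check that a fetched blob matches the size and digest in its descriptor."""
    algorithm, _, expected = descriptor["digest"].partition(":")
    if algorithm != "sha256":
        return False  # this sketch handles only sha256 digests
    return (
        descriptor["size"] == len(blob)
        and hashlib.sha256(blob).hexdigest() == expected
    )


# Hypothetical security context content, serialized to bytes.
blob = json.dumps(
    {"seccomp": {"defaultAction": "SCMP_ACT_ERRNO"}}, sort_keys=True
).encode("utf-8")

descriptor = make_descriptor(blob)
image_config = {"config": {"User": "alice", "SecurityContext": descriptor}}

print(verify_blob(descriptor, blob))         # True
print(verify_blob(descriptor, blob + b" "))  # False
```

A runtime that resolves SecurityContext would presumably run a check like verify_blob before trusting the profile, in the same way layer blobs are verified against their descriptors.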
Expected Use Cases
User Side:
An example of Seccomp for Kubernetes users is described below.
Set default Seccomp profiles for a container image.
spec:
  securityContext:
    seccompProfile:
      type: ImageDefault
With the above configuration, Kubernetes applies the image default profiles to the container.
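One way a runtime could resolve the profile for each type is sketched below. Unconfined, RuntimeDefault, and Localhost are the existing Kubernetes seccompProfile types; ImageDefault is the type proposed here. The function resolve_seccomp_profile and its parameters are hypothetical, and raising an error when the image ships no profile is an assumption, since this proposal does not define that case.

```python
from typing import Optional


def resolve_seccomp_profile(
    profile_type: str,
    runtime_default: dict,
    image_security_context: Optional[dict] = None,
    localhost_profile: Optional[dict] = None,
) -> Optional[dict]:
    """Pick the Seccomp profile to enforce; None means no filtering."""
    if profile_type == "Unconfined":
        return None
    if profile_type == "RuntimeDefault":
        return runtime_default
    if profile_type == "Localhost":
        if localhost_profile is None:
            raise ValueError("Localhost profile was requested but not provided")
        return localhost_profile
    if profile_type == "ImageDefault":
        # Proposed type: take the profile shipped with the image.
        seccomp = (image_security_context or {}).get("seccomp")
        if seccomp is None:
            # Assumption: fail loudly rather than silently fall back.
            raise ValueError("image does not ship a default Seccomp profile")
        return seccomp
    raise ValueError(f"unknown seccompProfile type: {profile_type}")


# Placeholder profiles for illustration only.
runtime_default = {"defaultAction": "SCMP_ACT_ERRNO", "syscalls": []}
image_context = {
    "seccomp": {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [{"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"}],
    }
}

chosen = resolve_seccomp_profile("ImageDefault", runtime_default, image_context)
print(chosen["syscalls"][0]["names"])  # ['read', 'write']
```

Failing instead of falling back to RuntimeDefault keeps a misconfigured image visible to the user; a real implementation might reasonably choose the fallback.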
Image Developer Side:
An example for image developers is described below.
- Analyze a container image using system call analysis tools such as Confine.
- Add information about Seccomp to the OCI image configuration.
  - Create a security context file in accordance with application/vnd.oci.image.securitycontext.v1+json and add the information to the SecurityContext in the Image Configuration.
- Push the container image to public or private registries.
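The second step above can be sketched as follows, assuming the analysis step yields a plain list of required system call names. That input format is an assumption and is not Confine's actual output format; build_allowlist_profile and the sample names are hypothetical. The sketch emits the allowlist form (default action SCMP_ACT_ERRNO with the required calls allowed), whereas the example earlier in this proposal lists denied calls; both shapes are valid under the runtime specification.

```python
import json


def build_allowlist_profile(required_syscalls, architectures):
    """Build a runtime-spec style Seccomp profile that allows only the given calls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": list(architectures),
        "syscalls": [
            {
                # Deduplicate and sort so the generated file is reproducible.
                "names": sorted(set(required_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Hypothetical analysis result for a small application.
required = ["read", "write", "exit_group", "read"]

security_context = {
    "seccomp": build_allowlist_profile(required, ["SCMP_ARCH_X86_64"])
}

# This JSON text is what would be stored as the securitycontext blob.
print(json.dumps(security_context, indent=2))
```

Sorting the names keeps the serialized file stable across runs, which matters because the digest in the image configuration changes whenever the bytes change.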
Limitations
The default security context only reflects the container image as analyzed by the image developer in advance. Therefore, if the user puts additional binaries into the image, the user cannot use the default security context because it does not account for the system calls used by those binaries.
Future Work
Currently, we have plans to develop a tool that allows image developers to easily analyze containerized applications inside an image using Confine and create an OCI image configuration including the image default Seccomp profiles. We're also thinking about adding support for Kubernetes to Confine because the current implementation of Confine can extract system calls from only Docker containers. Additionally, we need to add a new Seccomp type ImageDefault in the security context of Kubernetes and modify the high-level container runtimes such as containerd to extract the Seccomp profiles from the Image Configuration when users choose the image default Seccomp profiles.
Backward Compatibility
There is no formal definition for backward-compatible changes in this new feature.
References
[1] DockerSlim. https://dockersl.im
[2] strace. https://strace.io
[3] oci-seccomp-bpf-hook. https://github.com/containers/oci-seccomp-bpf-hook
[4] Seyedhamed Ghavamnia, Tapti Palit, Azzedine Benameur, and Michalis Polychronakis. Confine: Automated System Call Policy Generation for Container Attack Surface Reduction. In International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2020.
[5] Nicholas DeMarinis, Kent Williams-King, Di Jin, Rodrigo Fonseca, and Vasileios P. Kemerlis. sysfilter: Automated System Call Filtering for Commodity Software. In International Symposium on Research in Attacks, Intrusions and Defenses (RAID), 2020.
[6] J. Criswell, J. Zhou, S. Gravani, and X. Hu. PrivAnalyzer: Measuring the Efficacy of Linux Privilege Use. In 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2019.
[7] Xin Lin, Lingguang Lei, Yuewu Wang, Jiwu Jing, Kun Sun, and Quan Zhou. A measurement study on Linux container security: Attacks and countermeasures. In Proceedings of the 34th Annual Computer Security Applications Conference (ACSAC), 2018.
Thanks for the proposal!
Some general questions below:
- Is this new securityContext image media type a reflection of container-level K8s security context: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/? Say a subset, full set or super set of it for example?
- Why this abstraction has to bind with images/image-spec? An image is possible to run with different security contexts. And we have to anyway follow the runtime spec configs for this image.securitycontext. Letting a higher level to handle this looks fine (which can still leverage the tools like Confine etc. in their DevOps pipeline). What's the benefit for a image default one?
- How is the conflict handled if specified differently in runtime-spec, CRI runtime or elsewhere?
- Tools like Confine seems to only work for syscall analysis. What about other properties in the security context? Looks like a combination of utilities is thus needed?
- Any burden brought to developers? Though the media type can be optional, it brings no benefit then. I wonder how demanding/practical is this feature if without mature tooling as asked in 3).
@kailun-qin I apologize for the late reply. Thank you for your valuable comments and questions!
- Yes, this new securityContext is a subset of the container-level K8s security context or runtime-spec. The new media type allows users to set a default security context that is more secure than the high-level container runtimes' default seccomp profiles.
- The main benefit for image-spec is that users can apply the default security context to their containers transparently without burden. The image.securitycontext is a default security context of an image that is created by the image developer. To make the default security context transparent to users, the security context should be in the image-spec because the high-level container runtimes such as containerd create the runtime-spec config file (config.json) based on the image-spec and K8s config. If we let a higher level such as a DevOps pipeline handle this, we have to run analysis tools like Confine to extract default seccomp profiles and apply them to containers by ourselves. This is good for users who modify the existing images or manage their images in their own registry. However, it is tiring for users who use the images in a public registry without modifying them. Therefore, from the perspective of demarcation of responsibility, the image.securitycontext should be created by the image developers and stored in the image spec so that high-level runtimes can extract it.
- This is a default configuration, so if users have already set the security configuration in the CRI runtime, etc., this image.securitycontext should be overwritten.
- Yes, Confine works only for syscall analysis, so if image developers want to set default Capabilities, they have to use other tools. As I mentioned above, recently, various techniques that measure Linux container security have been proposed in research papers [6] [7]. If image developers can accurately measure the Capabilities used by applications in container images by leveraging those techniques, they can set the default Capabilities in the image config. Therefore, I'd like this proposal to cover not only seccomp profiles but also Capabilities for the future.
- The burden for image developers is just to run a tool such as Confine and store the information in the image spec. As you said, without mature tools such as Confine, this feature will not be useful. However, as of now, Confine is the most practical tool to meet the requirements for this new feature, and we have confirmed that it works properly, though we need to apply a few patches to Confine for it to run on newer kernel versions.
Thank you.