nerdctl
nerdctl run with --cap-add NET_BIND_SERVICE not working
Description
I have several linuxserver-based containers whose unprivileged services bind to port 80 inside the container, so I can access them through a VPN without having to add port numbers to my URLs. This setup has been working without issue on Docker.
Now I'm moving to containerd (Docker support is being dropped on TrueNAS SCALE) and most of my containers fail to bind to port 80.
I modified my `run` commands to use `--cap-add NET_BIND_SERVICE` as instructed in the containerd GitHub page, but the containers still fail to bind.
I can use `docker inspect` on the old containers to confirm that NET_BIND_SERVICE is present, but `nerdctl inspect` does not return any CapAdd field.
Steps to reproduce the issue
- Configure a container with an unprivileged service that listens on port 80 internally
- Launch the container using `nerdctl run --cap-add NET_BIND_SERVICE`
- Watch the initialization logs of the container
Describe the results you received and expected
I expected the unprivileged service to bind to port 80 / 443, but it doesn't.
What version of nerdctl are you using?
1.5.0
Are you using a variant of nerdctl? (e.g., Rancher Desktop)
None
Host information
Client:
 Namespace: default
 Debug Mode: false

Server:
 Server Version: 1.6.8
 Storage Driver: overlayfs
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Log: fluentd journald json-file syslog
  Storage: native overlayfs zfs
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.107+truenas
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 15.41GiB
Hi @Caian
The nerdctl `--cap-add` option works correctly in my environment.
root@kay201:~# nerdctl run --cap-add NET_BIND_SERVICE,CHOWN,DAC_OVERRIDE,SETGID,SETUID --cap-drop ALL -d --name haha -p 80:80 docker.m.daocloud.io/nginx:alpine
0b016f6b89ec031c880fcc0c6aaf5deb6538dd736aa94381218a85adc3defe11
root@kay201:~# nerdctl ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0b016f6b89ec docker.m.daocloud.io/nginx:alpine "/docker-entrypoint.…" 4 seconds ago Up 0.0.0.0:80->80/tcp haha
root@kay201:~# curl 127.0.0.1:80
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
Also, `nerdctl inspect` does not show the capability information, but `ctr c info` does.
Could you please give us more detail about the steps to reproduce the issue? :-)
==> nerdctl inspect --mode=native
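Another way to check what a process actually holds, independent of what `inspect` prints, is to read the `CapEff` mask from `/proc/<pid>/status` inside the container. The sketch below only decodes such a mask: it tests bit 10, which is the number assigned to `CAP_NET_BIND_SERVICE` in the kernel headers. The function name is made up for illustration, and the usage in the comments assumes the `haha` container from the reply above and an image that ships `awk`.

```shell
#!/bin/sh
# CAP_NET_BIND_SERVICE is capability number 10 in <linux/capability.h>.
CAP_NET_BIND_SERVICE_BIT=10

# has_net_bind_service MASK   (hypothetical helper name)
# MASK is a hex capability mask as printed in /proc/<pid>/status,
# e.g. the value of the CapEff line, without a 0x prefix.
# Prints "yes" if the CAP_NET_BIND_SERVICE bit is set, "no" otherwise.
has_net_bind_service() {
    if [ $(( (0x$1 >> CAP_NET_BIND_SERVICE_BIT) & 1 )) -eq 1 ]; then
        echo yes
    else
        echo no
    fi
}

# Example usage (not run here; assumes the container is named "haha"):
#   mask=$(nerdctl exec haha awk '/^CapEff/ {print $2}' /proc/1/status)
#   has_net_bind_service "$mask"
```

Note that PID 1 is often root inside the container; the process that actually calls `bind()` may be a different, unprivileged PID with an empty effective set.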
> I modified my `run` commands to use `--cap-add NET_BIND_SERVICE` as instructed in the containerd github page, but the containers still fail to bind.
>
> I expected the unprivileged service to bind to port 80 / 443, but it doesn't.
Since there's no feedback with more details, it appears @Caian did not add the capability when previously using Docker (since it was not necessary).
- With Docker, the sysctl `net.ipv4.ip_unprivileged_port_start` (default `1024`) is dropped to `0`, allowing any unprivileged process to bind the typical privileged ports without being granted `CAP_NET_BIND_SERVICE`.
- With their `containerd` attempt, I assume that was the default `1024` and they've only granted the capability to root, whereas their image runs the binding process as a non-root user (where the capability will not be in the effective set).
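The rule described above can be sketched as a small check: a bind to a port needs `CAP_NET_BIND_SERVICE` only when the port is below `net.ipv4.ip_unprivileged_port_start`. The function name is hypothetical, and the default of `1024` is the kernel default mentioned above.

```shell
#!/bin/sh
# needs_bind_cap PORT [UNPRIVILEGED_PORT_START]   (hypothetical helper name)
# Prints "yes" if binding PORT requires CAP_NET_BIND_SERVICE given the
# value of net.ipv4.ip_unprivileged_port_start (default 1024), else "no".
needs_bind_cap() {
    port=$1
    start=${2:-1024}
    if [ "$port" -lt "$start" ]; then
        echo yes
    else
        echo no
    fi
}

# Example usage (not run here): read the live value inside the container
# and ask about port 80.
#   start=$(cat /proc/sys/net/ipv4/ip_unprivileged_port_start)
#   needs_bind_cap 80 "$start"
```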
Solutions
- Set the sysctl option `--sysctl net.ipv4.ip_unprivileged_port_start=0` to bypass the need for the capability (all processes within the container can then bind to these ports, similar to if ambient capabilities were supported).
- If the services link to `libc` and are not `scratch` / `distroless` like images, they could probably [use `authbind`](https://stackoverflow.com/questions/413807/is-there-a-way-for-non-root-processes-to-bind-to-privileged-ports-on-linux/27989419#27989419) (useful when the service that binds is script based like Python, JS, shell, etc).
- Use `setcap cap_net_bind_service=ep file_name` to grant the capability to the Permitted and Effective sets on the executable (useful for software built as a static binary without `libc`, common with Rust/Go). This is considered a ["capability-dumb"](https://man7.org/linux/man-pages/man7/capabilities.7.html) approach, for when the software cannot be made capability-aware. The drawback is that the kernel checks, before the executable runs, that the permitted capability can be made effective for the process, so the executable fails to start without it even when the program would not actually need the capability (e.g. a user binds to an unprivileged port and drops all capabilities as a security measure).
- Use `setcap cap_net_bind_service=p file_name` when the program is capable of observing its Permitted set and raising the needed capability into the Effective set. This is ideal when Ambient capabilities cannot be used (commonly not supported within containers, nor do you necessarily want to grant ambient capabilities process-wide).
- Run as root with all capabilities dropped except for those needed. Similar to the potential for Ambient support, this is less viable with most base images, but may be acceptable for `scratch` or certain `distroless` variants.

The majority of container vulnerabilities that motivate users to adopt a non-root user are reliant upon adequate capabilities being granted, which can still be exploited from a non-root user 🤷‍♂️
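As a rough sketch of the first option, the sysctl flag can be wrapped so it is applied to every `run`. The wrapper name and the `CONTAINER_CLI` variable are made up for illustration; only the `--sysctl net.ipv4.ip_unprivileged_port_start=0` flag comes from the list above.

```shell
#!/bin/sh
# CLI to invoke; defaults to nerdctl. (CONTAINER_CLI is a hypothetical
# variable introduced for this sketch.)
: "${CONTAINER_CLI:=nerdctl}"

# run_with_unprivileged_low_ports ARGS...   (hypothetical helper name)
# Starts a detached container in which any process, root or not, may bind
# ports below 1024, without granting CAP_NET_BIND_SERVICE.
run_with_unprivileged_low_ports() {
    "$CONTAINER_CLI" run -d \
        --sysctl net.ipv4.ip_unprivileged_port_start=0 \
        "$@"
}

# Example usage (not run here; name and image are placeholders):
#   run_with_unprivileged_low_ports --name web -p 80:80 example/image
```

The sysctl is scoped to the container's network namespace, so this should not change which ports unprivileged processes on the host can bind.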
NOTE: The `setcap` approach for file-based capabilities:
- Will remove `LD_PRELOAD` and `LD_LIBRARY_PATH` environment variables on binaries linked to `libc` (verify with `ldd file_name`), which depending on the software may introduce a regression.
- On some systems (like a [Synology NAS](https://github.com/caddyserver/caddy-docker/issues/290#issuecomment-1504845336)) `setcap` is not able to be used in an image build (likely due to AUFS + kernel).
- User-namespaced containers require kernel 4.14.

Ambient capabilities require at least kernel 4.3, and the sysctl requires at least kernel 4.11.
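For the two `setcap` variants discussed above, a minimal sketch of a build-step helper might look like the following. The helper name and the `SETCAP` variable are hypothetical; the capability strings `cap_net_bind_service=ep` and `cap_net_bind_service=p` are the ones from the list of solutions.

```shell
#!/bin/sh
# Binary used to set file capabilities. (SETCAP is a hypothetical variable
# introduced for this sketch; it defaults to the libcap tool.)
: "${SETCAP:=setcap}"

# grant_bind_cap FILE [MODE]   (hypothetical helper name)
# MODE "ep" (default): capability-dumb binary, Permitted + Effective sets.
# MODE "p": capability-aware program that raises the capability itself.
grant_bind_cap() {
    file=$1
    mode=${2:-ep}
    case $mode in
        ep|p) ;;
        *) echo "mode must be 'ep' or 'p'" >&2; return 2 ;;
    esac
    "$SETCAP" "cap_net_bind_service=$mode" "$file"
}

# Example usage (not run here; the path is a placeholder):
#   grant_bind_cap /usr/local/bin/my-service
#   getcap /usr/local/bin/my-service   # confirm what was written
```

This would normally run as root during the image build, which is exactly the step that reportedly fails on some systems per the note above.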
Sorry, I forgot to reply to the thread. Yes, I ended up adding `--sysctl net.ipv4.ip_unprivileged_port_start=0` instead of using `--cap-add`, which solved the issue.
@AkihiroSuda there is no bug here, we should close.
@Caian was able to get what they want with `ip_unprivileged_port_start` (which is the Docker behavior), and @polarathene provided great details about why just using the cap on a random image will not work.