Allow for per-user network namespaces
Component
No response
Is your feature request related to a problem? Please describe
I'm looking for a way to restrict users' podman containers to binding only on certain ports or addresses, to ensure that no conflicts arise from different users trying to bind a service to the same interface/port combination.
Without control over either the ports that users may bind or the interfaces available to them, splitting a rootless podman setup across multiple user accounts provides limited compartmentalization or security benefit, as one user's service can "steal" another user's IP/port.
Describe the solution you'd like
Introduce a systemd setting, similar to those in systemd.resource-control, that sets up network namespaces for user sessions.
Describe alternatives you've considered
I've come to realize that the SocketBindAllow=/SocketBindDeny= rules from systemd.resource-control also block bind calls within users' rootless podman containers, making them unsuited for such a setup (I'm unsure whether to open a bug report for this).
As an alternative approach, I'm evaluating restricting users to a network namespace. However, unlike resource-control settings such as SocketBindAllow=/SocketBindDeny=, systemd's NetworkNamespacePath= setting seems applicable only to service units and not to users [see comment below].
I have tried using pam_usernet.so from libpam-net, but enabling this pam module in Debian's /etc/pam.d/common-session-noninteractive or Fedora's /etc/pam.d/system-auth breaks the respective users' sessions:
# systemctl start user@test
Job for [email protected] failed because the control process exited with error code.
See "systemctl status [email protected]" and "journalctl -xeu [email protected]" for details.
systemd[1]: Created slice user-test.slice - User Slice of UID test.
systemd[1]: Starting [email protected] - User Runtime Directory /run/user/test...
systemd[1]: Finished [email protected] - User Runtime Directory /run/user/test.
systemd[1]: Starting [email protected] - User Manager for UID test...
(systemd)[1534]: pam_unix(systemd-user:session): session opened for user test(uid=1000) by (uid=0)
systemd[1534]: Cannot determine cgroup we are running in: No medium found
systemd[1534]: Failed to allocate manager object: No medium found
systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
systemd[1]: [email protected]: Failed with result 'exit-code'.
systemd[1]: Failed to start [email protected] - User Manager for UID test.
systemd[1]: Stopping [email protected] - User Runtime Directory /run/user/test...
systemd[1]: run-user-1000.mount: Deactivated successfully.
systemd[1]: [email protected]: Deactivated successfully.
systemd[1]: Stopped [email protected] - User Runtime Directory /run/user/test.
systemd[1]: Removed slice user-test.slice - User Slice of UID test.
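For reference, the module was enabled with a pam line along these lines (a sketch; the control flag and exact file differ per distribution):

```
# appended to /etc/pam.d/common-session-noninteractive (Debian)
# or /etc/pam.d/system-auth (Fedora)
session    required    pam_usernet.so
```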
I'm unsure whether to open a bug report for this, and if so, whether against systemd or rather against libpam-net.
The systemd version you checked that didn't have the feature you are asking for
254
I've come to realize that user managers are in fact instances of the template service user@, and therefore I can place a user's units into a network namespace by creating a configuration file such as /etc/systemd/system/[email protected]/20-netns.conf:
[Unit]
After=unns@%i.service
BindsTo=unns@%i.service

[Service]
NetworkNamespacePath=/run/netns/unns%i
This causes all user services of uid 1000 to run in the unns1000 network namespace. Other sessions (e.g. an ssh session), however, will still run in the default network namespace. I will try to set up pam with pam_usernet.so to enter the netns for any means of starting a session other than systemd, but a better solution would be either systemd setting up the netns in all cases or systemd working with pam_usernet.so in the user manager's pam stack.
unns@%i.service sets up the network namespace:
/etc/systemd/system/[email protected]
[Unit]
Description=Create network namespace unns%i
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=ip netns add unns%i
ExecStop=ip netns delete unns%i
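With that unit in place, bringing a user's namespace up looks roughly like this (root required, hence the sketch below only composes and prints the commands rather than executing them):

```shell
#!/bin/sh
# Compose the commands to bring up the netns for uid 1000; run them as root.
uid=1000
cmds="systemctl daemon-reload
systemctl start unns@${uid}.service    # runs: ip netns add unns${uid}
systemctl restart user@${uid}.service  # user manager re-enters the netns"
printf '%s\n' "$cmds"
```

Type=oneshot matters here: with Type=simple, user@%i.service's After=unns@%i.service would not wait for `ip netns add` to finish before the user manager starts.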
Hmm, so if I read this correctly, you want PrivateNetwork=yes with a specific name?
Hello @YHNdnzj thank you for replying.
It seems that NetworkNamespacePath= already is PrivateNetwork= with a specific name. My issue is that neither of them is applicable to users/slices, only to services. Setting NetworkNamespacePath= in the [email protected] configuration applies the namespace only to services in user scope, but not to other sessions using pam_systemd.so.
Making NetworkNamespacePath=/PrivateNetwork= part of systemd.resource-control or similar, so it is applicable to users, is my feature request.
As systemd does not support setting network namespaces for users, I set them up using a pam module. Sadly, systemd is unable to start the user session if the session is initiated with its own namespace -- at least when pam_usernet.so from libpam-net is called from pam.d/systemd-user. I'm sceptical whether this is really a systemd issue, as libpam-net remounts /sys/, resulting in an empty /sys/fs/cgroup/.
I am not sure I follow? Are you suggesting to run the whole user session inside a netns? And how shall it talk to the network then?
And what does that have to do with port assignment?
note that ebpf provides everything to allow restricting which ports/ip addresses code can bind to. We expose that in IPAddressAllow=/IPAddressDeny= and SocketBindAllow=/SocketBindDeny=. But that has nothing to do with netns, and just works like that?
anyway, i think i am lacking context here.
I'm setting up whole user sessions to run in separate network namespaces, after IPAddressAllow=/IPAddressDeny= and SocketBindAllow=/SocketBindDeny= did not work out for me.
In this approach, users access the network via a veth pair to the default network namespace, with adequate routing set up. It works quite well when I initiate such sessions via systemd-run --user -S, with [email protected] configured to use NetworkNamespacePath= in a manner similar to the above.
Indeed, those eBPF bind restrictions are a different matter, but I tried solving my problem with them first. It turned out they don't work well with podman containers: the eBPF filters reject bind calls even within containers, as the filters don't discern namespaces (the SocketBind* filters really only check whether a bind is v4 or v6 and then match the port number). I'm unsure whether this is intended behaviour. I can open a detailed separate bug report on the issues they cause with containers.
To illustrate my issue with SocketBindAllow=/SocketBindDeny=: this service fails with nc: Operation not permitted. I'll create a separate issue for this, because the more I think about it, the more I believe separate namespaces should be exempt from SocketBind* limitations.
[Unit]
Description=test SocketBindDeny
[Service]
ExecStart=nc -l 1234
SocketBindAllow=4321
SocketBindDeny=any
PrivateNetwork=yes
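The same failure can also be reproduced with a transient unit, since these resource-control properties are settable via systemd-run -p (the invocation needs root, so the sketch only composes and prints it):

```shell
#!/bin/sh
# Compose the equivalent systemd-run reproduction; run it as root.
repro="systemd-run -p SocketBindAllow=4321 -p SocketBindDeny=any -p PrivateNetwork=yes --wait nc -l 1234"
printf '%s\n' "$repro"
```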
i still don't grok what you really want to do here.
What I want to do personally
I want to do multi-tenant hosting of (not only) userspace containers.
Linux really falls short on multi-tenant network permissions. {IPAddress,SocketBind}{Allow,Deny}= are direly needed features, but they are not compatible with containers (at least rootless podman containers) at this point, making them unsuitable for my project.
So I decided instead to assign each user their own network namespace, each with one veth network interface that is a point-to-point connection to the server's root network namespace.
Why I want to do it
This solution, besides being functional today and avoiding all conflicts on network resources, is more versatile than alternatives like SocketBind* or the oldschool authbind. With the per-user netns, I can set up any routing, forwarding, filtering... I can assign public IPs to users. Users get the most straightforward interface possible: just a single network interface where they can do what they want. Very satisfying!
I set net.ipv4.ip_unprivileged_port_start=32768 in the default netns and net.ipv4.ip_unprivileged_port_start=0 in each of the user network namespaces. Now, to enable users to use their private veth, their sessions shall be launched in this netns. Launching their --user services in the netns works perfectly with a central NetworkNamespacePath=/run/netns/unns%i (%i being the uid).
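Concretely, the per-namespace veth, routing, and sysctl setup described above can be sketched as follows (the interface names and 192.168.100.0/30 addressing are purely illustrative; the script only prints the commands, which need to be run as root in a namespace set up like unns1000):

```shell
#!/bin/sh
# Compose the veth + routing + sysctl setup for one user netns (illustrative
# names and addresses); run the printed commands as root.
uid=1000
ns="unns${uid}"
setup="ip link add veth${uid} type veth peer name eth0 netns ${ns}
ip addr add 192.168.100.1/30 dev veth${uid}
ip link set veth${uid} up
ip -n ${ns} addr add 192.168.100.2/30 dev eth0
ip -n ${ns} link set eth0 up
ip -n ${ns} link set lo up
ip -n ${ns} route add default via 192.168.100.1
ip netns exec ${ns} sysctl -w net.ipv4.ip_unprivileged_port_start=0
sysctl -w net.ipv4.ip_unprivileged_port_start=32768"
printf '%s\n' "$setup"
```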
But I want to apply this user netns to the users' ssh sessions too. They can have a nice session in their netns using systemd-run --user -S, but I'd want them to "ssh into their netns". I want their netns to be the only environment they see. (Although breaking out of the netns would not be a security issue.)
Systemd as of today only lets me apply NetworkNamespacePath= to the user's service manager and services, and the alternative, libpam-net, behaves subtly differently from NetworkNamespacePath=. Unlike systemd's NetworkNamespacePath=, libpam-net (like ip netns exec) remounts /sys and does not maintain /sys/fs/cgroup/. This seems to be what breaks systemd user sessions when trying to use libpam-net instead of NetworkNamespacePath=, and it makes using them together (enabling libpam-net for pam sessions other than systemd-user) tricky and inconsistent.
What change I propose
My favourite solution would be a way to apply NetworkNamespacePath= to all sessions of users where pam_systemd.so is invoked, or maybe simply a way to start ssh sessions as a transient unit (I don't know whether this is already possible, what it would entail, or whether it is a good idea at all).
I know that netns are not cgroups, but to me it would make sense to make them applicable just like the resource-control options.
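To make the request concrete, this is what such a setting could look like, e.g. as a slice drop-in. This syntax does not exist in systemd today; it is purely a hypothetical illustration of the proposal:

```
# /etc/systemd/system/user-1000.slice.d/20-netns.conf  (hypothetical!)
[Slice]
NetworkNamespacePath=/run/netns/unns1000
```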
I'm doing something similar and would also benefit from being able to let a user log in to a predefined netns. I'm currently doing pretty much exactly what krumelmonster does, combined with some other trickery to get the whole user session to enter its own netns.
Thanks for joining in!
I'm using this trickery right now: https://github.com/krumelmonster/libpam-net -- ping me in the issues there if you'd like to talk trickery. But yes, having NetworkNamespacePath= for users would make the setup a lot less cumbersome.