Networking

Open samuelkarp opened this issue 2 years ago • 23 comments

Continuing the conversation from https://github.com/samuelkarp/runj/issues/19, specifically about networking.

cc @davidchisnall, @gizahNL

samuelkarp avatar Oct 29 '21 06:10 samuelkarp

@davidchisnall wrote:

Thanks. I'm interested (time permitting) in working on some of the network integration (vnet + pf). Pot already seems to manage this reasonably well, so should provide a good reference. I don't have a very good understanding of how the various bits (containerd / runj / CNI) fit together (all of the docs seem to assume that you know everything already and throw terminology at you).

You shouldn't need nested jails for jail-to-jail networking, you 'just' need to set up the routing.

@gizahNL wrote:

You could take a look at my moby port. It has (barebones) working network and barebones pf support.

The strategy I used is to create a base jail that allows a child jail to be spawned that handles the vnet networking, and another child jail that is the actual container. The rationale is that Linux containers lack the tools to configure the FreeBSD network stack, and that Kubernetes pods assume a shared network namespace.

I still have a PR open here that needs more work on it, but unfortunately I've been swamped with other commitments.

samuelkarp avatar Oct 29 '21 06:10 samuelkarp

You shouldn't need nested jails for jail-to-jail networking, you 'just' need to set up the routing.

Because Linux containers are a bit more Lego block-like, a really common pattern is to have containers share namespaces (and in particular share the network namespace). This allows for those containers to have a common view of ports, interfaces, routes, and IP space. Orchestrators (well, Kubernetes specifically) may have baked-in assumptions that a set of containers treated as a single unit (Kubernetes pod, Amazon ECS task, etc) have a single exposed IP address.

I don't know if nested jails are necessary for that, but that was the approach I saw @gizahNL use.

If you're interested in doing networking, how about starting with something basic first? My initial approach was to look at vnet=inherit as a default for now, just so that a jail can have some network connectivity, and then leave the more complicated bits for later, since anything here would involve either (a) a new component or (b) changes to the OCI runtime spec (or both!).
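
(For illustration, inherited networking with plain jail(8) looks roughly like the sketch below; the jail name and path are placeholders, and this is not runj's actual interface.)

# Placeholder name/path; "inherit" for vnet/ip4/ip6 gives the jail the host's
# network stack, roughly analogous to host networking for Linux containers.
jail -c name=example path=/jails/example \
    vnet=inherit ip4=inherit ip6=inherit \
    command=/bin/sh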

I don't have a very good understanding of how the various bits (containerd / runj / CNI) fit together (all of the docs seem to assume that you know everything already and throw terminology at you).

The OCI runtime spec doesn't say much about networking for Linux containers. Typically, the bundle config describes either that a new network namespace should be created (a LinuxNamespace struct with the Type set to network and the Path empty) or that an existing namespace should be joined (the Path pointing to that namespace). Then something at a different layer (above the OCI runtime) is responsible for configuring that namespace with the appropriate network interfaces, routes, etc. This can be done by a CNI plugin (as is the case in Kubernetes and in some situations in Amazon ECS), directly by a higher-level invoking runtime (as Docker does), or whatever other component you want; CNI is an optional and somewhat standard way to do it, but the whole setup is outside the scope of an OCI runtime on Linux anyway.
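
(As a rough illustration of those two cases, the linux.namespaces entries in a bundle's config.json look like the lines below; the first creates a new, empty network namespace, while the second joins an existing one at an example path.)

{ "type": "network" }
{ "type": "network", "path": "/var/run/netns/example" }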

Looking at other operating systems: Windows containers appear to have a somewhat different modeling of networking with a WindowsNetwork struct in the bundle config that has some additional options around DNS and endpoints (which I assume are vNICs?). But there's also a NetworkNamespace ID specified there and my understanding is that multiple Windows containers can share the same network namespace.

From my (limited) knowledge of FreeBSD jails, it looks like there is a bit more structure around how a jail's network is configured. Specifically, I see vnet, ip4, and ip6 related options in the jail(8) manual page.

I'm not sure what the right path is for FreeBSD jails. The nested approach that @gizahNL suggested sounds to me like the closest to the existing Linux and Windows patterns that are used in the ecosystem and would likely play nicely with that style of separation of concerns, where a separate component could configure the parent jail's network without input from runj. On the other hand, since jails do have more structure, it could also be beneficial to expose that via the bundle config (and add it to the upstream specification). I'd love to have input here.

samuelkarp avatar Oct 29 '21 06:10 samuelkarp

In addition to fitting best with the existing approaches for other OSes, using a nested approach works around the lack of FreeBSD network tooling in Linux container images. Of course that could be solved by mounting statically linked binaries for that purpose into the Linux container, but that imho has more moving parts and feels more likely to break. The simplest solution to me was to create a base jail with its root set to /, so that all tools from the host are available and are of the correct versions.

gizahNL avatar Oct 29 '21 06:10 gizahNL

create a base jail with its root set to /

I'm not sure how much of a risk that would be on FreeBSD but it's something I'd generally avoid doing on Linux as it could increase risks related to container breakout or data exfiltration.

I wonder if it would be better to create a rootfs with just the set of tools that you need and then create a base jail from that rootfs.

samuelkarp avatar Oct 29 '21 06:10 samuelkarp

create a base jail with its root set to /

I'm not sure how much of a risk that would be on FreeBSD but it's something I'd generally avoid doing on Linux as it could increase risks related to container breakout or data exfiltration.

I wonder if it would be better to create a rootfs with just the set of tools that you need and then create a base jail from that rootfs.

Yes, that would also work. I'd argue the risk is minimal since no code would be running in the base jail, except the networking configuration commands, which are fired off by a tool that I assume already has full root access. For the sake of minimising risk it makes sense, of course.

gizahNL avatar Oct 29 '21 08:10 gizahNL

Thanks for the excellent write-up. I am really nervous about anything involving nested jails because you need to be very careful to avoid jail escapes when you use nested jails. There are a bunch of race conditions with filesystem access and you have to make sure that the outer jail will never do any of the things that will allow the inner jail to exploit them. This is why nested jails weren't supported for so long and why they came with big warnings when they were introduced.

Can you clarify a bit what you mean by the expectation that jails share an IP address? Does this assume that the code inside the jail sees the public IP address (i.e. no port forwarding / NAT)? Or just that any jail can establish connections with another on the same machine unless explicitly prevented by a firewall? Between NetGraph and VNET, there's a huge amount of flexibility in what can be expressed. I believe the most common idiom is for each jail to have a private IP address that is locally bridged, so any jail (and the host) can connect to any other jail's IP address, but for public services they must have explicit port forwarding. Outbound connections are NAT'd for IPv4; for IPv6 it's a bit simpler, as the jail's IP address can be made public and inbound ports can be either explicitly opened or blocked.
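
(For readers less familiar with pf, that idiom boils down to rules along these lines in /etc/pf.conf; the interface name, addresses, and ports are examples only.)

ext_if = "em0"
jail_net = "10.0.100.0/24"

# outbound IPv4 from the jails is NAT'd to the host's address
nat on $ext_if inet from $jail_net to any -> ($ext_if)

# inbound: explicitly forward a public port to a service in one jail
rdr on $ext_if inet proto tcp from any to any port 8080 -> 10.0.100.2 port 80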

If that abstraction works for K8s, then that's great but otherwise I'd like to understand a bit more about what it wants.

As a high-level point: The FreeBSD Foundation has now committed to investing in container support for FreeBSD, with the remainder of this year being spent on building a concrete plan. We shouldn't try to work around missing features in FreeBSD, we should document what is missing. If FreeBSD's Linux ABI layer needs the ioctls for Linux's ip command to work and they don't currently, then we should raise that. If we need to be able to assign the same VNET instance to multiple jails, that's also something that we can ask for.

davidchisnall avatar Oct 29 '21 09:10 davidchisnall

I am really nervous about anything involving nested jails because you need to be very careful to avoid jail escapes when you use nested jails. There are a bunch of race conditions with filesystem access and you have to make sure that the outer jail will never do any of the things that will allow the inner jail to exploit them. This is why nested jails weren't supported for so long and why they came with big warnings when they were introduced.

Oof, if that is still true and nested jails are still very risky compared to normal jails that would make that strategy a no-go.

The FreeBSD Foundation wanting to work on container support is great news.

Top of my wishlist would indeed be to decouple vnet from jails (I looked at the kernel code; it's relatively doable, but also a bit much for a "my first kernel code" project, so I didn't go there ;) )

Second would be the ability to configure a vnet instance from the host OS without depending on anything inside the vnet jail. ifconfig, route, pfctl and co. taking a jail parameter would likely be enough to start, though a nicer thing would be a (relatively) simple API to do most networking tasks (I couldn't get myself to grok the ioctl-style configuration for interfaces & pf yet; it all seemed quite dense, and from what I read, at least wrt ifconfig, those ioctls are not really meant to be programmed against directly).

For me personally, getting the Linux ip command to work is of lesser importance (I think it doesn't use ioctls anymore but netlink, a socket type invented for this purpose). I don't think there are many containers that depend on doing their own network configuration.

gizahNL avatar Oct 29 '21 17:10 gizahNL

I am really nervous about anything involving nested jails because you need to be very careful to avoid jail escapes when you use nested jails. There are a bunch of race conditions with filesystem access and you have to make sure that the outer jail will never do any of the things that will allow the inner jail to exploit them. This is why nested jails weren't supported for so long and why they came with big warnings when they were introduced.

Thanks for the heads up. I'd love to read more about this if you have any resources handy.

Can you clarify a bit what you mean by the expectation that jails share an IP address?

Yes, absolutely. I'm going to respond to your statements out-of-order since I think that'll make the answers more clear.

I believe the most common idiom is for each jail to have a private IP address that is locally bridged so any jail (and the host) can connect to any other jail's IP address but for public services they must have explicit port forwarding.

This is the default networking mode in Docker (also called "bridge" mode). On Linux, Docker creates a bridge (docker0 by default) and uses a veth pair to connect the container to the bridge. The bridge has a defined subnet (172.17.0.0/16) and Docker handles IPAM. Outbound connections are NAT'd, and port forwarding can be configured to expose services.
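
(To help map this onto FreeBSD concepts, the moving parts Docker automates on Linux look roughly like the sketch below; the bridge, namespace, and addresses are illustrative.)

# bridge with the gateway address for the container subnet
ip link add docker0 type bridge
ip addr add 172.17.0.1/16 dev docker0
ip link set docker0 up

# veth pair: one end on the bridge, the other moved into the container's netns
ip netns add ctr1
ip link add veth-host type veth peer name veth-ctr
ip link set veth-host master docker0 up
ip link set veth-ctr netns ctr1

# inside the namespace: address plus a default route via the bridge
ip netns exec ctr1 ip addr add 172.17.0.2/16 dev veth-ctr
ip netns exec ctr1 ip link set veth-ctr up
ip netns exec ctr1 ip link set lo up
ip netns exec ctr1 ip route add default via 172.17.0.1

# NAT outbound traffic from the container subnet
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE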

Orchestrators like Amazon ECS use this mode by default and have placement logic to handle conflicts on exposed ports. Kubernetes, on the other hand, has explicitly chosen to avoid this and try to present a simpler model to applications running within the cluster (at the cost of additional complexity for the person operating the cluster).

Does this assume that the code inside the jail sees the public IP address (i.e. no port forwarding / NAT)? Or just that any jail can establish connections with another on the same machine unless explicitly prevented by a firewall? Between NetGraph and VNET, there's a huge amount of flexibility in what can be expressed. [...] If that abstraction works for K8s, then that's great but otherwise I'd like to understand a bit more about what it wants.

Not precisely. Let's talk about Kubernetes specifically for a moment. The Kubernetes project has documentation on the networking model but I'll attempt to summarize as well. In Kubernetes, there are two assumptions that are core to the networking model: (1) processes in the same pod (regardless of which container they're in) have a view of the network as if they were just processes running on the same machine; i.e., they can communicate with each other over localhost and will conflict with each other if they attempt to expose services on the same port, and (2) all pods within the cluster can communicate with each other by using the pod's IP address (i.e., the IP address is routable within the cluster) without NAT.

For (1) this is accomplished by sharing a network namespace. Each container in the pod sees the exact same set of network interfaces; there is no isolation between them. localhost in one container is the same localhost in another container in the same pod, eth0 in one container is the same eth0 in another container in the same pod, etc.

For (2), a CNI plugin (or set of chained CNI plugins) is responsible for adding an interface to the pod's network namespace that the pod (i.e., all the containers in the pod) can use for its outbound connections (and exposed services). A CNI plugin (the same or another) is responsible for IPAM within the cluster; pods do not typically have public (Internet-routable) IPv4 addresses and instead typically have a private-range address. There are a variety of mechanisms to do this; various types of overlay networks or vlan setups are common, or cloud providers like AWS may integrate with that cloud's network primitive (VPC) and attach an interface to the host (i.e., an ENI).
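
(For concreteness, a typical network configuration handed to the reference CNI bridge plugin with host-local IPAM looks like the snippet below; the name, bridge, and subnet are illustrative.)

{
  "cniVersion": "0.4.0",
  "name": "podnet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.88.0.0/16",
    "routes": [ { "dst": "0.0.0.0/0" } ]
  }
}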

Stepping away from Kubernetes, orchestrators like Amazon ECS can do this too, though it's less core to their networking models. Either way, the underlying primitive being used is the ability to share a network namespace among a set of containers rather than giving each container its own, isolated, view of the network.

As a high-level point: The FreeBSD Foundation has now committed to investing in container support for FreeBSD, with the remainder of this year being spent on building a concrete plan. We shouldn't try to work around missing features in FreeBSD, we should document what is missing.

Networking-wise: if FreeBSD does not already have a mechanism for a set of jails to share interfaces/view of the network (like shared network namespaces in Linux), I think that would be a very useful thing to add. I don't know enough about FreeBSD networking yet to know if that is the case or to know if @gizahNL's suggestion for sharing vnet instances is the right approach (though from my limited reading that does sound correct).

Second would be the ability to configure a vnet instance from the host OS without depending on anything inside the vnet jail

This also sounds useful, but could be worked around. On Linux, namespaces are garbage-collected by the kernel unless there is either an active process or mount holding the namespace open. In order to have a network namespace with a lifetime decoupled from the containers that make up a pod (in Kubernetes) or a task (in Amazon ECS), a common technique is to create a "pause container" that exists just to hold the namespace open and give an opportunity for that namespace to be fully configured (e.g., for the CNI plugins to run) ahead of the workload starting. A similar technique could be used here (if vnet sharing is the right approach) where a jail is created with the necessary tools for the express purpose of configuring the vnet.
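
(If vnet sharing does turn out to be the right approach, a FreeBSD analogue of the pause-container trick might look like the untested sketch below; the names, the epair, and the addresses are placeholders, and the epair is assumed to already exist.)

# a long-lived "pod" jail that owns the vnet; "persist" keeps it alive with no
# processes, playing the role the pause container plays on Linux
jail -c name=pod0 path=/ persist vnet=new vnet.interface=epair0b

# configure the vnet from the host by running the host's tools inside it
jexec pod0 ifconfig epair0b inet 10.88.0.2/24 up
jexec pod0 route add default 10.88.0.1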

I'm not sure what else would be useful to add to FreeBSD yet; I'm sure we'll all learn more as we continue to talk and experiment.

samuelkarp avatar Oct 30 '21 07:10 samuelkarp

Thinking about this a bit more, it feels like Docker is a much better fit for the non-VNET model. The jails I used to manage had a very simple networking setup. I created a new loopback adaptor (lo1) and assigned them each an IP on that. They could all communicate, because they were on the same network interface. I then used pf to NAT these IPs and forward ports.
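
(A condensed version of that setup, for reference; the addresses, names, and ports are examples.)

# clone a second loopback and give each jail an alias on it
ifconfig lo1 create
ifconfig lo1 inet 10.0.1.1/32 alias
ifconfig lo1 inet 10.0.1.2/32 alias

# jails are started bound to their alias, e.g.
#   jail -c name=web path=/jails/web ip4.addr=10.0.1.1 command=/bin/sh

# pf then NATs outbound traffic and forwards selected ports, e.g. in /etc/pf.conf:
#   nat on em0 inet from 10.0.1.0/24 to any -> (em0)
#   rdr on em0 inet proto tcp from any to any port 80 -> 10.0.1.1 port 80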

VNET is newer but it isn't necessarily better. It allows more things (for example, raw sockets, which would allow jailed processes to forge the header if they weren't hidden behind a firewall that blocked faked source IPs) and it comes with different scalability issues. With VNET, each jail gets a separate instance of the network stack. This consumes more kernel memory but avoids lock contention. Generally, it's a good choice if you have a lot of RAM, a lot of cores, and a lot of jails, but for deployments with a handful of jails it will add overhead that you don't need. For a client device doing docker build it's probably not better.

For K8s, it's probably worth exposing some of the Netgraph bits to allow more arbitrary network topologies for a particular deployment.

davidchisnall avatar Nov 08 '21 09:11 davidchisnall

Thinking about this a bit more, it feels like Docker is a much better fit for the non-VNET model. The jails I used to manage had a very simple networking setup. I created a new loopback adaptor (lo1) and assigned them each an IP on that. They could all communicate, because they were on the same network interface. I then used pf to NAT these IPs and forward ports.

VNET is newer but it isn't necessarily better. It allows more things (for example, raw sockets, which would allow jailed processes to forge the header if they weren't hidden behind a firewall that blocked faked source IPs) and it comes with different scalability issues. With VNET, each jail gets a separate instance of the network stack. This consumes more kernel memory but avoids lock contention. Generally, it's a good choice if you have a lot of RAM, a lot of cores, and a lot of jails, but for deployments with a handful of jails it will add overhead that you don't need. For a client device doing docker build it's probably not better.

For K8s, it's probably worth exposing some of the Netgraph bits to allow more arbitrary network topologies for a particular deployment.

That won't work afaik, because Docker containers assume localhost to be 127.0.0.1, and assume it to be non-shared. Afaik vnet is needed to give a jail its own loopback networking.

gizahNL avatar Nov 08 '21 10:11 gizahNL

Related Moby issue: https://github.com/moby/moby/issues/33088

gizahNL avatar Nov 08 '21 11:11 gizahNL

That won't work afaik, because Docker containers assume localhost to be 127.0.0.1, and assume it to be non-shared. Afaik vnet is needed to give a jail its own loopback networking.

I don't believe that this is true. If you try to bind to 127.0.0.1 in a non-VNET jail, you will instead bind to the first IP provided to the jail. If you create a lo1 and assign a jail the IP 127.0.0.2 there, then the jail attempting to bind to 127.0.0.1:1234 will instead bind to 127.0.0.2:1234 on lo1, and lo0 for the host will be completely unaffected.
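
(A quick, untested way to see that behaviour; the address and port are arbitrary.)

# host: second loopback with an address for the jail
ifconfig lo1 create
ifconfig lo1 inet 127.0.0.2/32 alias

# non-VNET jail restricted to that address
jail -c name=lotest path=/ persist ip4.addr=127.0.0.2

# a bind to 127.0.0.1 inside the jail is rewritten to the jail's address
jexec lotest nc -l 127.0.0.1 1234 &
sockstat -4 -l | grep 1234    # the listener shows up on 127.0.0.2:1234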

davidchisnall avatar Nov 08 '21 12:11 davidchisnall

The kubernetes model groups containers into 'pods' which share a network namespace and there is an explicit expectation that containers in the pod can communicate via localhost (https://kubernetes.io/docs/concepts/workloads/pods/#pod-networking).

In this model, nothing runs at the pod level so there should be no issues with a two-level jail structure with the pod's jail owning the vnet and child jails for each container.
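
(A sketch of that structure in jail.conf terms; the names and paths are placeholders, and runj would presumably drive this through the jail API rather than jail.conf.)

pod0 {
  path = "/";
  persist;
  vnet;
  vnet.interface = "epair0b";
  children.max = 8;    # allow nested container jails
}

# container jails are children of pod0 and see pod0's network stack
pod0.app1 {
  path = "/containers/app1";
  persist;
  ip4 = inherit;
  ip6 = inherit;
}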

dfr avatar Apr 26 '22 06:04 dfr

I'm highly interested in getting basic CNI support in runj to support basic networking for Linux containers.

According to the CNI spec, the runtime needs to execute the CNI plugin.

I know there are efforts to port the CNI-supported plugins to FreeBSD, but I'm working on a pretty minimal and only-partially compliant placeholder CNI plugin for use with the new containerd support for FreeBSD.

As CNI support in the runtime would be critical to Kubernetes node-level support, is there any work done on adding that into runj? I can try to work on it to some degree but my Go skills are pretty basic.

kbruner avatar Sep 07 '22 15:09 kbruner

I have some mostly working CNI plugins for FreeBSD here: https://github.com/dfr/plugins/tree/freebsd. These assume the 'netns' parameter for the plugin is the name of a VNET jail. The container jail is nested in the VNET jail which lets all the containers in a pod communicate via localhost.

dfr avatar Sep 07 '22 15:09 dfr

Also, as far as I can tell from working with the github.com/containers stack, common practice is for CNI plugins to be executed by the container engine (e.g. podman, buildah, cri-o, containerd), initialising a network namespace (or jail, for FreeBSD) which is passed to the runtime via the runtime spec.

dfr avatar Sep 07 '22 16:09 dfr

I'm more interested in the Linux container side. I have no idea what's actually involved there as far as shoehorning that support into containerd and/or runj, or how much that overlaps with support for jails.

kbruner avatar Sep 07 '22 16:09 kbruner

@dfr is correct; CNI support should be in the caller of runj rather than runj itself. containerd supports CNI plugins in its CRI implementation today. runj needs to support the networking primitives that the CNI plugins would then configure (the equivalent of a network namespace on Linux). I'm also interested in supporting networking outside CNI in the context of what jail(8) already supports.

samuelkarp avatar Sep 08 '22 04:09 samuelkarp

I have some mostly working CNI plugins for FreeBSD here: https://github.com/dfr/plugins/tree/freebsd.

:+1:

These assume the 'netns' parameter for the plugin is the name of a VNET jail. The container jail is nested in the VNET jail which lets all the containers in a pod communicate via localhost.

I think you can just change the parameter name from netns to something like vnet. The plugin name could also be changed to something like freebsd-vnet or freebsd-bridge (for consistency with win-bridge).

AkihiroSuda avatar Sep 08 '22 07:09 AkihiroSuda

I think you can just change the parameter name from netns to something like vnet. The plugin name could also be changed to something like freebsd-vnet or freebsd-bridge (for consistency with win-bridge).

I like the idea of changing the parameter name - I'll look into that. I'm mostly against changing the plugin name - I like it being called 'bridge' for consistency with Linux - this means that things like 'podman network create' just work on FreeBSD.

dfr avatar Sep 08 '22 09:09 dfr

In #32 I've added a mechanism for runj to model FreeBSD extensions to the runtime spec and added a couple networking-related settings using that mechanism. The end result is that runj can now configure jails to have access to the host's IPv4 network stack (similar to host networking for Linux containers). I'd be happy to take more contributions using this mechanism that model additional network settings (including those that might be needed by CNI plugins like interfaces and VNET settings) as well as modeling parent-child jail relationships.

samuelkarp avatar Sep 11 '22 05:09 samuelkarp

I had been thinking of just exposing an interface which allows the container engine to explicitly set jail parameters for the container. Not sure which is best but this approach puts the policy choices for the container in the engine which makes sense to me. A possible approach might look like https://github.com/dfr/runtime-spec/commit/2caca1237bffab13b0e41ae00d35cf17d9b3394c

dfr avatar Sep 11 '22 15:09 dfr

@dfr Thanks, that's an interesting approach. I think it's reasonable as a prototyping mechanism that we could add to runj, but probably not something I think would be appropriate to upstream into the spec itself. I would expect the spec to have a slightly higher level of abstraction such that the backend could be swapped out for something that isn't a jail (for example, possibly a bhyve VM) but still supports largely the same set of FreeBSD-specific features. As an example, the Linux portion of the spec models cgroups (which are used for resource limits) but it doesn't specify the exact materialization into cgroupfs.

samuelkarp avatar Sep 13 '22 05:09 samuelkarp

I've started playing around with vnet and trying to set up a bridged network similar to what Docker does on Linux, but I'm having trouble figuring out what I'm missing (probably both that I'm misunderstanding exactly what Docker is doing and that I'm failing to translate that to FreeBSD). On Linux, Docker creates a bridge and then a veth pair for each container, adding one end to the bridge and moving the other end into the container. Inside the container, the veth is set up with an IP address and that IP is then used as the next hop for the default route. There is also a set of iptables rules created on the host, though I'm not sure if those are used for normal traffic forwarding or are primarily used for exposing ports. The bridge is a separate non-overlapping CIDR from the host's network (172.17.0.0/16 by default) and something (?) is performing NAT.

iptables configuration
# Generated by iptables-save v1.8.7 on Tue Nov 29 19:30:42 2022
*filter
:INPUT ACCEPT [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
:DOCKER - [0:0]
:DOCKER-ISOLATION-STAGE-1 - [0:0]
:DOCKER-ISOLATION-STAGE-2 - [0:0]
:DOCKER-USER - [0:0]
-A FORWARD -j DOCKER-USER
-A FORWARD -j DOCKER-ISOLATION-STAGE-1
-A FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A FORWARD -o docker0 -j DOCKER
-A FORWARD -i docker0 ! -o docker0 -j ACCEPT
-A FORWARD -i docker0 -o docker0 -j ACCEPT
-A DOCKER-ISOLATION-STAGE-1 -i docker0 ! -o docker0 -j DOCKER-ISOLATION-STAGE-2
-A DOCKER-ISOLATION-STAGE-1 -j RETURN
-A DOCKER-ISOLATION-STAGE-2 -o docker0 -j DROP
-A DOCKER-ISOLATION-STAGE-2 -j RETURN
-A DOCKER-USER -j RETURN
COMMIT
# Completed on Tue Nov 29 19:30:42 2022
# Generated by iptables-save v1.8.7 on Tue Nov 29 19:30:42 2022
*nat
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:DOCKER - [0:0]
-A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
-A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A DOCKER -i docker0 -j RETURN
COMMIT
# Completed on Tue Nov 29 19:30:42 2022

I've been able to follow this guide to bridge an epair inside a jail with the primary interface in my VM and allow the jail to initiate DHCP from the network attached to the VM (in this case, VirtualBox's built-in DHCP server).

That's not quite the same thing though. I can also omit DHCP and do static IP addressing for the bridge and for the epair interfaces (either side?), though no matter what I do I don't have a working bidirectional network. I suspect packets are being sent but nothing is being received back, and as I'm typing this out to explain what I'm seeing I'm thinking that I'm likely missing something about configuring NAT.

Here's what I've been doing

On the host/VM:

# ifconfig bridge0 create
# ifconfig epair0 create
# ifconfig bridge0 inet 172.17.0.0/16
# ifconfig bridge0 \
    addm em0 \
    addm epair0a \
    up

In my jail.conf:

  vnet;
  vnet.interface = "epair0b";
  allow.mount;
  allow.raw_sockets = 1;
  mount.devfs;

Inside the jail:

# ifconfig epair0b link 00:00:00:00:00:01
# ifconfig epair0b inet 172.17.0.2/32
# route add -net default 172.17.0.2

I've also tried:

  • setting the IP address on epair0a
    • the same as epair0b
    • different from epair0b
  • setting the default route to be 172.17.0.1 or 172.17.0.0
  • having only epair0a on the bridge without em0

I'm going to continue looking, but figured I'd post here in case anyone has suggestions/pointers for me to look at.

Meanwhile: I'll be adding vnet and vnet.interface parameters to runj so at least the workflow described in this guide could be adapted to work for runj.

samuelkarp avatar Nov 30 '22 03:11 samuelkarp

I've started playing around with vnet and trying to set up a bridged network similar to what Docker does on Linux, but I'm having trouble figuring out what I'm missing (probably both that I'm misunderstanding exactly what Docker is doing and that I'm failing to translate that to FreeBSD).

You can have a look at the VNET jail management in CBSD. I use VNET jails managed by CBSD now. There is an epair interface for every jail and a bridge interface to go out of the jails. To communicate with the external world there are some pf (or ipfw) rules - automatic 'hide' NAT and incoming PAT managed by 'cbsd expose'.

Peter2121 avatar Nov 30 '22 08:11 Peter2121

I use a very similar approach to handle networking for podman and buildah. Take a look at https://github.com/dfr/plugins/tree/freebsd - the code which manages the epairs is in pkg/ip/link_freebsd.go. Interface addresses are assigned from a private address pool using ipam and NAT is enabled by putting those addresses into a PF table used by nat rules in /etc/pf.conf.

These plugins are in the ports tree and you can install them with pkg install containernetworking-plugins. I believe that containerd supports CNI so you may be able to use this directly. The name of the vnet jail is passed in via the CNI_NETNS environment variable (typically managed by github.com/containernetworking/cni/libcni). This could be the container jail but requires compatible ifconfig and route binaries inside the container. As you know, for podman/buildah I use a separate jail for networking with containers as children of the networking jail.
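
(For reference, a raw plugin invocation with the jail name in CNI_NETNS would look roughly like the sketch below; the container ID, jail name, plugin path, and config file are placeholders.)

CNI_COMMAND=ADD \
CNI_CONTAINERID=pod0-container \
CNI_NETNS=pod0 \
CNI_IFNAME=eth0 \
CNI_PATH=/usr/local/libexec/cni \
/usr/local/libexec/cni/bridge < bridge-net.conf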

dfr avatar Nov 30 '22 10:11 dfr

@dfr thanks for that! I've tried reading through the code in the bridge plugin and I'm ending up with steps that are roughly the same as what I was doing (and I'm running into similar problems). I did find where the PF table is manipulated, but I'm guessing that I'm missing the table creation since all I see are add and delete commands.

Here's what I've been trying
  1. Create the bridge: ifconfig bridge create name bridge0
  2. Create the epair: ifconfig epair create
  3. Set a description (I didn't know this was a thing!): ifconfig epair0a description "host-side interface"
  4. Set a mac address on the jail-side interface: ifconfig epair0b link 00:00:00:00:00:01
  5. Add the host-side interface to the bridge: ifconfig bridge0 addm epair0a
  6. Bring the host-side interface up: ifconfig epair0a up
  7. Add an IP and subnet mask to the bridge: ifconfig bridge0 alias 172.17.0.1/16
  8. Enable IP forwarding: sysctl net.inet.ip.forwarding=1
  9. Add the jail IP address to a PF table: pfctl -t jail-nat -T add 172.17.0.2/32
  10. Start a jail and pass the epair0b interface into the vnet (I did this in a jail.conf file)
  11. (inside the jail) Assign the IP to the interface: ifconfig epair0b inet 172.17.0.2/32
  12. (inside the jail) Bring the interface up: ifconfig epair0b up
  13. (inside the jail) Add a route to the bridge, using the epair0b IP as the gateway: route -4 add 172.17.0.1/16 172.17.0.2
  14. (inside the jail) Add a default route using the bridge gateway: route -4 add default 172.17.0.1

However if I try to ping an IP address (8.8.8.8, for example) I get this output: ping: sendto: Invalid argument

From outside the jail, the route table looks like this:

% netstat -nr
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            10.0.2.2           UGS         em0
10.0.2.0/24        link#1             U           em0
10.0.2.15          link#1             UHS         lo0
127.0.0.1          link#2             UH          lo0
172.17.0.0/16      link#3             U       bridge0
172.17.0.1         link#3             UHS         lo0

Internet6:
Destination                       Gateway                       Flags     Netif Expire
::/96                             ::1                           UGRS        lo0
::1                               link#2                        UHS         lo0
::ffff:0.0.0.0/96                 ::1                           UGRS        lo0
fe80::/10                         ::1                           UGRS        lo0
fe80::%em0/64                     link#1                        U           em0
fe80::a00:27ff:fef3:cd05%em0      link#1                        UHS         lo0
fe80::%lo0/64                     link#2                        U           lo0
fe80::1%lo0                       link#2                        UHS         lo0
ff02::/16                         ::1                           UGRS        lo0

From inside the jail, it looks like this:

# netstat -nr
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            172.17.0.1         UGS     epair0b
172.17.0.0/16      172.17.0.2         UGS     epair0b
172.17.0.2         link#2             UH          lo0

I see the following line in dmesg:

arpresolve: can't allocate llinfo for 172.17.0.1 on epair0b

samuelkarp avatar Dec 01 '22 07:12 samuelkarp

The table is automatically created when something is added. It looks like you are doing everything right - I believe the error is coming from the jail itself. Try adding allow.raw_sockets to the jail config.

dfr avatar Dec 01 '22 08:12 dfr

This is the current jail.conf I'm using:

foo {
  host.hostname = "jail3";
  path = "/";
  persist;
  vnet;
  vnet.interface = "epair0b";
  allow.mount;
  allow.raw_sockets = 1;
  mount.devfs;
  devfs_ruleset = 110;
}

(You can see I'm very creative with names like "foo" and "jail3"). The devfs ruleset is:

[devfsrules_jail_vnet_sam=110]
add include $devfsrules_hide_all
add include $devfsrules_unhide_basic
add include $devfsrules_unhide_login
add include $devfsrules_jail
add include $devfsrules_jail_vnet
add path 'bpf*' unhide

(this was needed for dhclient to work per the guide I was following before)

I've tried this both with an empty /etc/pf.conf (since I didn't have PF set up at all before) and with this content:

pass from bridge0:network to any keep state

I see that you responded from email; I updated that comment on GitHub with a bit more information there too.

samuelkarp avatar Dec 01 '22 08:12 samuelkarp

This is my pf.conf. The weird v4fib0egress stuff comes from sysutils/egress-monitor - you can replace it with the outgoing interface name:

v4egress_if = "v4fib0egress"
v6egress_if = "v6fib0egress"
nat on $v4egress_if inet from <cni-nat> to any -> ($v4egress_if)
nat on $v6egress_if inet6 from <cni-nat> to !ff00::/8 -> ($v6egress_if)
rdr-anchor "cni-rdr/*"
table <cni-nat>

dfr avatar Dec 01 '22 08:12 dfr