
Feature Request: Nebula Backend

Open AndreasBBS opened this issue 1 year ago • 4 comments

Expected Behavior

Enable support for the Nebula VPN as a backend when using the flag --flannel-backend=nebula.

Current Behavior

At the moment, the only way I have found to make flannel work with the Nebula VPN is to set the Nebula network interface (--flannel-iface=nebula0) and to set --node-ip and --node-external-ip to the node's IP within the Nebula VPN.
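
Roughly, the workaround looks like the sketch below on each node (a sketch only; the Nebula IP is an example from my 192.168.69.0/24 overlay and differs per node):

```sh
# Example of the current workaround on a k3s server node: pin flannel to the
# Nebula interface and advertise the node's Nebula IP instead of its LAN IP.
k3s server \
  --flannel-iface=nebula0 \
  --node-ip=192.168.69.10 \
  --node-external-ip=192.168.69.10
```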

Possible Solution

I'm not very familiar with how flannel is implemented, but I think a lot of the WireGuard backend implementation should be reusable, given the similar architecture of WireGuard and Nebula when creating mesh VPNs.

Context

I've been wondering whether the flannel team is working on a "--flannel-backend" for the Nebula VPN (https://github.com/slackhq/nebula). I've been looking for information on it but haven't found anything. It should not be very different from the WireGuard backend, and Nebula is an alternative that's gaining a lot of traction for being completely open source and developed by our friends at Slack.

I had success with a k3s cluster (with defaults, Klipper + Traefik) in which all nodes had Nebula configured. I set the adapter to the Nebula adapter and set the node-ip and node-external-ip to the IPs on the Nebula network (192.168.69.0/24). Everything works fine: the nodes talk to each other nicely and I can use ingresses to access the services from within the VPN.

There's only one thing I have not managed to do. I want the ingresses to be reachable only through the Nebula network (192.168.69.0/24), but I am also able to hit the ingresses and get the services served when I disable the Nebula network and point DNS at the nodes' "real IPs" (in this case the nodes all have "real IPs" in my home network, 192.168.30.0/24).

I would like to add some VPS servers to my cluster by adding them to the Nebula network, but I don't feel comfortable exposing my internal cluster on a VPS node, because anyone could hit my ingresses and see my services just by manipulating their DNS. Basically I want a hybrid setup: a private cluster that has some nodes on the public internet. I seem to be doing something wrong, because in my mind setting the node-ip, node-external-ip and the network adapter should be all it takes to make sure the cluster is only accessible from within the Nebula VPN. If anyone has any tips I'd be really happy, because at the moment my infrastructure at home is maxed out and I'd like to be able to rent some nodes outside.

Also, if there's already any official support for the Nebula VPN in flannel, it would be awesome to be pointed in the right direction.

Your Environment

(This is the environment I mention in the context section)

  • K3S version: v1.24.4+k3s1
  • Flannel version: v0.19.1
  • Backend used (e.g. vxlan or udp): vxlan
  • Etcd version: v3.5.3-k3s1
  • Kubernetes version (if used): 1.25.0
  • Operating System and version: Ubuntu Server 22.04 (Same on all nodes)

AndreasBBS avatar Sep 15 '22 10:09 AndreasBBS

Thanks for the proposal, I believe it makes a lot of sense. I think the first approach should be trying to integrate Nebula using the extension backend. That backend is a sort of general-purpose backend that lets you specify, in the flannel conf, the commands that must be run when specific events happen (new subnet, updated subnet, etc.). Here is more documentation: https://github.com/flannel-io/flannel/blob/master/Documentation/extension.md. Could you have a look and verify whether that could be a possible way forward?
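
Just to illustrate the shape of it: the extension backend is configured through flannel's net-conf, so a Nebula integration would roughly mean filling in commands like in the sketch below (the helper script paths are placeholders I made up; the actual Nebula commands are exactly what would need to be figured out):

```json
{
  "Network": "10.42.0.0/16",
  "Backend": {
    "Type": "extension",
    "PreStartupCommand": "nebula-cert keygen -out-key /etc/nebula/host.key -out-pub /etc/nebula/host.pub",
    "PostStartupCommand": "/usr/local/bin/nebula-up.sh",
    "SubnetAddCommand": "/usr/local/bin/nebula-subnet-add.sh",
    "SubnetRemoveCommand": "/usr/local/bin/nebula-subnet-remove.sh"
  }
}
```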

In my free time, I'm building a PoC where I am trying to integrate tailscale and I am using that extension backend. I can show it to you once I have a stable version :)

manuelbuil avatar Sep 15 '22 16:09 manuelbuil

One important aspect we should consider is how we specify the Nebula interface to kubelet. By the time flannel executes all the Nebula commands and thus sets up the Nebula interface, kubelet will already have selected an interface and picked an IP. As a consequence, all your "control-plane" traffic will use a local interface, whereas your "data-plane" traffic will use the Nebula interface. That means you will not be able to have a "hybrid" cluster, because kubelet will not be able to find kube-api, right?
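
For example (the values are illustrative, reusing the flags from the workaround above): joining an agent over the server's Nebula address assumes the Nebula interface and routes already exist when the agent and kubelet start, which would not be the case if flannel itself is the one bringing Nebula up:

```sh
# Illustrative only: this join over the Nebula overlay can only work if the
# node can already reach 192.168.69.10, i.e. Nebula must be up before k3s.
k3s agent \
  --server https://192.168.69.10:6443 \
  --node-ip=192.168.69.11
```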

manuelbuil avatar Sep 15 '22 16:09 manuelbuil

I've been reading the documentation you sent and it's been very helpful! I've been thinking about how this could be implemented, and this is what I have so far. (When the documentation says "remote host that was added", I assume the host is the node, so I'll be referring to nodes. Also, as I understand it, these commands run on all nodes.)

  • The nodes need to have both the nebula binary (this binary seems to be responsible for creating the TUN network interface based on a config.yml file and handling the connections that go through the TUN device; it looks a bit like socat to me) and the nebula-cert binary, to be able to sign the certificates that allow each interface to participate in the network.
  • The PreStartupCommand would create a public/private key pair with nebula-cert keygen
  • The PostStartupCommand would (see the sketch after this list)
    • send the public key from the PreStartupCommand to the host that has the CA
    • wait for a signed certificate
    • create a config.yml with the received signed certificate and the SUBNET variable
    • start the nebula interface with nebula -config config.yml
  • The SubnetAddCommand would add the subnet to the remote_allow_list in the config.yml (and restart the nebula process?)
  • The SubnetRemoveCommand would remove the subnet from the remote_allow_list in the config.yml (and restart the nebula process?)
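
To make the PostStartupCommand idea more concrete, here is a rough sketch of what such a helper script could look like. The file paths and the lighthouse address are assumptions on my part, and where the per-subnet handling (the remote_allow_list idea above) would go is still one of my open questions below:

```sh
#!/bin/sh
# Hypothetical PostStartupCommand helper ("nebula-up.sh"): a sketch only.
# Assumes the signed certificate has already been copied to /etc/nebula/host.crt
# and that the SUBNET variable is available to the command as described above.
set -eu

cat > /etc/nebula/config.yml <<EOF
pki:
  ca: /etc/nebula/ca.crt
  cert: /etc/nebula/host.crt
  key: /etc/nebula/host.key
lighthouse:
  am_lighthouse: false
  hosts:
    - "192.168.69.1"          # example lighthouse Nebula IP
static_host_map:
  "192.168.69.1": ["lighthouse.example.org:4242"]
tun:
  dev: nebula0
firewall:
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    - port: any
      proto: any
      host: any
# TODO: wire the subnet information (\$SUBNET) in here, e.g. the remote_allow_list idea
EOF

# Bring up the Nebula interface in the background (a real setup would probably
# use a systemd unit instead).
nebula -config /etc/nebula/config.yml &
```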

I still have a lot of questions, and I'm not even close to having a correct conceptual model for how to do this. I hope I'm not bothering you too much with the following:

  • Do you think I'm going in the right direction with the proposal above?
  • How do you go about setting up a development environment to test these extensions, and what type of tests do you do? What are you checking for when testing the extensions?
  • I'm having trouble conceptualizing:
    • where would the signing CA live (control-plane nodes, all nodes, etcd, user-owned)?
    • which nodes should be the lighthouse nodes (my intuition says the control-plane nodes, but maybe not)?
    • when you sign a certificate you sign it for a specific IP, so would you need a certificate for each container?
    • what would adding and removing a subnet really mean in this context? For now I think it would mean managing the remote_allow_list, but I'm not sure that's the right approach

Of course I don't expect you to have the answers to all these questions, especially the conceptual ones. I'm mostly just leaving my thoughts here to try to foster a healthy discussion of how this can be done. I'm still studying both Nebula and flannel in more depth, because I don't feel my current understanding of either technology is good enough to do this yet. I'll keep posting as I make progress.

NOTE: While researching I found this deprecated implementation of the WireGuard backend using the extension backend; it was a very useful example of how to use the extension.

AndreasBBS avatar Sep 17 '22 12:09 AndreasBBS

I think you are going in the right direction. If I remember correctly, the SubnetAddCommand is only called if the clusterCIDR in the node object changes, i.e. it is not called when deploying the cluster. Therefore, if "adding a subnet to the remote_allow_list in the config.yml" is required for things to work in Nebula, you should do that as part of the PostStartupCommand too.

How do you go about setting up a development environment to test these extensions, and what type of tests do you do? What are you checking for when testing the extensions?

You should deploy k8s and then deploy flannel using the extension backend with the described commands. Then verify that pods come up and that you can ping between pods. Then check that the traffic is going through the Nebula interfaces.
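
Roughly, the kind of checks I mean (a sketch; the busybox image and the pod IP are just examples):

```sh
# Check that pods schedule, that a pod can ping a pod on another node, and
# that the cross-node traffic actually goes through the Nebula interface.
kubectl get pods -A -o wide
kubectl run ping-test --image=busybox --restart=Never -- ping -c 3 10.42.1.5   # a pod IP on another node
sudo tcpdump -ni nebula0
```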

where would the signing CA live (control-plane nodes, all nodes, etcd, user-owned)?

I guess this depends on the user's preference. Any place is fine as long as it is reachable by the rest of the nodes.

which nodes should be the lighthouse nodes (my intuition says the control-plane nodes, but maybe not)?

What is a lighthouse node? Sorry, I'm not familiar with how Nebula works

when you sign a certificate you sign it for a specific IP, so would you need a certificate for each container?

If Nebula works like other VPNs, the host IP should be enough. You want the pods' traffic to be encapsulated inside the VPN.

what would adding and removing a subnet really mean in this context? For now I think it would mean managing the remote_allow_list, but I'm not sure that's the right approach

This depends on how Nebula works. I suggest you give it a try.

manuelbuil avatar Sep 19 '22 05:09 manuelbuil

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 18 '23 13:03 stale[bot]