Proposal to refactor some internal workings

Open dlamotte opened this issue 7 years ago • 2 comments

We've been using kube-router for a while, and I've hacked on it enough to have some idea of what's going on internally. I'd like to propose some core architectural changes to how services are handled internally and how they interoperate with the host OS.

I want to say thank you for all the hard work that has gone into kube-router so far. It's exposing amazing abilities to end users and I love the components and where it's at today. It's definitely a great solution and I hope to help make it better. So I'm hoping I don't come across in the wrong way with this proposal. I thought it'd be better to propose something and discuss before writing code.

Why am I proposing this? We've been stumbling into odd edge cases caused by the organic growth of the code base, which gives me some pause as we roll this into production in our on-premises environment. We're able to submit fixes for those, and I am doing that, but I can't help but see that some foundational changes to the internal structure would make these bugs obvious. This proposal would tighten up the implementation of many features into a single page of code, making such bugs painfully obvious instead of hard to connect across several areas. Using this model, we could more easily grow kube-router's feature set and increase the amount of unit-testable code.

First, change how services are modeled internally to something along the lines of an internal kube-router projection of a service that wraps kubernetes services.

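type KubeRouterService struct {
    // Annotations holds values parsed once from the Kubernetes annotations.
    Annotations struct {
        NodeASN       int64
        PeerASNs      []int64 // assumed: one ASN per peer, matching PeerIPs
        PeerIPs       []net.IP
        PeerPasswords []string
        RRClient      int
        RRServer      int
        Local         bool
    }
    // Service is the wrapped Kubernetes service.
    Service *v1.Service
}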

func NewKubeRouterService(svc *v1.Service) *KubeRouterService {
    // ... parse annotations and set up the new KubeRouterService object ...
    return &KubeRouterService{Service: svc}
}

This will move all annotation parsing to the edges of where Kubernetes Services are consumed by APIs and give a consistent type to use and follow within the code base.
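
A minimal sketch of what parsing at the edge could look like, assuming annotations arrive as a plain `map[string]string` (as they do on Kubernetes objects). The annotation keys, the `parsedAnnotations` type, and the `parseAnnotations` helper are hypothetical names chosen for illustration, and only three of the proposed fields are covered.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// Hypothetical annotation keys, used only for illustration.
const (
	annNodeASN = "kube-router.io/node.asn"
	annPeerIPs = "kube-router.io/peer.ips"
	annLocal   = "kube-router.io/service.local"
)

// parsedAnnotations holds a subset of the fields from the proposed
// KubeRouterService.Annotations struct.
type parsedAnnotations struct {
	NodeASN int64
	PeerIPs []net.IP
	Local   bool
}

// parseAnnotations converts raw string annotations into typed values once,
// so the rest of the code never has to touch the raw map.
func parseAnnotations(raw map[string]string) (parsedAnnotations, error) {
	var out parsedAnnotations

	if v, ok := raw[annNodeASN]; ok {
		asn, err := strconv.ParseInt(v, 10, 64)
		if err != nil {
			return out, fmt.Errorf("annotation %s: %w", annNodeASN, err)
		}
		out.NodeASN = asn
	}

	if v, ok := raw[annPeerIPs]; ok {
		// Peer IPs are assumed to be a comma-separated list.
		for _, s := range strings.Split(v, ",") {
			ip := net.ParseIP(strings.TrimSpace(s))
			if ip == nil {
				return out, fmt.Errorf("annotation %s: invalid IP %q", annPeerIPs, s)
			}
			out.PeerIPs = append(out.PeerIPs, ip)
		}
	}

	if v, ok := raw[annLocal]; ok {
		b, err := strconv.ParseBool(v)
		if err != nil {
			return out, fmt.Errorf("annotation %s: %w", annLocal, err)
		}
		out.Local = b
	}
	return out, nil
}
```

Returning an error from the parser means malformed annotations surface at the point where the service is first consumed, instead of deep inside the sync path.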

Second, building on the KubeRouterService type, we should teach it to install itself onto a system using dependency injection. This will improve our ability to cache the state of the host and to unit test the Sync process, an effort that appears to be underway already with the linuxNetworking interface.

svc := NewKubeRouterService(kubeSvc)

// mythical function along similar lines of the current linuxNetworking
// type that exists in the code base today, deeper explanation later
hostNetwork := NewHostNetworkInterface()

// communicate with hostnetwork interface and bgp routing protocol/RIB
// directly from service
svc.Sync(hostNetwork, bgp)

The Sync interface would be responsible for ensuring all objects within the network are properly sync'd. This includes:

  • ensuring all routes are created across all routing tables
  • ensuring all IPs are added to appropriate interfaces (within container/pod or on host)
  • ensuring all ipvs services are created and endpoints updated
  • ensuring advertisements via bgp are properly created/managed
  • ensuring all aspects of the above that shouldn't exist are forcibly deleted every time, to prevent dead routes/IPs/ipvs services from causing issues (this is the critical piece that I hope becomes simpler with this new architecture)
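
The responsibilities above can be sketched with injected interfaces. This is only an illustration under stated assumptions: `HostNetwork`, `BGP`, the simplified `service` type, the fakes, and the `kube-dummy-if` interface name are all hypothetical stand-ins, not the existing kube-router API, and routes and cleanup are left out for brevity.

```go
package main

import "fmt"

// HostNetwork is a hypothetical abstraction over OS-level networking calls.
type HostNetwork interface {
	EnsureIP(iface, ip string) error
	EnsureIPVSService(vip string, port int, endpoints []string) error
}

// BGP is a hypothetical abstraction over the BGP speaker.
type BGP interface {
	Advertise(vip string) error
}

// service is a simplified stand-in for the proposed KubeRouterService.
type service struct {
	VIP       string
	Port      int
	Endpoints []string
}

// Sync pushes the desired state for one service through the injected
// dependencies and stops at the first error.
func (s *service) Sync(host HostNetwork, bgp BGP) error {
	// The interface name is a placeholder for illustration.
	if err := host.EnsureIP("kube-dummy-if", s.VIP); err != nil {
		return fmt.Errorf("ensure IP %s: %w", s.VIP, err)
	}
	if err := host.EnsureIPVSService(s.VIP, s.Port, s.Endpoints); err != nil {
		return fmt.Errorf("ensure ipvs %s:%d: %w", s.VIP, s.Port, err)
	}
	if err := bgp.Advertise(s.VIP); err != nil {
		return fmt.Errorf("advertise %s: %w", s.VIP, err)
	}
	return nil
}

// fakeHost records calls so a unit test needs no real network access.
type fakeHost struct{ calls []string }

func (f *fakeHost) EnsureIP(iface, ip string) error {
	f.calls = append(f.calls, "ip:"+iface+":"+ip)
	return nil
}

func (f *fakeHost) EnsureIPVSService(vip string, port int, endpoints []string) error {
	f.calls = append(f.calls, fmt.Sprintf("ipvs:%s:%d:%d", vip, port, len(endpoints)))
	return nil
}

// fakeBGP records which VIPs were advertised.
type fakeBGP struct{ advertised []string }

func (f *fakeBGP) Advertise(vip string) error {
	f.advertised = append(f.advertised, vip)
	return nil
}
```

A test can then call `Sync` with `&fakeHost{}` and `&fakeBGP{}` and assert on the recorded calls, which is the unit-testing benefit the dependency injection is meant to provide.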

Third, the HostNetworkInterface type will model all the OS-specific details and allow the state of the host to be loaded into its data structure. This will let us capture the state of all networking once, at the beginning of a tight loop over services, and then pass this state into the Sync method, bypassing expensive calls to the host. It will also let us implement a simple mark-and-sweep garbage collection mechanism that simply and correctly cleans up all dead artifacts left behind by deleted services and/or by outside influence on the host's state (e.g. a route added by hand).
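
A minimal sketch of the mark-and-sweep idea, assuming every host artifact (route, IP, ipvs service) can be reduced to an opaque string ID. The `hostState` type and its methods are hypothetical; a real implementation would rebuild the snapshot from the host at the start of each loop.

```go
package main

import "sort"

// hostState is a hypothetical in-memory snapshot of artifacts found on the
// host, keyed by an opaque string ID.
type hostState struct {
	marked map[string]bool // artifact ID -> wanted by some service?
}

// newHostState loads the snapshot with every existing artifact unmarked.
func newHostState(existing []string) *hostState {
	hs := &hostState{marked: make(map[string]bool, len(existing))}
	for _, id := range existing {
		hs.marked[id] = false
	}
	return hs
}

// Mark records that an artifact is wanted. It reports true when the artifact
// was not in the snapshot, meaning the caller must create it on the host.
func (hs *hostState) Mark(id string) (missing bool) {
	_, exists := hs.marked[id]
	hs.marked[id] = true
	return !exists
}

// Sweep returns, in sorted order, every artifact that nothing marked, and
// drops those entries from the snapshot. The caller deletes them on the host.
func (hs *hostState) Sweep() []string {
	var stale []string
	for id, wanted := range hs.marked {
		if !wanted {
			stale = append(stale, id)
			delete(hs.marked, id)
		}
	}
	sort.Strings(stale)
	return stale
}
```

Each service's Sync would call `Mark` for the artifacts it needs, and a single `Sweep` after the loop yields everything left over, including artifacts created by hand outside of kube-router.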

With these changes, we can act on changes to Kubernetes services incrementally and avoid syncing all services, so the work per service update is O(1) rather than O(n), where n is the number of services in the system. That gives fast convergence of the network. We can also keep the periodic reconciliation loop fast so that it continues to converge on the desired state of the cluster.
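
The split between incremental updates and periodic reconciliation could look roughly like this. The `reconciler` type, its fields, and the `syncOne` callback are hypothetical; services are reduced to a name and a VIP to keep the sketch short.

```go
package main

// reconciler is a hypothetical driver that keeps a local cache of services.
type reconciler struct {
	services map[string]string // service name -> desired VIP (simplified)
	syncOne  func(name, vip string) error
}

// OnUpdate handles a single service event. Only the changed service is
// synced, so the cost does not grow with the number of services.
func (r *reconciler) OnUpdate(name, vip string) error {
	r.services[name] = vip
	return r.syncOne(name, vip)
}

// Resync walks every cached service. It is intended for the periodic
// reconciliation loop, where O(n) work is acceptable.
func (r *reconciler) Resync() error {
	for name, vip := range r.services {
		if err := r.syncOne(name, vip); err != nil {
			return err
		}
	}
	return nil
}
```

With three cached services, one `OnUpdate` triggers a single sync call, while `Resync` triggers one call per service.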

As much as possible, the new APIs should limit direct queries to the Kubernetes API by using local caches via the Informer interface. It's not yet clear how that would best fit into this new model, as I have limited experience with the Informer interface (I realize kube-router uses it already; I just want to cement the point that we should avoid expensive API calls as much as possible).

Thanks again for kube-router, and I hope you find this helpful. I'd love to discuss all aspects of this and figure out how we could reach a common vision for refactoring the internals to realize these goals. As much as possible, I'd love to help write this refactoring in chunks that we are comfortable merging into master. I know you're probably extremely busy as of late, and I really appreciate your time in working with me on this. I hope this proves useful in resolving some of the issues coming into the project.

dlamotte, May 09 '18 17:05

@dlamotte are you still using kube-router? Any of the above ideas already implemented?

@murali-reddy can you give a short update on what your position is on the above proposal?

asteven, Nov 22 '18 09:11

@asteven I really wanted to implement this, but never got the time. At this point, I don't see myself implementing it in the near future unfortunately.

dlamotte, Nov 22 '18 12:11

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot], Sep 06 '23 02:09

This issue was closed because it has been stale for 5 days with no activity.

github-actions[bot], Sep 11 '23 02:09