KEP-2104: kube-proxy v2: reworking the proxy's architecture
Hi folks,
I started the process of writing the KEP for my proposal of a "kube-proxy v2", and I think I'm ready to get some early comments. Thanks in advance for kind reviews and ideas :-)
@thockin
Welcome @mcluseau!
It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.
You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.
You can also check if kubernetes/enhancements has its own contribution guidelines.
You may want to refer to our testing guide if you run into trouble with your tests not passing.
If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!
Thank you, and welcome to Kubernetes. :smiley:
Hi @mcluseau. Thanks for your PR.
I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc
That's next Monday, yes? I am in.
On Wed, Dec 9, 2020 at 1:52 PM Mikaël Cluseau [email protected] wrote:
@mcluseau commented on this pull request.
In keps/sig-network/2104-reworking-kube-proxy-architecture/README.md https://github.com/kubernetes/enhancements/pull/2094#discussion_r539675592 :
+Rewrite the kube-proxy to be a "localhost" gRPC API provider that will be accessible as usual via TCP (127.0.0.1:12345) and/or via a socket (unix:///path/to/proxy.sock).
+- it will connect to the API server and watch resources, like the current proxy;
+- then, it will process them, applying Kubernetes-specific business logic like topology computation relative to the local host;
+- finally, it will provide the result of this computation to clients via a gRPC watchable API.
update: the best choice is currently on 14th at 22:30 UTC (14:30 PST)
That should work for me also. Let me know the invite details.
Works for me.
Yes, it's next Monday, I'll post the link. I'll use Meet unless there's a strong preference for Zoom.
@mcluseau is this still on for today? Is there any info on the meeting address?
@howardjohn yes it's still today, I'll post the link here in 3h (30 min before the meeting)
hi all, meeting link: meet.google.com/jjf-szrn-qre
That's a great example and we should keep it in mind.
On Wed, Dec 16, 2020 at 4:00 PM John Howard [email protected] wrote:
@howardjohn commented on this pull request.
In keps/sig-network/2104-reworking-kube-proxy-architecture/README.md https://github.com/kubernetes/enhancements/pull/2094#discussion_r544710263 :
+## Motivation
+### Goals
+- provide a node-level abstraction of the cluster-wide Service semantics through an API
+- allow easier, more stable proxy implementations that don't need updates when Service business …

So as one example, Istio (and others) need the ServiceAccount from an endpoint. This info is not present. As a result, we would still need to watch the full Endpoint, at which point it probably doesn't make sense to use this new API as well, since it's redundant. SA is just one example; I am sure the list is quite long if we dig into it all.
I am not trying to say this KEP must support these use cases - but we should be very clear about whether it is intended to and whether it actually will.
Watching pods is about the worst thing we can ask people to do.
The simple answer would be to carry the SA namespace/name per-Endpoint. I worry about what the next request and next after that will be...
On Wed, Dec 16, 2020 at 4:12 PM Mikaël Cluseau [email protected] wrote:
@mcluseau commented on this pull request.
In keps/sig-network/2104-reworking-kube-proxy-architecture/README.md:
+## Motivation
+### Goals
+- provide a node-level abstraction of the cluster-wide Service semantics through an API
+- allow easier, more stable proxy implementations that don't need updates when Service business …

um, just to clarify one thing: you need the ServiceAccount, but an endpoint in my API is only a target endpoint. Also, SAs are on pods so the target endpoint can have this info too, and you'll only watch Pods while not having to reimplement Topology.
That's why I got on the "let's have the pod info in the endpoint" track, because it's very linked to it (since it is the pod ^^). The proxy currently has node info in memory; we could add a stripped-down pod. I may need it anyway if there's a catch with named ports potentially mapping to different numbers within the same endpoint set (ie: a metrics port differs across the many pods selected).
@thockin @howardjohn what about a generic map<string, string> SecurityLabels (or just Labels) on endpoints?
It's possible, but I want to think REALLY hard about how small we can make Endpoints - carrying around a ton of metadata is both very expensive (assuming endpoints change fairly frequently) and adds friction for integration with other xDS providers - they need to provide or at least allow the same metadata.
On Wed, Dec 16, 2020 at 4:37 PM Mikaël Cluseau [email protected] wrote:
@mcluseau commented on this pull request.
In keps/sig-network/2104-reworking-kube-proxy-architecture/README.md https://github.com/kubernetes/enhancements/pull/2094#discussion_r544724409 :
+## Motivation
+<!--
+This section is for explicitly listing the motivation, goals and non-goals of
+this KEP. Describe why the change is important and the benefits to users. The
+motivation section can optionally provide links to [experience reports] to
+demonstrate the interest in a KEP within the wider Kubernetes community.
+[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
+-->
+### Goals
+- provide a node-level abstraction of the cluster-wide Service semantics through an API
+- allow easier, more stable proxy implementations that don't need updates when Service business …

let's say we add Endpoint{ EndpointMetadata Metadata }; that Metadata would need:
- Namespace
- PodName
- PodLabels
- ServiceAccount
ServiceAccount seems to be the most specific part of it. I feel like any extended manager for services would need to put info on the Pod with an admission controller (and also sidecars); if Istio annotates the pod with its own istio.security-account={pod.service-account}, would having PodAnnotations in the endpoint metadata be enough?
If we go the generic way with only a map<string,string> Metadata, we can have standardized names (namespace, pod, labels/*, annotations/*), and we can also have a configuration for the proxy to filter only the required ones. Thus, supporting Istio for instance would be a matter of adding what it needs to an include-list (ie, by default namespace and pod, extended via env-var, or even an env-var prefix so every subsystem can have its own env-var added to the kube-proxy DaemonSet).
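To make the include-list idea concrete, here's a minimal sketch. The `FilterMetadata` name and the prefix-matching convention (a list entry ending in "/" matches a whole key namespace) are my own assumptions for illustration, not part of any existing API:

```go
package main

import (
	"fmt"
	"strings"
)

// FilterMetadata keeps only the metadata keys that appear in the
// include-list. An entry ending in "/" (e.g. "annotations/") acts as a
// prefix match, so a subsystem can opt in to a whole key namespace at once.
func FilterMetadata(md map[string]string, include []string) map[string]string {
	out := map[string]string{}
	for k, v := range md {
		for _, inc := range include {
			if k == inc || (strings.HasSuffix(inc, "/") && strings.HasPrefix(k, inc)) {
				out[k] = v
				break
			}
		}
	}
	return out
}

func main() {
	md := map[string]string{
		"namespace":                  "default",
		"pod":                        "web-0",
		"annotations/istio.sec-acct": "web-sa",
		"annotations/unrelated":      "x",
	}
	kept := FilterMetadata(md, []string{"namespace", "pod", "annotations/istio.sec-acct"})
	fmt.Println(len(kept)) // prints: 3
}
```

An env-var would then just be split into such an include-list before the proxy starts pushing endpoints.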
or we create a new annotation prefix like endpoints.k8s.io (with "/anything") to avoid any reconfiguration of the proxy.
I can't comment on the question of which metadata k8s wants to provide here, but I do want to make a quick comment on how that metadata could be encoded.
The EDS proto already has a field for per-endpoint metadata. It looks like it's currently intended for use by the Envoy LB policies, but in principle I don't see any reason it can't also be used for other things.
@mattklein123 @htuch Can we establish a convention of reverse-DNS metadata key names for this field, like we do in other places? That would allow encoding k8s-specific metadata here without the risk of conflicts with other control planes.
Is this metadata planned per endpoint? A lot of scale concerns arise immediately. I would assume that in particular for large services the majority or even all endpoints will share identical labels, service account, namespace, and port. A potential intermediate abstraction could help: Service -> Endpoints (Labels, ServiceAccount, namespace, Port) -> EndpointIPs (IPs).
@markdroth from kubernetes, it will probably be only strings (map<string,string> instead of map<string,Struct>). There's probably a need to clarify why it's called filter_metadata in your Metadata message :-)
@tgraf This metadata is planned per endpoint, yes. I don't want to do too much premature optimization, but you're right about the recurrence of the values in there. I'm thinking about having a metadata identifier (a hash of the content), so we store identical metadata as a unique instance and factor those down to a single pointer in memory. But as we are pushing per-endpoint, I'd rather verify the need for such added "dereferencing" before adding it.
One question here as someone who is coming late into the conversation:
What is the relationship between EndpointSlice and the endpoint discovery service (eDS)? Right now, each kube-proxy watches all endpoints/endpointslices (EPS) using the k8s watch as transport and provides translation.
- Do we want to think about centralizing watching and transport? Is that out of scope?
- Should there be translation and interpretation beyond what is in the EPS resource?
- Is the translation scoped to the local node? Or is there no such distinction?
/cc @robscott re: EPS and topology
Welcome @bowei :)
What is the relationship between EndpointSlice and the endpoint discovery service (eDS)?
We'd like to converge on the formalism, if possible.
Do we want to think about centralizing watching and transport? Is that out of scope?
Yes, that's a goal now: kube-proxy will be split into an API-to-endpoints "watch and convert" process, plus an optional transport infra to distribute those changes however the cluster admin wants (ie API -> global -> ToR).
Should there be translation and interpretation beyond what is in the EPS resource?
That's still open for Kubernetes-business-related notions (like node labels and topology), but we plan to have a way to communicate (some) labels and annotations down to the node information used by the "node agent" implementation (ie kube-proxy-nftables).
Is the translation scoped to the local node? Or is there no such distinction?
Yes, the goal is to avoid duplication of node-related business logic (topology, maybe network policies too?). Node agents will request the data for their node (by its name) to get a view of what Kubernetes wants to express at this level.
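As a rough illustration of what "request the data for your node" means, here is a sketch with invented types (`Endpoint`, `ForNodes`, `NodeView` are all assumptions for the example; the real proxystore types differ): the topology computation happens centrally, and the node agent only sees the endpoints selected for it.

```go
package main

import "fmt"

// Endpoint is a toy global-state entry: the central process has already
// computed, per endpoint, which nodes should see it (topology result).
type Endpoint struct {
	IP       string
	ForNodes map[string]bool
}

// NodeView returns only the endpoint IPs visible from the given node, so
// the node agent never re-implements topology logic itself.
func NodeView(global []Endpoint, nodeName string) []string {
	var ips []string
	for _, ep := range global {
		if ep.ForNodes[nodeName] {
			ips = append(ips, ep.IP)
		}
	}
	return ips
}

func main() {
	global := []Endpoint{
		{IP: "10.0.0.1", ForNodes: map[string]bool{"node-a": true}},
		{IP: "10.0.0.2", ForNodes: map[string]bool{"node-a": true, "node-b": true}},
	}
	fmt.Println(NodeView(global, "node-b")) // prints: [10.0.0.2]
}
```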
Thanks for the great talk this week at sig-net. I will make another pass through the KEP text, but I thought it made sense to lay out my thoughts, since I was one of the louder "library" voices.
This could be a really nice way to break the problem apart. That said, it's a potentially very large (and reusable) change, so I want to make sure we can understand and ingest it incrementally.
Modelling it as a library first allows a few things that seem valuable to me. When I say library, I mean something like:
// main.go
package main

import (
	"fmt"

	"sigs.k8s.io/libkp2"
)

type myImpl struct{}

var _ libkp2.Impl = myImpl{}

func (m myImpl) NewService(ns, name string, ports []libkp2.Port) {
	fmt.Printf("new service %s/%s with %d ports\n", ns, name, len(ports))
}

func (m myImpl) UpdateEndpoints(ns, name string, deltas []libkp2.EndpointDelta) {
	fmt.Printf("endpoint deltas for service %s/%s: %v\n", ns, name, deltas)
}

func main() {
	// This runs forever. If it returns, something bad happened.
	libkp2.Watch()
}
... the hard parts are in the lib, the main module just provides the impl.
Why?
I can play with it fairly trivially. I can test it trivially. I can build a monolithic binary that is the equivalent of kube-proxy, but I can also implement it in terms of xDS or my own gRPC or ...
When I need to comprehend or debug it, I don't have to bounce through the network unless that was what I wanted to do.
One of the things that I think is hard to wrap my brain around is the per-node processing (and soon per-zone?) that happens. If I am a node in zone X, I can ask the apiserver for the subset of endpoints designated for zone X. I presume this will preserve that property? Or should we think about shifting the subset responsibility into this library?
If I am doing only-local subsetting, do I have to do that myself, or will I be able to offload this to you, too? How dumb can the final agent get?
I'm excited about opening this door - we have considered it many times, but never acted on it. I don't want to artificially slow it down too much, but I do want to think about how we want to consume this and where we draw the boundaries.
The important distinction to make, in the current state of the code, is that we have 2 sets of sets: the node-oriented set, which tries to be the simplest API possible for the node; this set is derived from the global set, which contains all the information needed for node-specific computations (ie topology).
The global set is called proxystore (https://github.com/mcluseau/kube-proxy2/blob/master/pkg/proxystore/proxystore.go) and is a watchable full state. Some source will update it calling store.Update(...), and it can be watched with a loop like:
var rev uint64
for {
rev = store.View(rev, func(...) { ... })
}
In the case of the real proxy, the controllers are updating the store; in the case of the fake-proxy I used in the talk, it's a file watch that updates the store.
Then, there's a generic diff-store that's used to compute deltas from a fully rebuilt arbitrary state and provide the updated/deleted elements. This is key to asynchronous, non-blocking clients. I've begun the work to generalize the "rebuild the target state and get the diff" part in https://github.com/mcluseau/kube-proxy2/blob/master/pkg/server/watchstate/watchstate.go
Currently, all users of watchstate are gRPC endpoints, but I plan to move those out into independent packages, making the gRPC endpoints trivial users of it. The basic loop will be something like this:
ws := watchstate.New(apiv1.ServicesSet, apiv1.EndpointsSet /* ... */)
var rev uint64
for {
rev = store.View(rev, func() { /* update watch state */ } )
// query the watch state and do whatever we want
ws.Reset()
}
This allows arbitrary work on the global store, like pre-filtering topology by zone. The watch state in this case would take a global state and produce another global state. This is also where a converter to xDS would take place. Since we have a store here, that store can come from an arbitrary source too (k8s apiserver, kp2 relay, etc).
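For readers unfamiliar with the pattern in the loops above, a self-contained sketch of such a revision-based watchable store might look like this (a deliberate simplification, not the actual proxystore code; the real one carries typed service/endpoint sets rather than a flat map):

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a minimal watchable full state: writers bump a revision, and
// readers block in View until the revision goes past the one they last saw.
type Store struct {
	mu   sync.Mutex
	cond *sync.Cond
	rev  uint64
	data map[string]string
}

func NewStore() *Store {
	s := &Store{data: map[string]string{}}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// Update mutates the state under lock, bumps the revision, and wakes up
// all blocked watchers.
func (s *Store) Update(fn func(map[string]string)) {
	s.mu.Lock()
	defer s.mu.Unlock()
	fn(s.data)
	s.rev++
	s.cond.Broadcast()
}

// View blocks until the store's revision is greater than lastRev, runs fn
// on a consistent snapshot, and returns the revision that was seen - the
// caller feeds it back in, giving the `rev = store.View(rev, ...)` loop.
func (s *Store) View(lastRev uint64, fn func(map[string]string)) uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.rev <= lastRev {
		s.cond.Wait()
	}
	fn(s.data)
	return s.rev
}

func main() {
	s := NewStore()
	s.Update(func(d map[string]string) { d["svc/default/web"] = "10.0.0.1" })
	rev := s.View(0, func(d map[string]string) { fmt.Println(len(d)) })
	fmt.Println(rev) // prints: 1
}
```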
After all that global state management, there's the global-to-node part, which will need to be put in a specific package (currently it's only a gRPC endpoint). That package would take a proxystore, a node name, and a callback allowing iteration over the full state. It may provide a watchstate too, for lower-level users who do not want the default correlation. The callback will be the same as the current client Run callback, func(items <-chan *ServiceEndpoints), with ServiceEndpoints being the correlation of a service with its endpoints in the given node's scope:
type ServiceEndpoints struct {
Service *localnetv1.Service
Endpoints []*localnetv1.Endpoint
}
So, many things are already in place to play with the code as a library, even if it's mostly at the "full state" level; the requirement for diff-level computation still needs to prove its worth over the simplicity and expected reliability of the full-state build & diff. For instance, even with 100k endpoints, the nft backend still spends most of its time in the nft call for a few endpoint changes.
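The full-state build & diff approach can be sketched in a few lines (simplified relative to the real diffstore, and the flat string map is an assumption for illustration):

```go
package main

import "fmt"

// Diff compares a freshly rebuilt target state against the previous one
// and reports updated (added or changed) and deleted keys - the "rebuild
// the full state and diff" approach described above.
func Diff(prev, next map[string]string) (updated, deleted []string) {
	for k, v := range next {
		if pv, ok := prev[k]; !ok || pv != v {
			updated = append(updated, k)
		}
	}
	for k := range prev {
		if _, ok := next[k]; !ok {
			deleted = append(deleted, k)
		}
	}
	return
}

func main() {
	prev := map[string]string{"ep1": "10.0.0.1", "ep2": "10.0.0.2"}
	next := map[string]string{"ep1": "10.0.0.1", "ep3": "10.0.0.3"}
	up, del := Diff(prev, next)
	fmt.Println(up, del) // prints: [ep3] [ep2]
}
```

Backends like nft then only need to apply `updated` and `deleted`, instead of consuming a stream of fine-grained events.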
As a conclusion, my offer is to provide this full-state processing in a library shape at first and get some feedback from the real world. It's not very far from being done, it converges with the library-shape wish, and it should not slow things down since we'll want this real-world feedback before adding complexity. What's your opinion?
new repo request (suggest your name!) https://github.com/kubernetes/org/issues/2410
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
Can we define concrete scale targets for endpoints? Like for example, if running under KPNG on ipTables+IpVS both pass conformance or density w a baseline of 5K services w a background of 100 endpoints each, is that "acceptable"?
In general to pass conformance in < 2 hours you need endpoints that can respond within a second or two, so I'm assuming that would be a reasonable "this is good enough" heuristic ....
I know tigera did some research w something akin to that as the baseline .
That way we have a concrete perf delta to target... as is, there's a lot of perf optimization stuff in KPNG, and I think getting measurements in place before going further would be easier to manage from a cognitive load perspective :).
Can we define concrete scale targets for endpoints? Like for example, if running under KPNG on ipTables+IpVS both pass conformance or density w a baseline of 5K services w a background of 100 endpoints each, is that "acceptable"?
There are already some SLOs for scalability here: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/slos.md
I think that the one we want is network programmability latency https://github.com/kubernetes/community/blob/master/sig-scalability/slos/network_programming_latency.md