design-cfps CFP-27752: Operator Manages Cilium Identities

CFP google doc: https://docs.google.com/document/d/1Hcc_2mB9OOUxrqQgZ-gSYDPnLYE_If_TCzVbUGDOdGM

More discussion can be found in: https://github.com/cilium/design-cfps/pull/19

Sep 13 '24 15:09 dlapcevic

cc @joestringer

Sep 13 '24 15:09 dlapcevic

Status update

The main code changes for operator ID management are merged.

Next steps:

Testing: before we can expect users to adopt operator ID management in 1.17, we want to add more tests for it. We need to check whether there are existing e2e tests that create a cluster and verify that pods are running and can communicate with each other while adhering to existing network policies. We would then either reuse an existing test or create a new one, with operator ID management enabled.
https://github.com/cilium/cilium/issues/35402
https://github.com/cilium/cilium/issues/34740
https://github.com/cilium/cilium/issues/34865

Oct 15 '24 11:10 dlapcevic

Copy of @tamilmani1989's questions from Cilium Slack about operator ID management.

As the operator starts creating Cilium identities and associating them with CEPs, does cilium-agent only need to watch CEP/CES to receive updates, without watching the identity resource?

With the current implementation, cilium-agent still needs to watch CID (CiliumIdentity) objects. cilium-operator creates CIDs based on pods and namespaces, and then cilium-agent can assign the security identity to the CEP when pods are created.

A few notes:

CIDs might not be needed when CESs are not used, because CEP objects already carry both the security ID and the labels. However, this is inefficient. CESs solve it: the core endpoints within a CES contain only an integer security ID and not the labels (which can be very large).
With the previous point in mind, we’re moving toward removing CEP and keeping only CES, with cilium-operator generating both CID and CES without requiring any additional input. It can use pod and namespace information.

Today, identities are not created for pods running on unmanaged nodes (no cilium-agent running). Is the operator going to detect whether a pod is scheduled on a managed or unmanaged node and create identities only for pods on managed nodes?

This is a very good question to raise. I created an issue for it. We want cilium-operator to create CIDs only for pods that are managed by Cilium.

New issue: https://github.com/cilium/cilium/issues/35402

I guess cilium-agent watches the local pods on its node to track label changes. With this feature, does the agent still need to watch pods?

Watching local pods is required for more than just access to pod labels; HostPort and other functionality depend on it too. It would be possible to redesign the network policy enforcement system to get pod labels from CES instead of from the Pod object, but the benefit would be small, because local pod watchers are not very costly.

Since this gets rid of the identity heartbeat, how does the operator detect that an identity has no endpoints and clean it up?

The initial implementation has just been completed and still uses the legacy CID GC. This is because having the operator manage CIDs is a big architectural change, and we wanted to take smaller steps. We’re planning to move CID GC to the new CID controller: https://github.com/cilium/cilium/issues/34740.

The operator keeps track of which pods use which CIDs, and how many pods use each CID. It also has access to how CIDs are used in CEPs and CESs. When the operator notices that a CID is no longer used by any Pod, CEP, or CES, it will clean that CID up.
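The bookkeeping described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual CID controller: the `cidUsage` type, its methods, and the string pod keys and CID names are all hypothetical, and the real operator still uses the legacy CID GC for now.

```go
package main

import "sort"

// cidUsage is a hypothetical, simplified tracker. It only illustrates the
// idea of counting CID users; it is not Cilium's CID controller.
type cidUsage struct {
	cidByPod     map[string]string // pod key -> CID name
	podCount     map[string]int    // CID name -> number of pods using it
	endpointRefs map[string]int    // CID name -> references seen in CEPs/CESs
}

func newCIDUsage() *cidUsage {
	return &cidUsage{
		cidByPod:     map[string]string{},
		podCount:     map[string]int{},
		endpointRefs: map[string]int{},
	}
}

// SetPod records that pod now uses cid, releasing any CID it used before.
func (u *cidUsage) SetPod(pod, cid string) {
	u.RemovePod(pod)
	u.cidByPod[pod] = cid
	u.podCount[cid]++
}

// RemovePod forgets pod and decrements the pod count of its CID.
func (u *cidUsage) RemovePod(pod string) {
	cid, ok := u.cidByPod[pod]
	if !ok {
		return
	}
	delete(u.cidByPod, pod)
	u.podCount[cid]--
	if u.podCount[cid] <= 0 {
		delete(u.podCount, cid)
	}
}

// SetEndpointRefs records how many CEP/CES entries currently reference cid.
func (u *cidUsage) SetEndpointRefs(cid string, n int) {
	if n <= 0 {
		delete(u.endpointRefs, cid)
		return
	}
	u.endpointRefs[cid] = n
}

// Unused returns, in sorted order, the CIDs from existing that are not
// used by any pod and not referenced by any CEP or CES.
func (u *cidUsage) Unused(existing []string) []string {
	var out []string
	for _, cid := range existing {
		if u.podCount[cid] == 0 && u.endpointRefs[cid] == 0 {
			out = append(out, cid)
		}
	}
	sort.Strings(out)
	return out
}
```

A garbage-collection pass in this sketch would call `Unused` with the list of CIDs that currently exist and delete whatever it returns. A CID that no pod uses but that a CEP or CES still references is kept until that reference disappears.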

Oct 15 '24 14:10 dlapcevic

then cilium-agent can assign the security identity to the CEP when pods are created

The doc states that the operator updates the CEP status.Identity field. Should we update that?
Is there any reason you switched to the agent updating the identity in CEP instead of the operator?
How does the agent associate a CEP with a CID when it receives an update from the operator? Does the CID object contain just labels?

Apologies for troubling you with more questions. It would be great if we updated the doc with this information.

With the previous point in mind, we’re moving toward removing CEP and keeping only CES, with cilium-operator generating both CID and CES without requiring any additional input. It can use pod and namespace information.

Did you mean the agent would not create CEPs at all? If so, what will CES contain then?

The operator keeps track of which pods use which CIDs, and how many pods use each CID. It also has access to how CIDs are used in CEPs and CESs. When the operator notices that a CID is no longer used by any Pod, CEP, or CES, it will clean that CID up.

Why can't the operator just track CID -> CEP mappings and GC a CID when no CEP is found for it? Why does it need to track Pod -> CID mappings? Pod to CEP is 1:1 anyway.

Oct 18 '24 04:10 tamilmani1989

Apologies for troubling you with more questions. It would be great if we updated the doc with this information.

The doc is outdated. I added a note to both the CFP doc and this CFP PR saying so, and stating that this CFP PR contains the up-to-date information.

Is there any reason you switched to the agent updating the identity in CEP instead of the operator?

Yes, to avoid the operator writing to CEP. That allowed us to go with a simpler implementation of operator CID management and also moves us toward removing CEP. The operator creates CIDs from pods and namespaces and doesn't need CEP.

How does the agent associate a CEP with a CID when it receives an update from the operator? Does the CID object contain just labels?

Without operator CID management, cilium-agent gets the CID if it exists, or creates it otherwise, whenever a pod/CEP is created or updated. With operator CID management, it only tries to get the CID from the watcher and never tries to create it. Take a look at the PR: https://github.com/cilium/cilium/pull/34867.
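The difference between the two modes can be sketched like this. It is a minimal illustration only, not the code from the PR: `identityCache`, `resolve`, `observe`, `errIdentityNotReady`, and the use of a plain string as the label key are all hypothetical names and simplifications.

```go
package main

import "errors"

// errIdentityNotReady is a hypothetical error for the case where the
// operator has not created the CID yet and the agent must wait or retry.
var errIdentityNotReady = errors.New("identity not yet created by operator")

// identityCache is a hypothetical stand-in for the agent's view of CIDs,
// keyed by a canonical string form of the label set.
type identityCache struct {
	byLabels map[string]int64
	nextID   int64
}

func newIdentityCache(firstID int64) *identityCache {
	return &identityCache{byLabels: map[string]int64{}, nextID: firstID}
}

// observe records a CID that arrived through the watcher.
func (c *identityCache) observe(labels string, id int64) {
	c.byLabels[labels] = id
}

// resolve returns the identity for labels. When operatorManaged is false
// it behaves as get-or-create; when true it is get-only and never creates.
func (c *identityCache) resolve(labels string, operatorManaged bool) (int64, error) {
	if id, ok := c.byLabels[labels]; ok {
		return id, nil
	}
	if operatorManaged {
		return 0, errIdentityNotReady
	}
	id := c.nextID
	c.nextID++
	c.byLabels[labels] = id
	return id, nil
}
```

In this sketch, a caller in operator-managed mode that gets `errIdentityNotReady` would retry once the watcher delivers the CID through `observe`. How the real agent waits for the CID is not covered here.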

Did you mean the agent would not create CEPs at all? If so, what will CES contain then?

Yes. CES will be created from pods instead of CEPs. We have all needed information in pods and namespaces. This would not be possible without the new CID controller (added with operator CID management), because it contains information about which CID is assigned to which pod.

We already started working on creating CES from pod instead of CEP: https://github.com/cilium/cilium/pull/34784

Why can't the operator just track CID -> CEP mappings and GC a CID when no CEP is found for it? Why does it need to track Pod -> CID mappings? Pod to CEP is 1:1 anyway.

The operator creates CIDs from pod and namespace data, so it naturally tracks the CID <-> Pod mapping. It doesn't need to depend on CEP/CES, but it can use them to determine whether a CID needs to be deleted: if CES is enabled it looks in CES, otherwise in CEP.

Note: operator still uses legacy CID GC.

Oct 18 '24 08:10 dlapcevic