cilium Operator manages identities

Feature: Operator Manages Cilium Identities

CFP: https://github.com/cilium/cilium/issues/27752

Description

A new feature hidden behind a flag. Disabled by default.

Besides the new flag that enables the feature, there are no other user visible changes.

cilium-operator is going to manage (create and delete) Cilium Identity API objects instead of cilium-agent.

cilium-operator calculates the desired state for Cilium Identities based on watched events for Cilium Identities, Pods, Namespaces and Cilium Endpoint Slices (if enabled).

cilium-operator creates Cilium Identities on pod updates for a unique label set based on pod and namespace labels.

cilium-operator deletes Cilium Identities when their labels are not used by any pods.

cilium-agent no longer writes to Cilium Identities.

cilium-agent only watches Cilium Identities.

pod creation (cilium-cni) is blocked on getting a matching Cilium Identity from the watcher cache.

Feature: Operator Manages Cilium Identities
A new feature hidden behind a flag. Disabled by default.
cilium-operator is going to manage (create and delete) Cilium Identity API objects instead of cilium-agent.

kind/feature

Signed-off-by: Dorde Lapcevic <[email protected]>

Jan 22 '24 17:01 dlapcevic

Commit 4be0482f59e1a0f7c83ac247c53af0d05543a1f6 does not match "(?m)^Signed-off-by:".

Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin

Jan 22 '24 17:01 maintainer-s-little-helper[bot]

cc @alan-kut

Jan 31 '24 10:01 dlapcevic

cc @joestringer

Jan 31 '24 12:01 dlapcevic

Thanks for the contribution! This looks like a lot of work. At my initial pass, I will just take a look generally at the structure and how we can make this review process work for everyone. I hope you'll understand that a PR of this size needs a certain level of consideration as to how we communicate about the changes before we get into the deep details and line-by-line review of the overall change. As long as we are in principle aligned on the core idea and have a fair bit of alignment on the CFP, I think we should be able to break this down into smaller, more manageable hunks and integrate this gradually into the codebase. I think that at in general, the operator should have this capability to allocate Identity objects, so we should have broad alignment. The rest will be up to the technical details/discussions and ensuring that this process raises any potential concerns and provides adequate space for us to discuss them.

Does this PR now match the latest state of the CFP (issue and/or google doc) or are there any notable differences to consider that you would like to point out? Last that I recall there were a couple of details that we were not fully aligned on, but I would have to brush up my memory on the latest for those. For now is this just "Phase 1" as described in the linked issue?

Looking from a high level, I see this is almost 5,000 lines of code churn in a single patch. Usually in the Cilium project we aim for maybe 200 or less per patch, and ideally under a thousand total per PR. Ensuring that the changes are introduced in logical steps as separate patches can significantly improve the review process. For instance, CNCF / LFX tracks Review Time By Pull Request Size and you can see that time to complete a change varies by as much as 3x longer for a PR that is 500+ lines compared with just a few lines. This is another order of magnitude larger than even the high end of those statistics. While I recognise that it can take additional effort from a contributor such as yourself to break down the changes into individual logical patches, I think that this can have several benefits that will ultimately reduce the time to complete this work compared with a monolithic approach: First, as you work to split up the changes, it provides a way for you to perform self-review and notice any unusual code patterns or areas for improvement. The more that you find yourself, the less that you are relying on reviewers to catch and provide feedback on those areas. Secondly, reviewers can more easily chunk the review into manageable pieces. Personally I struggle to reason about more than 1-200 lines of code at a time. If I can accept that certain patches are good and move on, then it provides a useful way to stage the review (eg, I'll review a couple of commits, move on to other tasks, come back and revisit subsequent commits from where I left off). If we consider the converse, if there is a lot of code to track and reviewers lack of confidence around understanding the PR, decisions made and how that impacts the longer term health of the codebase, then that can lead reviewers to delay, defer, or otherwise avoid providing approval around the changes. Overall I would expect that structuring the PR clearly and working with reviewers to ensure the changes are understood will help all of us to land the change into the tree quicker, which will free you up to focus on subsequent efforts. Happy to engage in dialogue about the line there since it's very fuzzy, but I think we can find a middle ground where we can balance the burden of developing / formatting the changes with the review and ensuring the quality bar is met.

One thing to consider would be whether there is some refactoring occurring in the PR which could be split out and submitted separately as preparatory PR(s) for improving general Cilium code health before the main part of the PR goes in. Another principle we apply is that a single commit should either have functional changes or cosmetic / code organization changes. Mixing the two makes it more complicated to understand the impact of each change. At different layers of the code, there is often also ways to provide a certain facility / functionality without the layer above consuming or requiring that logic. For instance, to add a well-abstracted library at a lower layer, then incrementally introduce the logic on how that library interacts with other parts of the programs, then add CLI flags, docs, Helm updates etc. Furthermore, by splitting into individual commits, this can provide a good mechanism to track and document decisions that were made and context about "why" certain decisions were made during the writing of the code. Typically we will include as much "why" as possible in the commit messages. This means that in future if there is an issue or someone is debugging how this logic works, they can browse in a local git repository, look through git log, git blame, etc. and find the context necessary to reason about how they should make subsequent changes. This has saved us from making bad decisions for bugfixes many times.

Applying some of these principles to the direct properties of this PR, I can see some examples:

pkg/identity/{basicallocator,nonglobal} - Without looking at the details, I suspect what's going on here is that the identity management abstraction was not well-suited to this newer model for managing identities. One critical thing is to ensure that whatever the new code does, it does not break the existing operation of the current code. I assume given that cilium-agent currently does allocation with some libraries in this area of the codebase, this is a rewrite or refactor of the existing functionality. Focusing on how that change by itself works without delivering the entirety of this PR would be a great first step. I'll admit though even the pkg/identity changes are already pretty large---2K overall---so if we can break it down further into smaller PRs then even that can help. I also note I didn't yet look at the details of how that integrates with the rest of Cilium either.
operator/pkg/ciliumendpointslice - I notice there's a few changes to test code here. They seem like relatively simple refactoring that could be in their own PR. Even if that's only ~125 lines different, that would be 125 lines that don't need to be blocked on the rest of this broader PR, won't need rebasing, etc. etc.
Clearly the changes under operator/pkg/ciliumidentity are the most significant in the PR, and that's understandable even just knowing the context of the overall PR title and goal. Do I understand correctly that there is intent there to modernize the way that the operator handles these resources, so rather than reusing/refactoring existing code to implement how cilium-operator pulls the info & handles it, the intent is for that to be a complete rewrite? The reason I ask is just that this is a design decision, and can have impacts on whether for instance the allocation logic in cilium-agent has different bugs than the implementation in cilium-operator and so on. I'm open to the idea that this is the right tradeoff, but whatever tradeoffs are being made there, I think it's worth documenting those in the commit messages.

The PR description does a good job of describing some of the invariants / statements about expected behaviour. Thanks for that outline. If you have any further context about the structure of the PR, any design decisions that were made, or even diagrams then I'm sure the reviewers would appreciate that as well. Not strictly necessary, but just to get reviewers on the same page it can help to clearly outline your expectations, any focus points you're not sure about etc. so that reviewers who are not so familiar with the CFP can learn.

Finally, we may want to think a bit also about some of the broad aspects like:

How do users configure this? I don't see any Helm changes currently.
Do we need to update some of the documentation, either for users or for developers to outline architectural changes?
What does it look like to have comprehensive testing for the new functionality?

Jan 31 '24 21:01 joestringer

One thing that I think is playing a role with the pkg/identity changes is that currently the assumption is that all identities (both clusterwide and node-local) are allocated through one interface through one part of the code. So part of this would be splitting that out such that the cilium-agent allocates node-local identities and avoids allocation for cluster-wide identities. I think that this as a dedicated problem is worth reasoning through. How does the cilium-agent interact with the identity subsystem in the architecture when the cluster-wide identity allocation [in the per-node cilium-agent] is an optional feature. Logically I could see this being introduced as a dedicated PR as a dependency for the broader feature here.

a temporary security identity (temp id) will be created and used locally by the agent, until it's replaced by a global identity (Cilium Identity).

Ah, I think this was one of the sticking points during the CFP review. Maybe we can find a way to specifically dig into that aspect and come to a consensus on it. My recollection of this specific aspect was that I was concerned about whether the benefits of this piece of logic outweighed the additional complexity. This also raised some lifecycle & testing questions.

Jan 31 '24 22:01 joestringer

Thank you very much @joestringer for all the suggestions. I agree with you. I'll work with the reviewers to make the review process as seamless as possible.

I understand the points that you make about the design and trade offs. I'll answer all of your questions soon, across separate commits and PRs.

I already started with refactoring CES test utils, that can clearly be a separate PR: https://github.com/cilium/cilium/pull/30577

I'll keep this PR open for now, until all the code is moved to separate PRs.

Feb 01 '24 12:02 dlapcevic

Implement Basic Identity Allocator PR is ready: https://github.com/cilium/cilium/pull/30602

Feb 02 '24 14:02 dlapcevic

[agent] Implement Temporary Security Identity Allocation PR is ready: https://github.com/cilium/cilium/pull/30605

Feb 02 '24 21:02 dlapcevic

[agent] Export and refactor parts of identity cache allocator PR is ready: https://github.com/cilium/cilium/pull/30633

Feb 05 '24 17:02 dlapcevic

[operator] Cilium Identity Controller - Operator manages CIDs PR is ready for review: https://github.com/cilium/cilium/pull/30719

Feb 12 '24 16:02 dlapcevic

Given that this PR is primarily available for now to demonstrate the overall solution that you are driving towards, I think it is a useful example. I see that the description references the CFP and mentions the different aspects of the problem, and the latest GitHub breadcrumbs provide links to active subsets of this PR. I'll mark this as draft for now, as we should focus on reviewing the smaller pieces of this PR first and get them in, then later we can look at whether to rebase the remainder of this PR and lift it back out of draft. Thank you.

Feb 23 '24 19:02 joestringer

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

May 15 '24 01:05 github-actions[bot]

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Jun 29 '24 01:06 github-actions[bot]

This pull request has not seen any activity since it was marked stale. Closing.

Jul 13 '24 01:07 github-actions[bot]

Operator manages identities - full implementation