Create Istio owned CNI config
Changes for creating an Istio owned CNI config. Instead of appending the istio-cni plugin to the default CNI config, we will copy the CNI config, append the istio-cni plugin, and write a new Istio owned config file in the same directory. The config will also be created with a higher priority than the primary CNI. This change helps prevent the race condition described in https://github.com/istio/istio/issues/55968.
Significant Changes / Open Questions to Consider:
- This PR defines `CNIConfName` to be the primary CNI filename. Previously this value was used to specify the desired CNI config file name to add the istio-cni plugin to. Now, the value will be used to define the primary CNI filename and the path from which to copy the config.
- Should there be a plugin equality check between the primary CNI plugins and the Istio owned config file plugins?
- Should the Istio owned CNI plugin filename be configurable? Should the file priority be determined programmatically? - UPDATED
- If the Istio owned CNI plugin file already exists when the CNI daemon restarts, can we assume the second highest priority config file is the primary CNI?
- What is the intended behavior of the unchained CNI plugin setting? Should this PR impact the unchained CNI plugin's behavior?
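On the plugin equality question above, one option would be a drift check between the primary config and the Istio owned copy, roughly like this sketch (the `pluginTypes`/`staleCopy` helper names are hypothetical, not code from this PR):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pluginTypes extracts the ordered list of plugin "type" fields from a
// CNI conflist. Hypothetical helper for the equality-check question.
func pluginTypes(conflist []byte) ([]string, error) {
	var conf struct {
		Plugins []struct {
			Type string `json:"type"`
		} `json:"plugins"`
	}
	if err := json.Unmarshal(conflist, &conf); err != nil {
		return nil, err
	}
	types := make([]string, 0, len(conf.Plugins))
	for _, p := range conf.Plugins {
		types = append(types, p.Type)
	}
	return types, nil
}

// staleCopy reports whether the Istio owned copy has drifted from the
// primary config: it should contain the primary's plugins plus istio-cni.
func staleCopy(primary, istioOwned []byte) (bool, error) {
	p, err := pluginTypes(primary)
	if err != nil {
		return false, err
	}
	i, err := pluginTypes(istioOwned)
	if err != nil {
		return false, err
	}
	want := append(p, "istio-cni")
	if len(i) != len(want) {
		return true, nil
	}
	for n := range want {
		if i[n] != want[n] {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	primary := []byte(`{"plugins":[{"type":"calico"},{"type":"portmap"}]}`)
	owned := []byte(`{"plugins":[{"type":"calico"},{"type":"portmap"},{"type":"istio-cni"}]}`)
	stale, _ := staleCopy(primary, owned)
	fmt.Println(stale) // false: the copy matches primary + istio-cni
}
```

Such a check would let the daemon rewrite its copy whenever the primary config changes underneath it.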
TODO:
- [ ] Add istio config cleanup
- [x] Consider alternative solutions proposed here https://github.com/istio/istio/issues/55968#issuecomment-2845620555. This may be a temporary solution until https://github.com/containernetworking/cni/pull/1052 is available
- [ ] Add more robust testing
- [ ] Add documentation
- [ ] Consider edge cases (istio-cni already defined in primary CNI)
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
/test all
Regarding https://github.com/istio/istio/issues/55968#issuecomment-2845620555:
- What happens when you run up against other CNIs that are named `00-*`? What if they're `000-*`? In summary, how does this address the issue where you end up racing with something else to "be first?"
- How will consumers know they're safe if this is tucked away in a config? What happens if consumers change their infra and there's another CNI that beats out Istio's, which brings the situation back to problem 1 above? How do we avoid silently putting folks back in a state where they can, again silently, be affected by this based on specific triggers?
- At one point @howardjohn mentioned the idea of using init containers as a temporary measure to provide strong guarantees, at least until the CNI changes that are needed are available as a baseline for a patch. Did this discussion go any further?
Considering this problem-space touches security, compliance, and packet flows I feel we should provide a very binary guarantee, even if it's for a temporary implementation, and disallow any holes that 1 and 2 bring to the table. Basically, it should be impossible (provided in an automated manner) for something to come up and "start working" outside the confines of a mesh, especially when it can be silent.
> Considering this problem-space touches security, compliance, and packet flows I feel we should provide a very binary guarantee, even if it's for a temporary implementation, and disallow any holes that 1 and 2 bring to the table.
While this is ideal, I'm not sure how possible it is without containernetworking/cni#1052, given that pods can be added to the mesh while they're running. That means that an init container won't work, and it means the hooks that we have post-restart are fairly scarce. I think if this PR allowed the file name to be configurable (as point 3 in the PR description suggests), users at least have some control if another CNI has a different name.
> How do we avoid silently putting folks back in a state where they can, again silently, be affected by this based on specific triggers?
This is the fundamental problem IMO; the CNI spec provides no controls (not even a file/directory lock!) for the kind of binary control we're looking for (well, until containernetworking/cni#1052), so istio-cni doesn't even know it's been preempted post-reboot. So, I think this PR gets us as close as we can get for now.
> That means that an init container won't work
Could it be injected via a webhook, forcing the pod to restart and run the init container when added to the mesh?
> so istio-cni doesn't even know it's been preempted post-reboot
My understanding was that there's already some code in place to watch for changes to these CNI configs via inotify. Couldn't inotify be used to provide a kind of hook for this part of the problem?
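For context on the watch idea: whatever the notification mechanism (the existing watcher, per the comment above, uses inotify), the preemption check itself reduces to a filename comparison against the Istio owned config. A simplified polling sketch, with hypothetical `preempted`/`checkDir` helpers rather than actual istio-cni code:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// preempted reports whether any other CNI config file sorts before the
// Istio owned one, i.e. whether istio-cni has lost first priority.
func preempted(files []string, istioConf string) bool {
	for _, f := range files {
		if f == istioConf {
			continue
		}
		if (strings.HasSuffix(f, ".conf") || strings.HasSuffix(f, ".conflist")) && f < istioConf {
			return true
		}
	}
	return false
}

// checkDir is one iteration of such a watch: list the config directory
// and report whether Istio's file is no longer first.
func checkDir(dir, istioConf string) (bool, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return false, err
	}
	names := make([]string, 0, len(entries))
	for _, e := range entries {
		names = append(names, e.Name())
	}
	return preempted(names, istioConf), nil
}

func main() {
	if preempted([]string{"00-multus.conf", "02-istio-cni.conflist"}, "02-istio-cni.conflist") {
		fmt.Println("warning: istio-cni config has been preempted")
	}
}
```

Note this only detects preemption after the fact; it can't stop a pod that has already started outside the mesh.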
> Could it be injected via a webhook, forcing the pod to restart and run the init container when added to the mesh?
There is no event for a node restart. There's no safe action istio can do to force the pod to restart once it's been added to the mesh, and even if we could, that's a lot of disruption and negates one of the primary benefits of ambient (not requiring restarts)
> Couldn't inotify be used to provide a kind of hook for this part of the problem?
This is the entire problem; no other pods on the cluster can come up before the primary CNI post-reboot because the primary CNI creates the network for the istio-cni pod (especially now that istio-cni runs out of the host network). During that time, there's nothing istio-cni can do to force the primary CNI to wait for istio-cni to come up. It would be nice, but we can't rely on it as the CNI spec doesn't require it
> There is no event for a node restart.
True, I was thinking there'd be some reconciliation (could this be done in the program mentioned at the bottom of this comment that'd run first, outside the CNI network?) tied to the labeling of the workload, and based on that the container would be injected. So on one side you'd have a reconciled approach and on the other side you'd have a webhook to catch anything new coming up. Adding the container as part of reconciliation sucks, but it at least makes the solution correct and reliable, and closes the holes.
> that's a lot of disruption and negates one of the primary benefits of ambient (not requiring restarts)
You're right, and the disruption is terrible. I'd argue that the security and compliance holes eclipse that problem. For example, we won't ship to customers with the possibility of ambient mode being enabled due to this issue. A few other folks I know (unrelated to my day job) are backing out of their ambient setups until this is fixed. I see it similar to crypto: if there was a fancy new super safe/fast/compressing cipher but had a hole or two its benefits wouldn't matter since it's fundamentally broken.
> because the primary CNI creates the network for the istio-cni pod (especially now that istio-cni runs out of the host network)
Ah, yes you're right. Forgot that istio-cni is not on the host net! There could be something run outside of that network that could come up first, just to do the watch and to send some sort of message about it to istio-cni (when it eventually comes up). I imagine that would be a lot more (throwaway) work.
I think what you're suggesting is feasible @aaronjwood, but I think this PR gets us to about the same place as a practical matter. Anybody installing a new CNI onto a cluster is going to be aware of/can learn what that config file's name is. From there, you're a helm upgrade away from changing the istio-cni owned config file name with this PR. You only need to change it again if there's a change to the existing CNI's config naming scheme or a new CNI is added (not too common, from my understanding). I'll also add that this is a stopgap until the new CNI versions get plumbed through to containerd and Kubernetes.
> You're right, and the disruption is terrible. I'd argue that the security and compliance holes eclipse that problem. For example, we won't ship to customers with the possibility of ambient mode being enabled due to this issue. A few other folks I know (unrelated to my day job) are backing out of their ambient setups until this is fixed. I see it similar to crypto: if there was a fancy new super safe/fast/compressing cipher but had a hole or two its benefits wouldn't matter since it's fundamentally broken.
I think the benefit of the fix in this PR depends on some of the primary CNI specifics you mentioned above and users' knowledge of how this fix interacts with a primary CNI:
> - What happens when you run up against other CNIs that are named `00-*`? What if they're `000-*`? In summary, how does this address the issue where you end up racing with something else to "be first?"
> - How will consumers know they're safe if this is tucked away in a config? What happens if consumers change their infra and there's another CNI that beats out Istio's, which brings the situation back to problem 1 above? How do we avoid silently putting folks back in a state where they can, again silently, be affected by this based on specific triggers?
If there is a CNI that has the requirement of "owning"/controlling the highest priority CNI config and will overwrite any higher priority config, then the solution in this PR will never work. I am not sure how often CNIs have this dynamic requirement. In the case where the primary CNI has a fixed priority, this solution works as long as the user properly configures the Istio owned CNI config's priority if necessary (i.e. if the default of `02` isn't high enough).
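For concreteness, the priority being discussed comes down to lexicographic filename ordering in the CNI config directory (typically `/etc/cni/net.d`): the runtime uses the first `.conf`/`.conflist` file it finds in sorted order. A small illustrative sketch of that selection rule (the `effectiveConfig` helper is hypothetical):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// effectiveConfig mimics the runtime's selection rule: the
// lexicographically first .conf/.conflist file in the CNI config
// directory wins.
func effectiveConfig(files []string) string {
	var confs []string
	for _, f := range files {
		if strings.HasSuffix(f, ".conf") || strings.HasSuffix(f, ".conflist") {
			confs = append(confs, f)
		}
	}
	if len(confs) == 0 {
		return ""
	}
	sort.Strings(confs)
	return confs[0]
}

func main() {
	// With the default 02- prefix, Istio's copy beats a typical primary CNI...
	fmt.Println(effectiveConfig([]string{"10-calico.conflist", "02-istio-cni.conflist"})) // 02-istio-cni.conflist
	// ...but loses to a CNI that insists on being first with a 00- prefix.
	fmt.Println(effectiveConfig([]string{"00-multus.conf", "02-istio-cni.conflist"})) // 00-multus.conf
}
```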
> At one point @howardjohn mentioned the idea of using init containers as a temporary measure to provide strong guarantees, at least until the CNI changes that are needed are available as a baseline for a patch. Did this discussion go any further?
I agree with this being an alternative, but it modifies the fundamental patterns ambient is based on (not injecting a sidecar or init container into application workloads).
> You're right, and the disruption is terrible. I'd argue that the security and compliance holes eclipse that problem. For example, we won't ship to customers with the possibility of ambient mode being enabled due to this issue. A few other folks I know (unrelated to my day job) are backing out of their ambient setups until this is fixed. I see it similar to crypto: if there was a fancy new super safe/fast/compressing cipher but had a hole or two its benefits wouldn't matter since it's fundamentally broken.
I totally hear this though. I think we need to find the balance between a solution that favors complete security and disrupts the existing ambient workflows, and a solution where security is based on the user's proper configuration and knowledge of their underlying CNI and doesn't impact the existing ambient workflow.
> I am not sure how often CNIs have this dynamic requirement.
While not "dynamic", I believe Multus has the requirement of being first. Last I looked, their config is prefixed with `00`. I'm not sure what'll happen if something loads before it; haven't really dug in there. To your point, I'm not sure which other CNIs in the landscape (who knows about closed-source ones used inside company X, etc.) would run into this. I'd imagine the whole "starting first" requirement is not overly rare.
> and will overwrite any higher priority config
I'm not sure how the design of CNI holds together for this situation then. For example, if Multus really does have a hard requirement on being first what else can they do besides say "we don't work with anything that tries to get above us" and punt the problem? To @keithmattix's point I totally get that this is all a temporary solution until the CNI change is available for use.
> In the case where the primary CNI has a fixed priority, this solution works
This was going to be my next question. Currently we say Istio's CNI works with anything; with this change, what'll be the messaging for folks using Multus (or something similar), which requires being first? This would account for both OSS and proprietary solutions, so anything is on the table.
> I totally hear this though. I think we need to find the balance between a solution that favors complete security and disrupts the existing ambient workflows, and a solution where security is based on the user's proper configuration and knowledge of their underlying CNI and doesn't impact the existing ambient workflow.
Agreed, in a perfect reality the change to CNI will become available immediately and this whole intermediate state we're in can be skipped :) I understand the situation is a bit tricky logistically and overall quite annoying!
> This was going to be my next question. Currently we say Istio's CNI works with anything; with this change, what'll be the messaging for folks using Multus (or something similar), which requires being first? This would account for both OSS and proprietary solutions, so anything is on the table.
One thought here is an Istio owned CNI config could be opt-in... but if users opt out, they're vulnerable (which is definitely not ideal).
> While not "dynamic", I believe Multus has the requirement of being first. Last I looked, their config is prefixed with `00`. I'm not sure what'll happen if something loads before it; haven't really dug in there. To your point, I'm not sure which other CNIs in the landscape (who knows about closed-source ones used inside company X, etc.) would run into this. I'd imagine the whole "starting first" requirement is not overly rare.
I'll do some more investigation here to see which CNIs this solution might be incompatible with.
A few incomplete notes...
- For Multus users I don't think they need any fix; pods in Multus explicitly declare which network they need(?), though maybe that doesn't work for ambient...
- For the init container approach, I don't think it's bad as an option. It won't work for dynamically enrolled pods, but that's largely fine... if you opt into the extra layer of enforcement then you can handle restarting your pods to onboard. Note that if we do this we need to make the repair controller ambient-aware.
- Given the high risk of environmental incompatibilities this should definitely be optional. I guess TBD what the default is, based on some more investigation. Thanks for leading this!
/test integ-ambient-mc
@howardjohn could you PTAL?
Per discussion in the WG meeting on 6/18, we are moving forward with this temporary solution for 1.27 @howardjohn