KEP-1977: Add KEP for ContainerNotifier
Enhancement issue: https://github.com/kubernetes/enhancements/issues/1977
CC @yuxiangqian
/assign @thockin @saad-ali @liggitt
/assign @sjenning @derekwaynecarr @dchen1107
CC @andrewsykim
I would like to see example pod specs (at least the container blocks) for the examples and use cases mentioned.
I understand I should be able to work out what it would look like from the API definition, but concrete examples would make it clearer.
They also expose issues like the one @thockin noticed here
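For concreteness, here is the kind of shape I have in mind, purely illustrative; the field and type names below are guesses, not the KEP's actual API. The container block would grow a named `notifiers` list, analogous to lifecycle handlers, that a notification can reference by name:

```go
// Purely illustrative; field and type names are guesses, not the KEP's API.
// corev1 = k8s.io/api/core/v1

// ContainerNotifier is a named action declared on a container, analogous to a
// lifecycle handler, that a Notification/PodNotification can reference by name.
type ContainerNotifier struct {
	// Name is what a notification references, e.g. "quiesce".
	Name string `json:"name"`
	// Exec runs a command inside the container when the notifier fires.
	Exec *corev1.ExecAction `json:"exec,omitempty"`
	// TimeoutSeconds bounds how long the action may run before it is reported as failed.
	TimeoutSeconds int32 `json:"timeoutSeconds,omitempty"`
}

// A container block in a pod spec would then look something like:
//
//   containers:
//   - name: database
//     image: example.com/db:1.0                       # hypothetical image
//     notifiers:
//     - name: quiesce
//       exec:
//         command: ["/bin/sh", "-c", "db-ctl freeze"]  # hypothetical command
```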
CC @ashish-amarnath
Given the open comments should we discuss again? Is this stalled on a decision or on hands-on-keyboards?
@thockin Yes, I think we should discuss again as there are a few open issues. I'll try to set up a meeting.
Given the open comments should we discuss again? Is this stalled on a decision or on hands-on-keyboards?
@thockin there are a couple of other things (VolumeSnapshot GA, KubeCon US, etc.) going on concurrently for both @xing-yang and me, and we should have more cycles for this after this week. I am reworking some pieces of this KEP to make it clearer, and we will schedule a discussion to talk about the Notification piece. Thanks for your patience.
NP - I have a fair bit of free time for the next few weeks, so I am circling back to all the KEPs I am tracking
1.21 KEP freeze is in O(days) - if this wants review before then, I need updates.
Apologies for the late update. We have revisited some of the designs in this KEP to address some of the major concerns raised in the previous version (details in this document).
Major changes:
- Resolve the concerns around potential scalability issues when a Notification targets a large number of Pods (thousands).
- Shrink the scope of phase 1 to reduce implementation complexity and simplify the PodNotification API (a rough sketch of the resulting split follows below).
- Resolve the concern about too-deeply nested structures.
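To make the split concrete, here is how I read the new shape from this discussion; this is not copied from the KEP and all names are placeholders:

```go
// Rough reading of the split (all names are placeholders, not the KEP's API).
// metav1 = k8s.io/apimachinery/pkg/apis/meta/v1

// NotificationSpec selects a set of pods and names the notifier to fire in
// each of them; one PodNotification is fanned out per matching pod.
type NotificationSpec struct {
	// Selector picks the target pods.
	Selector metav1.LabelSelector `json:"selector"`
	// NotifierName must match a notifier declared in the pod spec.
	NotifierName string `json:"notifierName"`
	// SelectorPolicy controls whether only pre-existing pods are matched
	// ("PreExisting") or later-created pods as well.
	SelectorPolicy string `json:"selectorPolicy,omitempty"`
}

// PodNotificationSpec is the one-shot, per-pod object that is actually acted
// on (by a trusted controller in phase 1, the kubelet in phase 2).
type PodNotificationSpec struct {
	PodName      string `json:"podName"`
	NotifierName string `json:"notifierName"`
}
```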
/test pull-enhancements-verify
I am pretty close to happy with this.
@yuxiangqian we need to add a file for production readiness here: https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-node
/lgtm
KEP number in kep.yaml should be the enhancement issue number 1977. Fixed it now.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Phase 1 looks acceptable to me.
The kubelet needs no change; only the pod spec. The kubelet does not need to be notification aware in phase 1.
Completion of phase 1 will allow us to prove out this idea and root out potential issues with the design, both API and performance/scale, before lumping the controller logic into the kubelet where it will be harder to debug and isolate as a source of scaling issues.
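A minimal sketch of what that phase-1 fan-out could look like, assuming a trusted out-of-tree controller; `createPodNotification` stands in for the not-yet-existing client for the new resource, and whether terminating pods should be skipped is one of the open questions:

```go
package phase1

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// fanOut sketches the phase-1 controller step: for a Notification's selector,
// list the matching pods and create one PodNotification per pod.
func fanOut(ctx context.Context, cs kubernetes.Interface, ns string,
	selector metav1.LabelSelector, notifierName string,
	createPodNotification func(podName, podUID, notifierName string) error) error {

	sel, err := metav1.LabelSelectorAsSelector(&selector)
	if err != nil {
		return err
	}
	pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: sel.String()})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		// Skip pods that are already terminating; whether these should get
		// notifications at all is an open question in this thread.
		if p.DeletionTimestamp != nil {
			continue
		}
		if err := createPodNotification(p.Name, string(p.UID), notifierName); err != nil {
			return err
		}
	}
	return nil
}
```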
@sjenning thanks for taking another look at this. As I mentioned at last week's SIG Node meeting, the new proposal narrows the scope considerably based on our offline review meetings and comments. Because the reviewers were overloaded last cycle, we postponed this feature; hopefully we can finish phase 1 for 1.23 this cycle.
I had a number of questions/clarifications I hope we can get through in SIG Node call, otherwise I will paste here right after.
Derek and I talked about this. Some thoughts.
Point-in-time nature of Notifications: a Notification created at time X might behave differently from one created at time X+1, due to the PreExisting SelectorPolicy and Pod creation times, i.e. if the PodNotification is created before the Pod, the notification fails. PodNotifications are a one-shot thing.
Notification controller watches all pods for CREATE (UPDATE?) to see if the selector matches and then creates PodNotifications on container start? container ready? pod ready? Do pods that have deletionTimestamp set get notifications?
If the pod is deleted, are all PodNotifications that reference the pod deleted too? Is the ownerRef for the PodNotification the pod? or the Notification?
Number of PodNotification resources in the cluster would be the sum of all matching pods over all Notifications, a potentially very large number. How are these cleaned up? The Notification controller (i.e. the kubelet in the future) would have to WATCH PodNotifications cluster-wide, right? I don't see a way to apiserver-side filter the watch to only PodNotifications referencing a pod a particular kubelet is running. Thus for each change to a PodNotification cluster-wide, a watch event is sent to each kubelet.
QPS for the controller updating all these PodNotification resources will need to be high. Stress on the container runtime and the apiserver. A potential scaling problem. We have had issues with kubelet QPS in the past delaying pod status updates and leading to all manner of pain.
We have a number of container types including init and ephemeral containers that also use the Container type. Do notifications work for them? Doesn't seem to make sense for those container types.
I don't see a way to apiserver-side filter the watch to only PodNotifications referencing a pod a particular kubelet is running. Thus for each change to a PodNotification cluster-wide, a watch event is sent to each kubelet.
For this item, I think it would be good if we can integrate with the NodeAuthorizer admission plugin.
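For comparison, the kind of apiserver-side scoping that exists for Pods today, and that a PodNotification-style resource would be missing, looks roughly like this; the sketch assumes plain client-go and is only meant to illustrate the gap:

```go
package watchfilter

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
)

// watchMyPods shows the server-side filtering the kubelet relies on for Pods:
// only events for pods bound to this node are delivered. No equivalent field
// selector exists for a PodNotification-style resource, so every kubelet would
// receive every PodNotification event cluster-wide.
func watchMyPods(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	w, err := cs.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("spec.nodeName", nodeName).String(),
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		_ = ev // handle pod events for this node only
	}
	return nil
}
```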
To summarize:
I think this feature needs a garbage collection strategy so we understand the anticipated lifetime of a PodNotification resource as both a provider and a consumer. I think it's reasonable to say that a Pod can have at most X active PodNotification resources. This helps us budget QPS, plan for handler costs, etc. Completed PodNotification resources must be garbage collected in order to avoid denial of service for both the kubelet (memory growth) and kube-apiserver (similar). A TTL on the resource is one way to do that.
Exec/probe costs are really expensive in practice and a key input to understanding node sizing, so having a fixed budget to plan for is helpful. For awareness: pods whose liveness or readiness probes run too frequently or make heavy use of exec dominate CPU usage by the kubelet and the runtime. This is why having a limit on the number of concurrent notifications at admission time is helpful.
Adding ContainerNotificationStatus to include ContainerID so it maps to what is shown in PodStatus would let us know if the value is unique across time and space.
When in the pod lifecycle is the pod accepting signals? Is a signal ever needed before particular probes (liveness/readiness/etc.) succeed? If not, should we scope it to a well-defined phase of the Pod lifecycle?
Is there a reason to not explicitly exclude init containers or ephemeral containers as notification targets?
If a pod is undergoing termination, is a signal still sent?
If a notification is created for a terminal pod, who updates the status?
If a notification is created that references a notifier that does not exist in the pod spec, what happens? Does that fail at validation time or runtime?
Should the notification reference a pod by UUID and not just name so it is also bound in time and space?
To summarize:
I think this feature needs a garbage collection strategy so we understand the anticipated lifetime of a PodNotification resource as both a provider and a consumer. I think it's reasonable to say that a Pod can have at most X active PodNotification resources. This helps us budget QPS, plan for handler costs, etc. Completed PodNotification resources must be garbage collected in order to avoid denial of service for both the kubelet (memory growth) and kube-apiserver (similar). A TTL on the resource is one way to do that.
Added a TTLSecondsAfterCompletion field to PodNotificationSpec and NotificationSpec
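Roughly like this; only the field name comes from the change described above, while its placement and exact semantics are my reading:

```go
// Added to both PodNotificationSpec and NotificationSpec.
type PodNotificationSpec struct {
	// ... other fields ...

	// TTLSecondsAfterCompletion limits the lifetime of a PodNotification that
	// has finished execution. After the TTL expires, the object is eligible
	// for garbage collection; nil means no automatic cleanup.
	TTLSecondsAfterCompletion *int32 `json:"ttlSecondsAfterCompletion,omitempty"`
}
```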
Exec/probe costs are really expensive in practice and a key input to understanding node sizing, so having a fixed budget to plan for is helpful. For awareness: pods whose liveness or readiness probes run too frequently or make heavy use of exec dominate CPU usage by the kubelet and the runtime. This is why having a limit on the number of concurrent notifications at admission time is helpful.
Will address this in a separate commit.
Adding ContainerNotificationStatus to include ContainerID so it maps to what is shown in PodStatus would let us know if the value is unique across time and space.
Added
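A sketch of the resulting status entry; ContainerID is the field discussed above, the other fields are guesses at plausible companions:

```go
type ContainerNotificationStatus struct {
	// ContainerName identifies the container within the pod spec.
	ContainerName string `json:"containerName"`
	// ContainerID matches the runtime ID reported in PodStatus, binding the
	// result to one container instance in time and space.
	ContainerID string `json:"containerID"`
	// Result of the notifier action for this container, e.g. Succeeded/Failed.
	Result string `json:"result,omitempty"`
}
```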
When in the pod lifecycle is the pod accepting signals? Is a signal ever needed before particular probes (liveness/readiness/etc.) succeed? If not, should we scope it to a well-defined phase of the Pod lifecycle?
Added a clarification to the API: a PodNotification should fail immediately if the Pod is in any state other than PodPending or PodReady; in the PodPending case, the PodNotification will hold until the Pod is ready.
Is there a reason to not explicitly exclude init containers or ephemeral containers as notification targets?
Good point, made it clear in the API that InitContainers and EphemeralContainers will be ignored.
If a pod is undergoing termination, is a signal still sent?
Clarified in the API: only PodReady and PodPending will be considered. For a pod undergoing termination, I am thinking about scenarios where a container cannot be terminated successfully (stuck in some loop, for example) and a signal needs to be sent to the container to force-kill it. Does that sound like a reasonable use case? Another aspect: only containers still in the "Running" state will have the notification executed.
If a notification is created for a terminal pod, who updates the status?
In phase 1, that would be the trusted controller; in phase 2 it will be the kubelet.
If a notification is created that references a notifier that does not exist in the pod spec, what happens? Does that fail at validation time or runtime?
I assume you mean PodNotification here. I am thinking of an admission control for this kind of validation, so that such a PodNotification is rejected at creation. Will address in a separate commit (rough sketch below).
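Roughly, the admission-time check would look like this fragment; `ContainerNotifier` and the `notifiers` list follow the hypothetical sketches earlier in this thread, not the actual API:

```go
// validateNotifierExists rejects a PodNotification whose referenced notifier
// is not declared on any (non-init, non-ephemeral) container of the target pod.
// declaredNotifiers stands in for walking pod.Spec.Containers[*].Notifiers
// once that field exists.
func validateNotifierExists(notifierName string, declaredNotifiers []ContainerNotifier) error {
	for _, n := range declaredNotifiers {
		if n.Name == notifierName {
			return nil
		}
	}
	return fmt.Errorf("no container in the target pod declares notifier %q", notifierName)
}
```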
Should the notification reference a pod by UUID and not just name so it is also bound in time and space?
Good point, added the pod UID into PodNotificationStatus.
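For example (sketch only; just the UID addition comes from the change above):

```go
type PodNotificationStatus struct {
	// PodUID is the UID of the pod instance the notification was executed
	// against, so the object is bound in time and space, not just by name.
	PodUID string `json:"podUID,omitempty"`
	// Containers holds one entry per targeted container (see the
	// ContainerNotificationStatus sketch earlier in the thread).
	Containers []ContainerNotificationStatus `json:"containers,omitempty"`
}
```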
/assign @johnbelamaric
PRR is ready for review.
Given Clayton's comment, it seems unlikely that this will be resolved in time for 1.23 - I am hopeful, but trying to be realistic.
Perhaps this has been discussed, but I have to ask why this needs to be a core API.
It would seem the original motivation - run a quiesce command in each Pod - could be handled via an agent container living in the Pod that is watching a CR. Time may be better spent designing an out-of-tree solution based on a CR and a general agent that can run in a container in the Pod and act locally on those pod containers. Unless there is some reason that we need a privileged host-level process like kubelet to take some action, I can't see a justification for adding this level of complexity for something that can be solved without it.
/retest
I think I disagree with John on this one, but I admit I am pretty close to it, which colors my perspective. It might be worth trying. For example, using a sidecar to enact your own notifications might be possible - how streamlined could it get?
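To make the comparison concrete, a sketch of how streamlined the sidecar version might get: a small agent container watches a hypothetical quiesce CR in its own namespace and runs the quiesce command locally. The group, resource, and script names below are invented for illustration; the trade-offs would be an extra container per pod plus the RBAC needed for the sidecar to watch the CR.

```go
package main

import (
	"context"
	"os"
	"os/exec"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dyn := dynamic.NewForConfigOrDie(cfg)

	// Hypothetical CRD standing in for "please quiesce this pod".
	gvr := schema.GroupVersionResource{Group: "quiesce.example.com", Version: "v1alpha1", Resource: "quiescerequests"}
	ns := os.Getenv("POD_NAMESPACE") // injected via the downward API

	w, err := dyn.Resource(gvr).Namespace(ns).Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		if ev.Type != watch.Added {
			continue
		}
		// The sidecar shares the pod's volumes (and optionally its PID namespace),
		// so it can quiesce the application locally without kubelet involvement.
		_ = exec.Command("/scripts/quiesce.sh").Run() // hypothetical script mounted into the sidecar
	}
}
```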
This is on my PRR list but I don't see any updates since pre-1.23. Assume this is a no-go for 1.24?