KEP-1977: Add KEP for ContainerNotifier

Open xing-yang opened this issue 5 years ago • 44 comments

Enhancement issue: https://github.com/kubernetes/enhancements/issues/1977

xing-yang avatar Sep 20 '20 16:09 xing-yang

CC @yuxiangqian

xing-yang avatar Sep 20 '20 16:09 xing-yang

/assign @thockin @saad-ali @liggitt

xing-yang avatar Sep 20 '20 16:09 xing-yang

/assign @sjenning @derekwaynecarr @dchen1107

xing-yang avatar Sep 20 '20 16:09 xing-yang

CC @andrewsykim

xing-yang avatar Sep 22 '20 13:09 xing-yang

I would like to see example pod specs (at least the container blocks) for the examples and use cases mentioned.

I understand I should be able to infer what it would look like from the API definition, but examples would make it clearer.

Writing them out exposes issues like the one @thockin noticed here.

sjenning avatar Sep 22 '20 16:09 sjenning
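
A hypothetical sketch of what such a container block might look like. The `notifiers` field and its sub-fields are illustrative guesses at the proposed API shape (modeled on lifecycle hooks), not the KEP's final definition:

```yaml
# Hypothetical sketch only: "notifiers" and its sub-fields are illustrative,
# not the KEP's final API.
apiVersion: v1
kind: Pod
metadata:
  name: mysql-0
  labels:
    app: mysql
spec:
  containers:
  - name: mysql
    image: mysql:8.0
    # Named actions that a notification could invoke on demand, analogous to
    # lifecycle hooks but triggered by a Notification rather than at fixed
    # points in the container lifecycle.
    notifiers:
    - name: quiesce
      exec:
        command: ["mysql", "-e", "FLUSH TABLES WITH READ LOCK"]
    - name: unquiesce
      exec:
        command: ["mysql", "-e", "UNLOCK TABLES"]
```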

CC @ashish-amarnath

xing-yang avatar Sep 23 '20 13:09 xing-yang

Given the open comments should we discuss again? Is this stalled on a decision or on hands-on-keyboards?

@thockin Yes, I think we should discuss again as there are a few open issues. I'll try to set up a meeting.

xing-yang avatar Oct 28 '20 15:10 xing-yang

Given the open comments should we discuss again? Is this stalled on a decision or on hands-on-keyboards?

@thockin there are a couple of other things (VolumeSnapshot GA, KubeCon US, etc.) going on concurrently for both @xing-yang and me, and we should have more cycles after this week. I am reworking some pieces of this KEP to make it clearer, and we will schedule a discussion to talk about the Notification piece. Thanks for your patience.

yuxiangqian avatar Oct 28 '20 16:10 yuxiangqian

NP - I have a fair bit of free time for the next few weeks, so I am circling back to all the KEPs I am tracking

thockin avatar Oct 28 '20 23:10 thockin

1.21 KEP freeze is in O(days) - if this wants review before then, I need updates.

thockin avatar Feb 03 '21 22:02 thockin

Apologies for the late update. We have revisited some of the designs in this KEP to address some of the major concerns raised in the previous version (details in this document).

Major changes:

  1. Resolve the concerns around potential scalability issues when a Notification targets a large number of Pods (on the order of thousands).
  2. Shrink the scope of phase 1 to reduce implementation complexity and simplify the PodNotification API.
  3. Resolve the concern about too-deeply nested structures.

yuxiangqian avatar Mar 10 '21 08:03 yuxiangqian

/test pull-enhancements-verify

yuxiangqian avatar Mar 10 '21 17:03 yuxiangqian

I am pretty close to happy with this.

thockin avatar Apr 29 '21 23:04 thockin

@yuxiangqian we need to add a file for production readiness here: https://github.com/kubernetes/enhancements/tree/master/keps/prod-readiness/sig-node

xing-yang avatar May 11 '21 03:05 xing-yang

/lgtm

thockin avatar May 14 '21 16:05 thockin

KEP number in kep.yaml should be the enhancement issue number 1977. Fixed it now.

xing-yang avatar May 15 '21 19:05 xing-yang

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 13 '21 20:08 k8s-triage-robot

/remove-lifecycle stale

Phase 1 looks acceptable to me.

The kubelet needs no changes, only the pod spec does. The kubelet does not need to be notification-aware in phase 1.

Completion of phase 1 will allow us to prove out this idea and root out potential issues with the design, both API and performance/scale, before lumping the controller logic into the kubelet where it will be harder to debug and isolate as a source of scaling issues.

sjenning avatar Aug 16 '21 20:08 sjenning
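
To make the phase 1 model concrete, a hypothetical sketch of the fan-out: a user creates one Notification, and the trusted controller (not the kubelet, in phase 1) creates one PodNotification per matching pod. The API group, version, and field names are assumptions for illustration:

```yaml
# Hypothetical sketch; group/version and field names are illustrative.
apiVersion: notification.k8s.io/v1alpha1
kind: Notification
metadata:
  name: quiesce-before-snapshot
spec:
  selector:
    matchLabels:
      app: mysql
  notifier: quiesce   # must name a notifier declared in the container spec
```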

@sjenning thanks for taking another look at this. As I mentioned at last week's SIG Node meeting, the new proposal narrows the scope considerably based on our offline review meetings and comments. Because the reviewers were overwhelmed last cycle, we postponed this feature; hopefully we can finish phase 1 in 1.23 this cycle.

dchen1107 avatar Aug 17 '21 16:08 dchen1107

I had a number of questions/clarifications that I hope we can get through in the SIG Node call; otherwise I will paste them here right after.

derekwaynecarr avatar Aug 17 '21 17:08 derekwaynecarr

Derek and I talked about this. Some thoughts.

Point-in-time nature of Notifications: a Notification created at time X might behave differently from one created at time X+1, due to the PreExisting SelectorPolicy and Pod creation times; i.e., if the PodNotification is created before the Pod, the notification fails. PodNotifications are a one-shot thing.

The Notification controller watches all pods for CREATE (UPDATE?) to see if the selector matches, and then creates PodNotifications. On container start? Container ready? Pod ready? Do pods that have deletionTimestamp set get notifications?

If the pod is deleted, are all PodNotifications that reference the pod deleted too? Is the ownerRef for the PodNotification the pod, or the Notification?

The number of PodNotification resources in the cluster would be the sum of all matching pods over all Notifications, a potentially very large number. How are these cleaned up? The Notification controller (i.e. the kubelet in the future) would have to WATCH PodNotifications cluster-wide, right? I don't see a way to filter the watch apiserver-side to only PodNotifications referencing a pod a particular kubelet is running. Thus, for each change to a PodNotification cluster-wide, a watch event is sent to each kubelet.

QPS for the controller updating all these PodNotification resources will need to be high, stressing the container runtime and apiserver: a potential scaling problem. We have had issues with kubelet QPS in the past delaying pod status updates and leading to all manner of pain.

We have a number of container types, including init and ephemeral containers, that also use the Container type. Do notifications work for them? It doesn't seem to make sense for those container types.

sjenning avatar Aug 17 '21 17:08 sjenning

I don't see a way to filter the watch apiserver-side to only PodNotifications referencing a pod a particular kubelet is running. Thus, for each change to a PodNotification cluster-wide, a watch event is sent to each kubelet.

For this item, I think it would be good if we could integrate with the NodeAuthorizer admission plugin.

derekwaynecarr avatar Aug 17 '21 18:08 derekwaynecarr

To summarize:

I think this feature needs a garbage collection strategy so we understand the anticipated lifetime of a PodNotification resource as both provider and consumer. I would think it's reasonable to say that a Pod can have at most X active PodNotification resources. This helps us budget QPS, plan for handler costs, etc. Completed PodNotification resources must be garbage collected in order to avoid denial of service for both the kubelet (memory growth) and kube-api (similar). A TTL on the resource is one way to do that.

Exec probe costs are really high in practice and are a key input in understanding node sizing, so having a fixed budget to plan for is helpful. For awareness, pods whose liveness or readiness probes run too frequently or make heavy use of exec dominate CPU usage by the kubelet and runtime. This is why a limit, enforced at admission time, on the number of concurrent notifications is helpful.

Updating ContainerNotificationStatus to include ContainerID, so it maps to what is shown in PodStatus, would let us know whether the value is unique across time and space.

When in the pod lifecycle does the pod accept signals? Is a signal ever needed before particular probes (liveness/readiness/etc.) succeed? If not, should we scope it to a particular well-defined phase in the Pod lifecycle?

Is there a reason to not explicitly exclude init containers or ephemeral containers as notification targets?

If a pod is undergoing termination, is a signal still sent?

If a notification is created for a terminal pod, who updates the status?

If a notification is created that references a notifier that does not exist in the pod spec, what happens? Does that fail at validation time or runtime?

Should the notification reference a pod by UUID and not just name so it is also bound in time and space?

derekwaynecarr avatar Aug 17 '21 18:08 derekwaynecarr

To summarize:

I think this feature needs a garbage collection strategy so we understand the anticipated lifetime of a PodNotification resource as both provider and consumer. I would think it's reasonable to say that a Pod can have at most X active PodNotification resources. This helps us budget QPS, plan for handler costs, etc. Completed PodNotification resources must be garbage collected in order to avoid denial of service for both the kubelet (memory growth) and kube-api (similar). A TTL on the resource is one way to do that.

Added a TTLSecondsAfterCompletion field to PodNotificationSpec and NotificationSpec
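
A minimal sketch of how that field might look, assuming it follows the Job API's `ttlSecondsAfterFinished` convention; everything besides `ttlSecondsAfterCompletion` is illustrative:

```yaml
# Hypothetical sketch; only ttlSecondsAfterCompletion is the field under
# discussion, the rest is illustrative.
apiVersion: notification.k8s.io/v1alpha1
kind: PodNotification
metadata:
  name: quiesce-mysql-0
spec:
  podName: mysql-0
  notifier: quiesce
  ttlSecondsAfterCompletion: 300   # garbage-collected 5 minutes after completion
```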

Exec probe costs are really high in practice and are a key input in understanding node sizing, so having a fixed budget to plan for is helpful. For awareness, pods whose liveness or readiness probes run too frequently or make heavy use of exec dominate CPU usage by the kubelet and runtime. This is why a limit, enforced at admission time, on the number of concurrent notifications is helpful.

Will address this in a separate commit.

Updating ContainerNotificationStatus to include ContainerID, so it maps to what is shown in PodStatus, would let us know whether the value is unique across time and space.

Added

When in the pod lifecycle does the pod accept signals? Is a signal ever needed before particular probes (liveness/readiness/etc.) succeed? If not, should we scope it to a particular well-defined phase in the Pod lifecycle?

Added clarification to the API: a PodNotification should fail immediately if a Pod has a PodPhase other than PodPending or PodReady; in the case of PodPending, the PodNotification will hold until the Pod is ready.

Is there a reason to not explicitly exclude init containers or ephemeral containers as notification targets?

Good point, made it clear in the API that InitContainers and EphemeralContainers will be ignored.

If a pod is undergoing termination, is a signal still sent?

Clarified in the API: only PodReady and PodPending will be considered. For a pod undergoing termination, I am thinking about scenarios where a container cannot be terminated successfully (stuck in some loop, for example) and a signal needs to be sent to the container to force-kill it. Does that sound like a reasonable use case? Another aspect is that only containers still in the "Running" state will have the notification executed.

If a notification is created for a terminal pod, who updates the status?

In phase 1, that would be the trusted controller; in phase 2 it will be the kubelet.

If a notification is created that references a notifier that does not exist in the pod spec, what happens? Does that fail at validation time or runtime?

I assume you mean PodNotification here. I am thinking of an admission control for this kind of validation, to prevent such a PodNotification from being created. Will address in a separate commit.
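
A sketch of what such a validating admission hook could look like, using the standard `admissionregistration.k8s.io/v1` API; the group/resource names and the webhook service are assumptions:

```yaml
# Hypothetical sketch: reject a PodNotification at CREATE time if it references
# a notifier not declared in the target pod's spec. The notification.k8s.io
# group and the webhook service/path are illustrative assumptions.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: podnotification-validation
webhooks:
- name: podnotifications.notification.k8s.io
  admissionReviewVersions: ["v1"]
  sideEffects: None
  rules:
  - apiGroups: ["notification.k8s.io"]
    apiVersions: ["v1alpha1"]
    resources: ["podnotifications"]
    operations: ["CREATE"]
  clientConfig:
    service:
      namespace: kube-system
      name: podnotification-webhook   # hypothetical webhook backend
      path: /validate
```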

Should the notification reference a pod by UUID and not just name so it is also bound in time and space?

Good point, added UID to PodNotificationStatus.
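
Pulling the two status additions together (the ContainerID from the earlier item and the pod UID from this one), a hypothetical status sketch; all field names and values are illustrative:

```yaml
# Hypothetical sketch combining the ContainerID and pod UID additions.
status:
  podUID: 1f1e9eab-0c1d-4e2f-a3b4-c5d6e7f80912   # illustrative; binds the notification to one pod incarnation
  containerNotificationStatuses:
  - name: mysql
    containerID: containerd://3f4e...            # matches the ID shown in PodStatus
    state: Succeeded
```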

yuxiangqian avatar Aug 23 '21 09:08 yuxiangqian

/assign @johnbelamaric

PRR is ready for review.

xing-yang avatar Sep 01 '21 13:09 xing-yang

Given Clayton's comment, it seems unlikely that this will be resolved in time for 1.23 - I am hopeful, but trying to be realistic.

thockin avatar Sep 03 '21 21:09 thockin

Perhaps this has been discussed, but I have to ask why this needs to be a core API.

It would seem the original motivation - run a quiesce command in each Pod - could be handled via an agent container living in the Pod that is watching a CR. Time may be better spent designing an out-of-tree solution based on a CR and a general agent that can run in a container in the Pod and act locally on those pod containers. Unless there is some reason that we need a privileged host-level process like kubelet to take some action, I can't see a justification for adding this level of complexity for something that can be solved without it.

johnbelamaric avatar Sep 07 '21 21:09 johnbelamaric
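
A rough sketch of that out-of-tree shape: a sidecar agent that watches a custom resource and acts locally on the pod's containers. The CR kind, agent image, and use of a shared process namespace are all assumptions, not a worked design:

```yaml
# Hypothetical sketch of the out-of-tree alternative. The agent image and CR
# it watches are illustrative; shareProcessNamespace is one (assumed) way for
# the sidecar to act on the app container without kubelet involvement.
apiVersion: v1
kind: Pod
metadata:
  name: mysql-0
  labels:
    app: mysql
spec:
  shareProcessNamespace: true        # lets the agent see/signal the app process
  serviceAccountName: quiesce-agent  # RBAC to get/list/watch the CR
  containers:
  - name: mysql
    image: mysql:8.0
  - name: quiesce-agent
    image: example.com/quiesce-agent:latest   # hypothetical agent
    # Watches, e.g., Quiesce CRs whose selector matches this pod's labels and
    # runs the configured quiesce/unquiesce commands locally.
```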

/retest

xing-yang avatar Dec 07 '21 20:12 xing-yang

I think I disagree with John on this one, but I admit I am pretty close to it, which colors my perspective. It might be worth trying. For example, using a sidecar to enact your own notifications might be possible - how streamlined could it get?

thockin avatar Dec 21 '21 23:12 thockin

This is on my PRR list, but I don't see any updates since pre-1.23. I assume this is a no-go for 1.24?

johnbelamaric avatar Jan 31 '22 21:01 johnbelamaric