Sub-second / More granular probes

Enhancement Description

Allow Probe fields to be specified in seconds or milliseconds.
Kubernetes Enhancement Proposal: draft https://github.com/kubernetes/enhancements/pull/3067
Discussion Link: sig-node discussion notes
Primary contact (assignee): @mikebrow
Responsible SIGs: Sig-Node
Enhancement target (which target equals to which milestone):
- Alpha release target (x.y): 1.29
- Beta release target (x.y):
- Stable release target (x.y):
[ ] Alpha
- [ ] KEP (k/enhancements) update PR(s): https://github.com/kubernetes/enhancements/pull/3067
- [ ] Code (k/k) update PR(s): https://github.com/kubernetes/kubernetes/pull/107958
- [ ] Docs (k/website) update PR(s):

Nov 30 '21 20:11 mikebrow

@mikebrow: The label(s) sig/sig-node cannot be applied, because the repository doesn't have them.

In response to this:

/sig SIG-NODE

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Nov 30 '21 20:11 k8s-ci-robot

@mikebrow: The label(s) sig/sig-node cannot be applied, because the repository doesn't have them.

In response to this:

/sig sig-node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Nov 30 '21 20:11 k8s-ci-robot

/sig node

Nov 30 '21 20:11 mikebrow

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Feb 28 '22 20:02 k8s-triage-robot

/remove-lifecycle stale

Mar 01 '22 15:03 psschwei

/lifecycle stale

Jul 25 '22 14:07 k8s-triage-robot

/remove-lifecycle stale

Jul 25 '22 18:07 psschwei

/milestone v1.26 /label lead-opted-in (I'm doing this on behalf of @ruiwen-zhao / SIG-node)

Sep 30 '22 18:09 marosset

/stage alpha /label tracked/yes

Oct 01 '22 01:10 rhockenbury

Hey @mikebrow 👋, 1.26 Enhancements team here!

Just checking in as we approach Enhancements Freeze on 18:00 PDT on Thursday 6th October 2022.

This enhancement is targeting stage alpha for 1.26

Here's where this enhancement currently stands:

[ ] KEP file using the latest template has been merged into the k/enhancements repo.
[X] KEP status is marked as implementable
[ ] KEP has an updated detailed test plan section filled out
[X] KEP has up to date graduation criteria
[ ] KEP has a production readiness review that has been completed and merged into k/enhancements.

For this KEP, we would need to:

Update the KEP's Test Plan section to incorporate the details stated in the updated detailed test plan
- Include the acknowledgement that is missing from this enhancement's Test Plan
Get PR #3067 merged with the required changes before Enhancements Freeze to make this enhancement eligible for the 1.26 release.

The status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you :)

Oct 03 '22 16:10 Atharva-Shinde

Hello @mikebrow 👋, just a quick check-in again, as we approach the 1.26 Enhancements freeze.

Please plan to complete the action items mentioned in my comment above before Enhancements freeze at 18:00 PDT on Thursday 6th October 2022, i.e. tomorrow.

Note that the current status of the enhancement is marked at-risk :)

Oct 05 '22 16:10 Atharva-Shinde

Hello 👋, 1.26 Enhancements Lead here.

Unfortunately, this enhancement did not meet requirements for enhancements freeze.

If you still wish to progress this enhancement in v1.26, please file an exception request. Thanks!

/milestone clear /label tracked/no /remove-label tracked/yes /remove-label lead-opted-in

Oct 07 '22 01:10 rhockenbury

/lifecycle stale

Jan 05 '23 02:01 k8s-triage-robot

/remove-lifecycle stale

Jan 05 '23 13:01 psschwei

This KEP needs to answer how the limitation of node resources around sockets would be addressed. See https://github.com/kubernetes/kubernetes/pull/115143 for details.

Jan 25 '23 17:01 SergeyKanzhelev

Hello, I am very interested in this KEP. I also happen to wish for sub-second probes, and I was happy to stumble on this. I see there is even an implementation! :) I have looked around and found that, in the corresponding discussion, @aojea also pointed this out.

Would enabling SO_REUSEADDR in addition to SO_LINGER(1), as you did, only on probe-related sockets (hence in your new ProbeDialer) be a good idea to address this? In case of ephemeral port exhaustion, even with the TIME_WAIT state reduced to 1s by your improvement, it could allow the client side (the prober) to reuse an existing socket, but with the risk of misinterpreting an old reply hitting a newer probe on a "recycled" ephemeral port.

On Linux, net.ipv4.tcp_tw_reuse might be used to achieve the same, but this is Linux-only.

Mar 07 '23 15:03 Nibelheims

I am very interested in this KEP.

Curious, do you need it for startup, readiness, or liveness probe? Or all of them? What interval are you thinking about?

Mar 08 '23 02:03 SergeyKanzhelev

Hello Sergey, I'd like to have sub-second delays/periods for all kinds of probes, in order to detect a failure as fast as possible. As explained in the KEP's README, the general idea would be to reduce latencies. I do not have a precise value in mind right now, but the current "second scale" is too coarse. Thank you.

Mar 08 '23 09:03 Nibelheims

nod.. being able to more precisely control the timing is a major part of the KEP and implementation. If you know it takes 1.2 seconds to start up a DB, it doesn't make sense to try at 1 sec then 2 sec, or to wait for the 2 sec mark. Instead, maybe it would be better to wait for 1.5 seconds? It totally depends on the model being used and whether they can switch to a ready-on-event push model instead of a state polling model.

It just needs SIG-NODE approval. The timing of this change vs. all the other changes keeps pushing it back, but I think it's ready any time the SIGs are ready for it.

Mar 08 '23 14:03 mikebrow

The reason I'm asking is that for liveness probes, and partially for readiness probes, using streaming instead of pings may work even better. For HTTP it may be some version of a long poll; for gRPC, a streaming health service. Streaming may eliminate many scalability concerns. The only catch is that it will not work well for startup or for readiness flipping back to Ready. Retrying to establish a connection will be easier to do with the same coarseness of 1s+.

Mar 08 '23 17:03 SergeyKanzhelev

Hello, I agree that streaming (in the sense of keeping the same socket alive for each probe?) would be preferable. Unfortunately, it is not always possible to make the application compliant. One will probably want to use this feature with some existing payload or applications they did not develop. This would also require its own change in the probing mechanism.

A workaround might be possible by using sidecars: implement stream probes (using the same socket forever) targeting a sidecar which would, at its level, perform sub-second probes into the desired container. The sidecar would handle the persistent connection with the k8s stream probe, and would locally perform sub-second checks. This would move the problem from the kubelet into sidecars, which could look like "dissolving" the network overload. However this may seem overly complicated for a questionable result, since each physical node will still have to deal with more resource consumption.

Mar 09 '23 13:03 Nibelheims

unfortunately it is not always possible to make the application compliant.

I understand this. I am worried about using Node network for subsecond probes. Maybe implementing the probes from the Pod's network or streaming can help with this.

Mar 09 '23 17:03 SergeyKanzhelev

@mikebrow can you update the PR to indicate that you want it for 1.28?

May 05 '23 22:05 SergeyKanzhelev

/label lead-opted-in

Jun 08 '23 07:06 SergeyKanzhelev

/milestone v1.28

@mikebrow mentioned at the SIG Node meeting that he wants to see if it can make it into 1.28. Marking for the milestone so as not to lose it.

Jun 09 '23 19:06 SergeyKanzhelev

Hello @mikebrow 👋, Enhancements team here.

Just checking in as we approach enhancements freeze on 01:00 UTC Friday, 16th June 2023.

This enhancement is targeting stage alpha for 1.28 (correct me if otherwise)

Here's where this enhancement currently stands:

[ ] KEP readme using the latest template has been merged into the k/enhancements repo.
[ ] KEP status is marked as implementable for latest-milestone: 1.28
[X] KEP readme has an updated detailed test plan section filled out
[ ] KEP readme has up to date graduation criteria
[ ] KEP has a production readiness review that has been completed and merged into k/enhancements.

For this KEP, we would just need to update the following:

The KEP needs to include the updated readme template.
Address questions inside the Production Readiness Review Questionnaire.
Update the latest-milestone in kep.yaml file to 1.28
Update the status to implementable in kep.yaml file.
Update the graduation criteria in the readme.
Ensure that the PRs are merged.

The status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

Jun 11 '23 17:06 salehsedghpour

Hi @mikebrow 👋, just checking in before the enhancements freeze on 01:00 UTC Friday, 16th June 2023.

The status for this enhancement is at risk.

For this KEP, we would just need to update the following:

The KEP needs to include the updated readme template.
Address questions inside the Production Readiness Review Questionnaire.
Update the latest-milestone in kep.yaml file to 1.28
Update the status to implementable in kep.yaml file.
Update the graduation criteria in the readme.
Ensure that the PRs are merged.

Let me know if I missed anything. Thanks!

Jun 15 '23 15:06 salehsedghpour

@salehsedghpour

Hi @mikebrow 👋, just checking in before the enhancements freeze on 01:00 UTC Friday, 16th June 2023.

The status for this enhancement is at risk.

For this KEP, we would just need to update the following:

The KEP needs to include the updated readme template.

done..

Address questions inside the Production Readiness Review Questionnaire.

done..

Update the latest-milestone in kep.yaml file to 1.28

done..

Update the status to implementable in kep.yaml file.

done needs approval..

Update the graduation criteria in the readme.

done needs approval..

Ensure that the PRs are merged.

thx wip..

Let me know if I missed anything. Thanks!

thank you, nothing noted, you were very thorough :-)

Jun 15 '23 22:06 mikebrow

Hello @mikebrow 👋, 1.28 Enhancements Lead here. Unfortunately, this enhancement did not meet requirements for v1.28 enhancements freeze. Feel free to file an exception to add this back to the release tracking process. Thanks!

Jun 16 '23 01:06 Atharva-Shinde

/milestone clear

Jun 16 '23 02:06 Atharva-Shinde
