Sub-second / More granular probes
Enhancement Description
- Allow Probe fields to be specified in seconds or milliseconds.
- Kubernetes Enhancement Proposal: draft https://github.com/kubernetes/enhancements/pull/3067
- Discussion Link: sig-node discussion notes
- Primary contact (assignee): @mikebrow
- Responsible SIGs: Sig-Node
- Enhancement target (which target equals to which milestone):
- Alpha release target (x.y): 1.29
- Beta release target (x.y):
- Stable release target (x.y):
- [ ] Alpha
  - [ ] KEP (k/enhancements) update PR(s): https://github.com/kubernetes/enhancements/pull/3067
  - [ ] Code (k/k) update PR(s): https://github.com/kubernetes/kubernetes/pull/107958
  - [ ] Docs (k/website) update PR(s):
- [ ] KEP (
@mikebrow: The label(s) `sig/sig-node` cannot be applied, because the repository doesn't have them.
In response to this:
/sig SIG-NODE
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig node
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/milestone v1.26
/label lead-opted-in
(I'm doing this on behalf of @ruiwen-zhao / SIG-node)
/stage alpha
/label tracked/yes
Hey @mikebrow 👋, 1.26 Enhancements team here!
Just checking in as we approach Enhancements Freeze at 18:00 PDT on Thursday 6th October 2022.
This enhancement is targeting stage `alpha` for 1.26.
Here's where this enhancement currently stands:
- [ ] KEP file using the latest template has been merged into the k/enhancements repo.
- [X] KEP status is marked as `implementable`
- [ ] KEP has an updated detailed test plan section filled out
- [X] KEP has up to date graduation criteria
- [ ] KEP has a production readiness review that has been completed and merged into k/enhancements.
For this KEP, we would need to:
- Update the KEP's Test Plan section to incorporate the details stated in the updated detailed test plan
- Include the acknowledgement that is missing from this enhancement's Test Plan
- Get PR #3067 merged with the required changes before Enhancements Freeze to make this enhancement eligible for the 1.26 release.
The status of this enhancement is marked as `at risk`. Please keep the issue description up-to-date with appropriate stages as well.
Thank you :)
Hello @mikebrow 👋, just a quick check-in again, as we approach the 1.26 Enhancements freeze.
Please plan to get the action items mentioned in my comment above done before Enhancements freeze at 18:00 PDT on Thursday 6th October 2022, i.e. tomorrow.
For the record, the current status of the enhancement is marked `at-risk` :)
Hello 👋, 1.26 Enhancements Lead here.
Unfortunately, this enhancement did not meet requirements for enhancements freeze.
If you still wish to progress this enhancement in v1.26, please file an exception request. Thanks!
/milestone clear
/label tracked/no
/remove-label tracked/yes
/remove-label lead-opted-in
This KEP needs to answer how the limits on node resources around sockets would be addressed. See https://github.com/kubernetes/kubernetes/pull/115143 for details.
Hello, I am very interested in this KEP. I also want sub-second probes, and I was happy to stumble on this. I see there is even an implementation! :) I have looked around and found that, in the corresponding discussion, @aojea also pointed this out.
Would enabling `SO_REUSEADDR` in addition to `SO_LINGER(1)`, as you did, only on probe-related sockets (hence in your new `ProbeDialer`), be a good idea to address this? In case of ephemeral port exhaustion, even with the `TIME_WAIT` state reduced to 1s by your improvement, it could allow the client side (the prober) to reuse an existing socket (but with the risk of misinterpreting an old reply that hits a newer probe on a "recycled" ephemeral port)?
On Linux, `net.ipv4.tcp_tw_reuse` might be used to achieve the same, but this is Linux only.
> I am very interested in this KEP.
Curious, do you need it for startup, readiness, or liveness probe? Or all of them? What interval are you thinking about?
Hello Sergey, I'd like to have sub-second delays/periods for all kinds of probes, in order to detect a failure as fast as possible. As explained in the KEP's README, the general idea is to reduce latencies. I do not have a precise value in mind right now, but the current "second scale" is too coarse. Thank you.
Nod. Being able to control the timing more precisely is a major part of the KEP and implementation. If you know it takes 1.2 seconds to start up a DB, it doesn't make sense to try at 1 sec and then 2 sec, or to wait for the 2 sec mark. Instead, maybe it would be better to wait for 1.5 seconds? It totally depends on the model being used and whether they can switch to a ready-on-event push model instead of a state polling model.
It just needs SIG-NODE approval. The timing of this change versus all the other changes keeps pushing it back, but I think it's ready any time the SIGs are ready for it.
The reason I'm asking is that for liveness probes, and partially for readiness probes, using streaming instead of pings may work even better. For HTTP it may be some version of a long poll; for gRPC, a streaming health service. Streaming may eliminate many scalability concerns. The only catch is that it will not work well for startup probes or for readiness flipping back to Ready. Retrying to establish a connection will be easier to do with the same coarseness of 1s+.
Hello, I agree that streaming (in the sense of keeping the same socket alive for each probe?) would be preferable; unfortunately, it is not always possible to make the application compliant. One will probably want to use this feature with existing payloads or applications that they have not developed themselves. This would also require its own change in the probing mechanism.
A workaround might be possible using sidecars: implement stream probes (using the same socket forever) that target a sidecar which, at its level, performs sub-second probes against the desired container. The sidecar would handle the persistent connection with the k8s stream probe and would locally perform the sub-second checks. This would move the problem from the kubelet into sidecars, which could look like "dissolving" the network overload. However, this may seem overly complicated for a questionable result, since each physical node would still have to deal with more resource consumption.
> unfortunately it is not always possible to make the application compliant.
I understand this. I am worried about using Node network for subsecond probes. Maybe implementing the probes from the Pod's network or streaming can help with this.
@mikebrow can you update the PR to indicate that you want it for 1.28?
/label lead-opted-in
/milestone v1.28
@mikebrow mentioned at the SIG Node meeting that he wants to see if it can make it into 1.28. Marking it for the milestone so we do not lose it.
Hello @mikebrow 👋, Enhancements team here.
Just checking in as we approach enhancements freeze on 01:00 UTC Friday, 16th June 2023.
This enhancement is targeting stage `alpha` for 1.28 (correct me, if otherwise)
Here's where this enhancement currently stands:
- [ ] KEP readme using the latest template has been merged into the k/enhancements repo.
- [ ] KEP status is marked as `implementable` for `latest-milestone: 1.28`
- [X] KEP readme has an updated detailed test plan section filled out
- [ ] KEP readme has up to date graduation criteria
- [ ] KEP has a production readiness review that has been completed and merged into k/enhancements.
For this KEP, we would just need to update the following:
- The KEP needs to include the updated readme template.
- Address questions inside the Production Readiness Review Questionnaire.
- Update the `latest-milestone` in the `kep.yaml` file to `1.28`
- Update the status to implementable in the `kep.yaml` file.
- Update the graduation criteria in the readme.
- Ensure that the PRs are merged.
The status of this enhancement is marked as `at risk`. Please keep the issue description up-to-date with appropriate stages as well. Thank you!
Hi @mikebrow 👋, just checking in before the enhancements freeze on 01:00 UTC Friday, 16th June 2023.
The status for this enhancement is `at risk`.
For this KEP, we would just need to update the following:
- The KEP needs to include the updated readme template.
- Address questions inside the Production Readiness Review Questionnaire.
- Update the `latest-milestone` in the `kep.yaml` file to `1.28`
- Update the status to implementable in the `kep.yaml` file.
- Update the graduation criteria in the readme.
- Ensure that the PRs are merged.
Let me know if I missed anything. Thanks!
@salehsedghpour
> Hi @mikebrow 👋, just checking in before the enhancements freeze on 01:00 UTC Friday, 16th June 2023.
> The status for this enhancement is `at risk`.
> For this KEP, we would just need to update the following:
> - The KEP requires to include the updated readme template.
done..
> - Address questions inside the Production Readiness Review Questionnaire.
done..
> - Update the `latest-milestone` in the `kep.yaml` file to `1.28`
done..
> - Update the status to implementable in the `kep.yaml` file.
done, needs approval..
> - Update the graduation criteria in the readme.
done, needs approval..
> - Ensure that the PRs are merged.
thx, wip..
> Let me know if I missed anything. Thanks!
Thank you, nothing noted, you were very thorough :-)
Hello @mikebrow 👋, 1.28 Enhancements Lead here. Unfortunately, this enhancement did not meet requirements for v1.28 enhancements freeze. Feel free to file an exception to add this back to the release tracking process. Thanks!
/milestone clear