kubernetes
fix #104592: termination log causes nodes to run out of inodes on filesystem
What type of PR is this?
/kind bug /sig node
What this PR does / why we need it:
This PR is based on #104632.
On each pod start, kubelet creates a termination log file and mounts it into the container at /dev/termination-log. On k8s clusters where many pods are stuck in the CrashLoopBackOff state, kubelet can create millions of these files, which can cause nodes to run out of inodes on the filesystem where /var/lib/kubelet is located.
Which issue(s) this PR fixes:
Fixes #104592
Special notes for your reviewer:
When kubelet removes evictable containers, it removes the container logs (code here) but omits the termination log. Is the previous termination log useful for pod failure analysis? The termination log holds the message written right before the container exits, and its size is limited to 4096 bytes. Dropping a container also removes its log files, which I think are more helpful for troubleshooting. The last termination log message is already stored in lastState of status.containerStatuses, so we can query the API server for lastState's message for further analysis. Once the container is removed, the termination log bound to it falls out of Kubernetes's scope, since the container it belonged to no longer exists. It is left behind in /var/lib/kubelet/pods/ /containers/ until the pod is dropped, as is the folder /var/lib/kubelet/pods/ .
Does this PR introduce a user-facing change?
kubelet: delete termination message log files on container eviction to prevent running out of inodes
/cc @matthyx @akshaysharama /assign @mrunalp
@pacoxu: GitHub didn't allow me to request PR reviews from the following users: akshaysharama.
Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/priority important-soon /triage accepted
/lgtm
Maybe we could add something more meaningful to the release note? Or maybe document somewhere that this is an improvement to the cleanup necessary on nodes?
kubelet: improve the cleanup of termination log files on nodes to prevent running out of inodes
Is this better?
Maybe something like:
kubelet: delete termination message log files on container eviction to prevent running out of inodes
Updated.
/assign @mrunalp @dchen1107
@pacoxu needs a rebase :(
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: matthyx, pacoxu
Once this PR has been reviewed and has the lgtm label, please ask for approval from dchen1107 by writing /assign @dchen1107 in a comment. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.
@pacoxu needs a rebase :(
Rebased. Thanks for reminding me.
/retest
@pacoxu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command
---|---|---|---|---
pull-kubernetes-unit | 6b7761593f0d06fd284ea9c09ba0c158b6f95f81 | link | true | /test pull-kubernetes-unit
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
If you still need this PR, please rebase; if not, please close it.
This PR has the work-in-progress label; please revisit to see if you still need it, and close it if not.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.
This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale /cc @SergeyKanzhelev @bart0sh
/assign @bobbypage
/hold please review https://github.com/kubernetes/kubernetes/pull/121181 as @Chaunceyctx is tracking this.
/remove-lifecycle stale
Closing this as this fix is being tracked in #121181
/close
@AnishShah: You can't close an active issue/PR unless you authored it or you are a collaborator.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.