fix #104592: termination log causes nodes to run out of inodes on the filesystem

Open pacoxu opened this issue 2 years ago • 6 comments

What type of PR is this?

/kind bug
/sig node

What this PR does / why we need it:

This PR is based on #104632.

Each time a container in a pod starts, kubelet creates a termination log file and mounts it into the container at /dev/termination-log. On k8s clusters where many pods are stuck in the CrashLoopBackOff state, kubelet will create millions of these files, which can leave nodes without free inodes on the filesystem where /var/lib/kubelet is located.

Which issue(s) this PR fixes:

Fixes #104592

Special notes for your reviewer:

When kubelet removes evictable containers, it removes the container logs (code here) but leaves the termination log behind. Is the previous termination log useful for pod failure analysis? The termination log holds the message written right before the container exits, and its size is limited to 4096 bytes. Removing a container already removes its log files, which I think are more helpful for troubleshooting. The last termination message is also stored in the lastState of status.containerStatuses, so we can ask the api-server for lastState's message for further analysis. After the container is removed, the termination log bound to it is out of Kubernetes's scope, because the container it was bound to no longer exists. It is left alone in /var/lib/kubelet/pods//containers/ until the pod gets dropped, as is the folder /var/lib/kubelet/pods/.

Does this PR introduce a user-facing change?

kubelet: delete termination message log files on container eviction to prevent running out of inodes

/cc @matthyx @akshaysharama
/assign @mrunalp

pacoxu avatar Aug 09 '22 10:08 pacoxu

@pacoxu: GitHub didn't allow me to request PR reviews from the following users: akshaysharama.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 09 '22 10:08 k8s-ci-robot

/priority important-soon
/triage accepted

pacoxu avatar Aug 10 '22 03:08 pacoxu

/lgtm

Maybe we could add something more meaningful to the release note? Or maybe document somewhere that this is an improvement on the cleanup necessary on nodes?

matthyx avatar Aug 11 '22 06:08 matthyx

kubelet: improve the cleanup of termination log files on nodes to prevent running out of inodes

Is this better?

pacoxu avatar Aug 11 '22 07:08 pacoxu

Maybe something like: kubelet: delete termination message log files on container eviction to prevent running out of inodes

matthyx avatar Aug 11 '22 08:08 matthyx

Updated.

/assign @mrunalp @dchen1107

pacoxu avatar Aug 24 '22 08:08 pacoxu

@pacoxu needs a rebase :(

dims avatar Oct 19 '22 13:10 dims

New changes are detected. LGTM label has been removed.

k8s-ci-robot avatar Oct 20 '22 02:10 k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: matthyx, pacoxu
Once this PR has been reviewed and has the lgtm label, please ask for approval from dchen1107 by writing /assign @dchen1107 in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Oct 20 '22 02:10 k8s-ci-robot

@pacoxu needs a rebase :(

Rebased. Thanks for reminding me.

pacoxu avatar Oct 20 '22 02:10 pacoxu

/retest

dims avatar Oct 20 '22 13:10 dims

@pacoxu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-unit
Commit: 6b7761593f0d06fd284ea9c09ba0c158b6f95f81
Details: link
Required: true
Rerun command: /test pull-kubernetes-unit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

k8s-ci-robot avatar Nov 06 '22 06:11 k8s-ci-robot

If you still need this PR, please rebase; if not, please close the PR.

dims avatar Dec 12 '22 15:12 dims

This PR has the work-in-progress label. Please revisit it to see if you still need it, and close it if not.

dims avatar Dec 12 '22 15:12 dims

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: matthyx, pacoxu
Once this PR has been reviewed and has the lgtm label, please ask for approval from dchen1107. For more information see the Kubernetes Code Review Process.

k8s-ci-robot avatar Mar 30 '23 05:03 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 28 '23 10:06 k8s-triage-robot

/remove-lifecycle stale
/cc @SergeyKanzhelev @bart0sh

pacoxu avatar Jul 04 '23 07:07 pacoxu

/assign @bobbypage

bart0sh avatar Jul 23 '23 12:07 bart0sh

/hold please review https://github.com/kubernetes/kubernetes/pull/121181 as @Chaunceyctx is tracking this.

pacoxu avatar Oct 16 '23 03:10 pacoxu

/lifecycle stale

k8s-triage-robot avatar Jan 22 '24 10:01 k8s-triage-robot

/remove-lifecycle stale

pacoxu avatar Feb 18 '24 06:02 pacoxu

Closing this as this fix is being tracked in #121181

/close

AnishShah avatar Mar 06 '24 18:03 AnishShah

@AnishShah: You can't close an active issue/PR unless you authored it or you are a collaborator.

k8s-ci-robot avatar Mar 06 '24 18:03 k8s-ci-robot