kubernetes fix 104592 termination log causes nodes to run out of inodes on filesystem

Open pacoxu opened this issue 2 years ago • 6 comments

What type of PR is this?

/kind bug
/sig node

What this PR does / why we need it:

This PR is based on #104632.

Each time a container in a pod starts, the kubelet mounts a termination log file into the container at /dev/termination-log. On clusters where many pods are stuck in the CrashLoopBackOff state, the kubelet can create millions of these files, which can exhaust the inodes of the filesystem that holds /var/lib/kubelet.
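To get a feel for how many such files a node has accumulated, something like the following Go sketch could be run against the kubelet pods directory. It is only an illustration: the function names, the demo tree, and the assumption that the files live somewhere below a directory named "containers" are mine, not taken from the kubelet source.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// countContainerFiles walks root and counts regular files that sit below a
// directory named "containers". That directory name is an assumption made
// for this sketch.
func countContainerFiles(root string) (int, error) {
	count := 0
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		// Check every parent directory component of the file.
		for _, part := range strings.Split(filepath.Dir(rel), string(filepath.Separator)) {
			if part == "containers" {
				count++
				break
			}
		}
		return nil
	})
	return count, err
}

// buildDemoTree creates a throwaway directory tree (hypothetical pod and
// container names) so the sketch can run without touching a real node.
func buildDemoTree() (string, func(), error) {
	root, err := os.MkdirTemp("", "kubelet-pods-demo")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(root) }
	files := []string{
		"pod-a/containers/app/0a1b2c3d",
		"pod-a/containers/app/4e5f6a7b",
		"pod-b/containers/web/8c9d0e1f",
		"pod-a/volumes/data/not-counted",
	}
	for _, f := range files {
		p := filepath.Join(root, filepath.FromSlash(f))
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			cleanup()
			return "", nil, err
		}
		if err := os.WriteFile(p, []byte("exit reason\n"), 0o644); err != nil {
			cleanup()
			return "", nil, err
		}
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := buildDemoTree()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer cleanup()

	n, err := countContainerFiles(root)
	if err != nil {
		fmt.Println("walk failed:", err)
		return
	}
	fmt.Println("files under containers directories:", n)
}
```

On a real node the root argument would be the kubelet pods directory instead of the demo tree; each counted file consumes one inode.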

Which issue(s) this PR fixes:

Fixes #104592

Special notes for your reviewer:

When the kubelet removes evictable containers, it removes the container logs (code here) but leaves the termination log behind.

Is the previous termination log useful for pod failure analysis? The termination log holds the message written right before the container exits, and its size is limited to 4096 bytes. Dropping a container already removes its log files, which I think are more helpful for troubleshooting. The last termination message is also stored in the lastState of status.containerStatuses, so we can send a request to the API server and read lastState's message for further analysis.

After the container is removed, the termination log bound to it is out of Kubernetes's scope, since the container it was bound to no longer exists. The file is left alone in /var/lib/kubelet/pods//containers/, as is the folder /var/lib/kubelet/pods/, until the pod gets dropped.
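As a rough illustration of reading the last termination message from the API rather than from the file on disk, the Go sketch below parses a pod object in JSON form (for example, the output of kubectl get pod -o json) and collects status.containerStatuses[].lastState.terminated.message per container. The type and function names and the sample pod are hypothetical; only the JSON field path comes from the Pod API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podStatusView models only the fields this sketch needs from a Pod object.
type podStatusView struct {
	Status struct {
		ContainerStatuses []struct {
			Name      string `json:"name"`
			LastState struct {
				Terminated *struct {
					ExitCode int    `json:"exitCode"`
					Message  string `json:"message"`
				} `json:"terminated"`
			} `json:"lastState"`
		} `json:"containerStatuses"`
	} `json:"status"`
}

// lastTerminationMessages returns, per container name, the message recorded
// in lastState.terminated. Containers without a terminated lastState are
// skipped.
func lastTerminationMessages(podJSON []byte) (map[string]string, error) {
	var pod podStatusView
	if err := json.Unmarshal(podJSON, &pod); err != nil {
		return nil, err
	}
	msgs := make(map[string]string)
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.LastState.Terminated; t != nil {
			msgs[cs.Name] = t.Message
		}
	}
	return msgs, nil
}

func main() {
	// Hypothetical sample; a real object would come from the API server.
	sample := []byte(`{
	  "status": {
	    "containerStatuses": [
	      {"name": "app", "lastState": {"terminated": {"exitCode": 1, "message": "config file missing"}}},
	      {"name": "sidecar", "lastState": {}}
	    ]
	  }
	}`)

	msgs, err := lastTerminationMessages(sample)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	for name, msg := range msgs {
		fmt.Printf("%s: %s\n", name, msg)
	}
}
```

Because the message survives in the pod status, deleting the on-disk file of an already removed container should not lose the most recent failure reason.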

Does this PR introduce a user-facing change?

kubelet: delete termination message log files on container eviction to prevent running out of inodes
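The cleanup described in this release note could look roughly like the Go sketch below. It is not the code from this PR: removeTerminationLog is a hypothetical helper, and the choice to treat an already missing file as success is my assumption about what an eviction path would want.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// removeTerminationLog deletes the termination message file at path.
// A file that is already gone is not treated as an error, so the call is
// safe to repeat (an assumption about the desired behavior).
func removeTerminationLog(path string) error {
	err := os.Remove(path)
	if err != nil && !errors.Is(err, fs.ErrNotExist) {
		return fmt.Errorf("remove termination log %q: %w", path, err)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "termination-log-demo")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(dir)

	logPath := filepath.Join(dir, "termination-log")
	if err := os.WriteFile(logPath, []byte("exit reason\n"), 0o644); err != nil {
		fmt.Println("setup failed:", err)
		return
	}

	// The first call deletes the file; the second finds nothing and still succeeds.
	for i := 0; i < 2; i++ {
		if err := removeTerminationLog(logPath); err != nil {
			fmt.Println("cleanup failed:", err)
			return
		}
	}

	_, statErr := os.Stat(logPath)
	fmt.Println("file removed:", errors.Is(statErr, fs.ErrNotExist))
}
```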

/cc @matthyx @akshaysharama
/assign @mrunalp

Aug 09 '22 10:08 pacoxu

@pacoxu: GitHub didn't allow me to request PR reviews from the following users: akshaysharama.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Aug 09 '22 10:08 k8s-ci-robot

/priority important-soon
/triage accepted

Aug 10 '22 03:08 pacoxu

/lgtm

Maybe we could add something more meaningful to the release note? Or maybe document somewhere that this improves the cleanup needed on nodes?

Aug 11 '22 06:08 matthyx

kubelet: improve the cleanup of termination log files on nodes to prevent running out of inodes

Is this better?

Aug 11 '22 07:08 pacoxu

Maybe something like: kubelet: delete termination message log files on container eviction to prevent running out of inodes

Aug 11 '22 08:08 matthyx

Updated.

/assign @mrunalp @dchen1107

Aug 24 '22 08:08 pacoxu

@pacoxu needs a rebase :(

Oct 19 '22 13:10 dims

New changes are detected. LGTM label has been removed.

Oct 20 '22 02:10 k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: matthyx, pacoxu

Once this PR has been reviewed and has the lgtm label, please ask for approval from dchen1107 by writing /assign @dchen1107 in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/kubelet/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Oct 20 '22 02:10 k8s-ci-robot

@pacoxu needs a rebase :(

Rebased. Thanks for reminding me.

Oct 20 '22 02:10 pacoxu

/retest

Oct 20 '22 13:10 dims

@pacoxu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kubernetes-unit	6b7761593f0d06fd284ea9c09ba0c158b6f95f81	link	true	`/test pull-kubernetes-unit`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Nov 06 '22 06:11 k8s-ci-robot

If you still need this PR, please rebase; if not, please close the PR.

Dec 12 '22 15:12 dims

This PR has the label work-in-progress. Please revisit it to see if you still need it, and close it if not.

Dec 12 '22 15:12 dims

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: matthyx, pacoxu

Once this PR has been reviewed and has the lgtm label, please ask for approval from dchen1107. For more information see the Kubernetes Code Review Process.

Mar 30 '23 05:03 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Jun 28 '23 10:06 k8s-triage-robot

/remove-lifecycle stale
/cc @SergeyKanzhelev @bart0sh

Jul 04 '23 07:07 pacoxu

/assign @bobbypage

Jul 23 '23 12:07 bart0sh

/hold

Please review https://github.com/kubernetes/kubernetes/pull/121181, as @Chaunceyctx is tracking this.

Oct 16 '23 03:10 pacoxu

/lifecycle stale

Jan 22 '24 10:01 k8s-triage-robot

/remove-lifecycle stale

Feb 18 '24 06:02 pacoxu

Closing this, as the fix is being tracked in #121181.

/close

Mar 06 '24 18:03 AnishShah

@AnishShah: You can't close an active issue/PR unless you authored it or you are a collaborator.

Mar 06 '24 18:03 k8s-ci-robot
