
I regularly have a problem with nodes going NotReady.

empath-nirvana opened this issue 2 years ago • 5 comments

Image I'm using: ami-008f64e1aa6e8255a

What I expected to happen:

Kubelet to be stable

What actually happened:

Kubelet stops posting status updates.

How to reproduce the problem:

I wish I could reproduce at will, but I'll try to describe what happens:

Right before kubelet stops posting updates, total CPU spikes to close to 100% of the CPU available on the node. iowait also climbs, and kubelet's own CPU usage rises to about 17%. There doesn't seem to be any memory pressure, though; memory usage stays stable. On the attached system volumes, read IOPS also shoot up to essentially the limit, and then the node becomes unresponsive. SSM doesn't work either. Rebooting the node usually resolves the issue, after which I can log in and look at the logs. What I see in the kubelet logs is a lot of 'context deadline exceeded' errors and not much else. The containerd logs also show 'context deadline exceeded'.

empath-nirvana avatar Nov 13 '23 22:11 empath-nirvana

Thanks for opening this issue, I'll take a look into it. Do you mind providing a bit more information about your setup? In particular, it'd be useful to know:

  • What instance type are you using?
  • Are there any notable non-default settings that you are providing?

'context deadline exceeded' is the error emitted when a timeout set via Go's context package is reached. I'm wondering if we hit some unfortunate resource constraints with certain instance types or configurations and fail to meet those timeouts. Another possibility is a transient issue with the k8s API, but it's hard to say without more details.

cbgbt avatar Nov 15 '23 18:11 cbgbt

I spoke with @stockholmux separately who told me that this is reproducible with m6i and c6i instance families.

@empath-nirvana when I first read this, I assumed that kubelet was failing to start. I'm realizing that you actually mean that it seems to spontaneously stop posting updates after your service has been running for some period of time. Is that right?

Is there any particular resource utilization pattern in your application that seems to trigger this? Separately, is there anything in the systemd journal that could indicate what may be using all of the read IOPS? I wonder if kubelet is starved for disk access.

cbgbt avatar Nov 21 '23 18:11 cbgbt

@cbgbt I'm lurking in the comments here as I saw your comment above around the kubelet failing to start, which is the problem I’m having on EKS with a cluster version of 1.28 and m6i instances. Is there another issue you’d recommend looking at that might shed some light on such issues?

VariableExp0rt avatar Nov 22 '23 07:11 VariableExp0rt

@VariableExp0rt Thanks for letting us know. Since it sounds like this might be a subtly different issue, would you mind opening an issue? It would be very helpful if you could share any relevant kubelet logs or settings that you are changing.

cbgbt avatar Nov 22 '23 17:11 cbgbt