
[BUG]: CrashLoopBackOff and OOMKilled issue in pod : Dell CSM Operator Manager POD

Status: Open. jpand15 opened this issue 11 months ago • 14 comments

Bug Description

Hi Team, the customer is experiencing CSM Operator controller pod restarts due to an OOMKilled error. They increased the memory limit from 256Mi to 512Mi but are still seeing restarts.

Logs

CSM logs.txt

Screenshots

No response

Additional Environment Information

No response

Steps to Reproduce

The customer tried redeploying and the issue still arises. They are also experiencing the same error at a different site.

Expected Behavior

CSM Pods should not be restarting

CSM Driver(s)

Dell CSM Operator 1.3.0

Installation Type

CSM Operator

Container Storage Modules Enabled

No response

Container Orchestrator

OpenShift 4.12

Operating System

Red Hat CoreOS (RHCOS)

jpand15 (Mar 21 '24)

/sync

atye (Mar 21 '24)

/sync

hoppea2 (Mar 25 '24)

link: 22097

csmbot (Mar 25 '24)

Hi Team, is there a workaround to increase the memory limit beyond 512Mi?

jpand15 (Apr 08 '24)

@jpand15 -- sorry for the late response. Looked into this yesterday with @daniyaliqbal2024, and we have not been able to replicate the issue on an internal 4.12 setup. There were no details provided on what the operator is managing, but based on the logs, it looks to me like it is csi-powerstore with resiliency enabled -- is that correct? If you could provide the following, that would be helpful:

  • Output of oc get csm -A
  • Output of oc get csm <name> -n <namespace> -o yaml for each CSM object on the cluster
  • A resource utilization graph from OpenShift for the operator, similar to one a customer provided in a previous escalation

A sketch of the oc commands is included below for convenience.
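In this sketch, <name> and <namespace> are placeholders for each CSM object and its namespace, and <operator-namespace> is wherever the operator manager pod runs; the last command is optional but gives a quick CLI view of current usage:

```shell
# List every CSM custom resource on the cluster
oc get csm -A

# Dump the full spec and status for each CSM object found above
oc get csm <name> -n <namespace> -o yaml

# Optional: spot-check current memory usage of the operator pod per container
oc adm top pod -n <operator-namespace> --containers
```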

Also, is the customer's OpenShift setup constraining node memory in a way that causes the OOM killer to run because the node itself is running past its memory limit? One thing I noticed is that the resource request for the operator is set to 192MiB, and, per the Kubernetes documentation, the request is the value the scheduler uses to select a node for the operator. Is it possible that the operator is getting scheduled on a node that has 192MiB of memory available, but not 512MiB? In that case the OOM killer would kill the operator not because it exceeds its 512MiB limit, but because it exceeds the memory actually available on a node that had room for 192MiB but not 256MiB or 512MiB. You could experiment with this by increasing the request from 192MiB to 512MiB as a starting point. Also, if you are able to provide the resource metrics graph requested above, it would help clarify this.

We would be happy to jump on a call with the customer if that helps -- we have not been able to replicate this locally, so it would be interesting to see the setup where the issue is occurring.

jooseppi-luna (Apr 10 '24)

Also, to answer your question -- you can update the operator.yaml file with different resource limits/requests and reinstall with that.
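For reference, here is a minimal sketch of what that change might look like for the manager container in operator.yaml, using the request/limit values discussed in this thread; the surrounding structure of the file is omitted and may differ between operator versions:

```yaml
# Sketch only: memory portion of the manager container's resources block in
# operator.yaml. 192Mi/256Mi are the defaults mentioned earlier in this thread.
resources:
  requests:
    memory: 512Mi   # raised from 192Mi so the scheduler reserves enough headroom
  limits:
    memory: 512Mi   # raised from 256Mi; can go higher (e.g. 1Gi) if spikes persist
```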

jooseppi-luna (Apr 10 '24)

Hi Team, thank you for your reply. I have asked the customer to try increasing the memory request from 192Mi to 512Mi and will update with the findings. The requested logs are also attached.

CitadelProd1.zip DavaoProd1.zip

Citadel Site: [resource utilization graph attached]

Davao Site: [resource utilization graph attached]

jpand15 (Apr 12 '24)

@jpand15 thanks for the update -- unfortunately, I'm having a little trouble understanding the graphs. Can you give me an example timestamp for one of them where an OOM kill happens? To me everything looks pretty steady -- I'm not seeing a sudden drop/jump in usage.

jooseppi-luna (Apr 12 '24)

Hi Jooseppi, the customer has just updated that they are still experiencing CSM pod restarts after increasing the memory request to 512Mi.

[screenshot attached]

jpand15 (Apr 16 '24)

@jpand15 thanks for the update. As we are unable to reproduce this error locally, I'm not sure how much more I can offer in terms of suggestions. If you'd like, I'm happy to jump on a call with the customer and take a closer look at the problem and get a better understanding of their setup.

jooseppi-luna (Apr 16 '24)

Sure, I'll try to set up a call with them. What times are you available in your timezone? We are in the APJ region, by the way.

jpand15 (Apr 16 '24)

@jpand15 I can be free as early as 5:30 AM Eastern Daylight Time (I don't mind getting up early). Feel free to ping me on slack or email me at [email protected] and we can find a time!

jooseppi-luna (Apr 16 '24)

Copy, @jooseppi-luna. I will check with the customer on their availability and advise you of the schedule.

jpand15 (Apr 16 '24)

@panigs7 and I met with the customer this morning to get a better understanding of their setup and the problem. Here is what we found:

  • The customer has a pre-prod setup with 2 master nodes and ~13 worker nodes. Increasing the memory limit to 512MiB on this system stopped the operator restarts. The resting memory consumption of the operator is around 350MiB.
  • The customer's production setup has 49 worker nodes. Raising the memory limit to 512MiB on this setup may have made the restarts less frequent (this was unclear), but it did not eliminate them. The resting memory consumption of this instance is also around 350MiB.
  • There is no sign of a memory leak. The memory usage is fairly constant over the long term; there is no evidence of the memory creeping up to 512MiB, the pod getting killed and returning to a lower level, only to rise again. This makes it look like an occasional spike in memory demand exceeding 512MiB is causing the operator to restart. Since the refresh period on the memory usage metrics is 15 seconds, it is very possible that these events are being missed in the usage graph (a sketch of one way to check for this directly follows this list).
  • The overall memory consumption on the node that the operator is running on is very low, so the operator is not getting killed because the node as a whole is running out of memory.
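One way to confirm that these transient spikes really end in OOM kills, even when the 15-second metrics refresh misses them, is to read the container's last termination state, which Kubernetes records regardless of the metrics sampling rate. In this sketch, <pod> and <namespace> are placeholders for the operator pod and its namespace:

```shell
# Report why each container in the operator pod last terminated; an OOM kill
# shows up here as "OOMKilled" even if the usage graph never caught the spike.
oc get pod <pod> -n <namespace> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{"\n"}{end}'

# Restart count and termination details for the same pod
oc describe pod <pod> -n <namespace> | grep -E 'Restart Count|OOMKilled|Reason'
```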

Since the operator is processing events from all of the pods associated with the deployment, it makes sense that a system with a larger number of nodes (and therefore pods) would require higher memory limits. Based on this, and the fact that 512MiB was enough on the smaller system but not the larger one, we recommended that the customer raise the memory limit to 1024MiB. It could be that the operator just requires more memory on systems with a high number of nodes.
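While the customer tests this, one quick way to apply the higher value is sketched below. This assumes the operator deployment is directly editable (an OLM-managed install could revert manual changes) and uses placeholders for the deployment and namespace names; updating operator.yaml as described earlier in the thread would make the change persistent:

```shell
# Raise the operator's memory request and limit to 1Gi in place.
# <deployment> and <namespace> are placeholders; the container name "manager"
# is an assumption, so verify it with: oc -n <namespace> get deploy <deployment> -o yaml
oc -n <namespace> set resources deployment/<deployment> \
  --containers=manager \
  --requests=memory=1Gi --limits=memory=1Gi
```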

@jpand15 will follow up with the customer to see if raising the memory limit to 1024MiB fixes the issue.

In the future, we may also want to investigate this on our own to determine if we can replicate this issue, and if this memory usage is acceptable or excessive. If it is excessive, we should identify ways to limit it.

jooseppi-luna (Apr 23 '24)

@jpand15 : Please reopen the issue if it still persists

shanmydell (May 09 '24)