Mangle 3.0 Stability - Cassandra DB goes down suddenly

Open Anvesh42 opened this issue 3 years ago • 6 comments

Environment: OpenShift v4.6.36
Kubernetes Version: v1.19.0
Mangle Version: 3.0

Issue:

  1. The Cassandra DB goes down with failed connections, causing the Mangle pod to retry its connection to Cassandra repeatedly.
  2. The Mangle UI is unavailable for the entire duration of the outage.

Interim Solution Being Followed:

  1. Restart Cassandra POD
  2. Restart mangle POD
  3. Increase the resource limits in the Cassandra statefulset template, as recommended by the Mangle team during the working session.

Previous:

 - resources:
     limits:
       cpu: '1'
       memory: 8Gi
     requests:
       cpu: '500m'
       memory: 2Gi

Current:

 - resources:
     limits:
       cpu: '2'
       memory: 8Gi
     requests:
       cpu: '1'
       memory: 4Gi

Frequency Of This Issue: Once every few weeks, typically every 7-8 weeks, though the interval can be irregular.

Logs:

  1. Please find attached the logs from the Mangle and Cassandra pods, captured when this downtime last occurred in the last week of February 2022:

cassandra_pod_failure_0227.txt mangle_pod_failure_0227.txt

Deployment Templates:

  1. Please find attached the Cassandra statefulset and Mangle deployment templates: cassandra_statefulset_template.txt mangle_deployment_template.txt

Anvesh42, Mar 15 '22 15:03

Hi @Anvesh42, please let us know how stable the Cassandra pod is after increasing the resource limits.

rpraveen-vmware, Mar 16 '22 06:03

@rpraveen-vmware I have increased the resources in the Cassandra configuration as discussed during our session. I will monitor it for a few days and observe the stability. Thanks!

Anvesh42, Mar 16 '22 16:03

@ashrimalivmware @rpraveen-vmware Even after increasing the resource limits (as stated above), the Cassandra pod still goes down. Attaching the latest log: cassandra_04182022.txt.

Anvesh42, Apr 18 '22 18:04

@Anvesh42 How often does the Cassandra pod go down now, with the increased resource limits? cc: @ashrimalivmware

rpraveen-vmware, Apr 20 '22 14:04

@ashrimalivmware @rpraveen-vmware Can you please share the Dockerfiles for Mangle and Cassandra that were used to build the standard images?

Regarding Cassandra pod stability, I am exploring options to improve it.

Thanks, Anvesh

Anvesh42, Nov 22 '22 21:11

@ashrimalivmware @rpraveen-vmware

Continuing from my previous query in this thread, we would like to get some insight into the modifications we can make to prevent the Cassandra pod from going down frequently. Please let us know. Details are provided below.

Cassandra POD resources & ENV values:

image

Latest Cassandra Failure Log:

cassandra-0-1214.log

We also observe that the standard cassandra.yaml provided by VMware doesn't have a liveness probe. Could that be one of the reasons?

Thanks, Anvesh

Anvesh42, Dec 14 '22 16:12
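
On the liveness probe question: a probe would not address whatever is causing Cassandra to fail, but it would let Kubernetes restart the container automatically when it stops responding, instead of waiting for a manual restart. A minimal sketch, assuming the container serves CQL on Cassandra's default native transport port 9042; the timing values are illustrative assumptions and are not taken from the Mangle templates:

```yaml
# Hypothetical fragment for the Cassandra container in the statefulset template.
# Assumption: CQL native transport listens on the default port 9042.
livenessProbe:
  tcpSocket:
    port: 9042
  initialDelaySeconds: 120   # assumed: give Cassandra time to start before the first check
  periodSeconds: 30          # assumed: check every 30 seconds
  timeoutSeconds: 5
  failureThreshold: 3        # restart the container after 3 consecutive failed checks
```

If added, the fragment would sit under the Cassandra container entry in the statefulset template, at the same level as the resources block shown earlier in this thread. Too short an initial delay could cause restarts while Cassandra is still starting, so that value would need tuning against the actual startup time.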