helm-charts icon indicating copy to clipboard operation
helm-charts copied to clipboard

[Enhancement][opensearch][opensearch-dashboards] Lifecycle Probes

Open ghost opened this issue 3 years ago • 15 comments

Is your feature request related to a problem? Please describe. I think there ought to be liveness/readiness/startup probes for the opensearch statefulset and the opensearch-dashboards deployment.

Currently the pods are immediately marked ready on startup, leading to more than one of the nodes being unavailable which can lead to instability especially with lots of data.

Describe the solution you'd like I am happy to implement something in a PR to resolve this, if that is something the project would want and my general ideas below sound good.

Opensearch: Readiness: query health endpoint, which may require whitelisting or auth, could potentially look for green cluster status Liveness: can check for tcp connection to 9200 or 9300 Startup: can check for health endpoint regardless of status?

Opensearch-dashboards: Readiness: Health is at least yellow in /api/status Liveness: 5601 is open to tcp Startup: same as readiness

Describe alternatives you've considered NA

Additional context I am happy to implement if the ideas sound good.

ghost avatar Aug 17 '21 18:08 ghost

I completely agree to this and it is kind of compulsory. I also wanted to have something like this included in readiness as well for opensearch apart from the ones mentioned.

TheAlgo avatar Aug 17 '21 18:08 TheAlgo

That would be clever, do we want to include a sane default, or would we want to leave it disabled until the user chooses to enable it?

ghost avatar Aug 17 '21 18:08 ghost

That would be clever, do we want to include a sane default, or would we want to leave it disabled until the user chooses to enable it?

What if we can have options to choose between different probes. As in have a basic one and some advanced one as well. Just thinking to make the chart as robust as possible.

TheAlgo avatar Aug 17 '21 18:08 TheAlgo

Transfer to Helm-Charts repo for new PR #43

peterzhuamazon avatar Sep 10 '21 20:09 peterzhuamazon

Regarding values.yaml and examples of lifecycle.postStart There is an example for opensearch, but none for opensearch-dashboards pr now.

For opensearch-dashboards I've implemented this, and it works. Dashboards takes over 60 seconds to start, depending on your HW of course, but leave plenty of leeway in the TIMEOUT setting.

lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "echo $(date +%F-%T) This is the end."]
  postStart:
    exec:
      command:
        - bash
        - -c
        - |
          #!/bin/bash
          # Wait for the application to start
          # NB! Update http / https if you have / don't have https.
          # Set a contextroot if you add one in opensearch_dashboards.yml / server.basePath
          URL=http://127.0.0.1:5601/<context-root-if-defined>
          TIMEOUT=180
          #
          # Wait until the server responds
          #
          timeout ${TIMEOUT} bash -c 'while ! [[ "$(curl -s -o /dev/null -w %{http_code}\\n '$URL')" =~ ^[23][0-9][0-9]$ ]]; do sleep 1; done'

(edited comment for clarity)

sastorsl avatar Nov 02 '21 18:11 sastorsl

@sastorsl As far as I know the probes that are added are not present in the statefulset. I will take a look and fix it.

TheAlgo avatar Nov 03 '21 07:11 TheAlgo

@TheAlgo I see my previous comment was unclear. There is no example in values.yaml for opensearch-dashboards, but the above example addition to values.yaml works just fine, and the postStart is added and works quite well.

Ref https://github.com/opensearch-project/helm-charts/blob/main/charts/opensearch-dashboards/templates/deployment.yaml#L130

liveness and readiness probes would be a nice addition though.

sastorsl avatar Nov 03 '21 08:11 sastorsl

@sastorsl do you mind I link this issue to #117 and close this issue when that PR merged?

Thanks.

peterzhuamazon avatar Nov 03 '21 18:11 peterzhuamazon

@sastorsl do you mind I link this issue to #117 and close this issue when that PR merged?

Fine by me, however I'm not OP though...

sastorsl avatar Nov 03 '21 18:11 sastorsl

@sastorsl do you mind I link this issue to #117 and close this issue when that PR merged?

Fine by me, however I'm not OP though...

You guys have similar profile identicons...... @hgoscenski-imanage please check ^^

peterzhuamazon avatar Nov 03 '21 18:11 peterzhuamazon

No worries, I think we cleared this one up nicely :-)

sastorsl avatar Nov 03 '21 19:11 sastorsl

@sastorsl do you mind I link this issue to #117 and close this issue when that PR merged?

What about opensearch chart and liveness and readiness probes? That PR doesn't cover these parts.

Here I want to explain shortly the issue we experienced today with opensearch chart:

After deploying opensearch chart in ansible we request /_cluster/health to check if number_of_nodes == 3 and status == 'green' and it failed, because number_of_nodes was actually 2. When we checked the pods, at first everything looked fine

kubectl get pods
                                                      
NAME                                                          READY   STATUS    RESTARTS   AGE
opensearch-cluster-master-0                                   1/1     Running   0          7h43m
opensearch-cluster-master-1                                   1/1     Running   0          7h42m
opensearch-cluster-master-2                                   1/1     Running   0          7h42m
opensearch-dashboard-opensearch-dashboards-6c4777c7f6-ltn6q   1/1     Running   0          7h40m

but then when we looked at the logs of master pods, we found out that master-2 was not healthy

kubectl logs  opensearch-cluster-master-2 -f

Killing opensearch process 9
Killing performance analyzer process 10

We restarted master-2 pod, then number_of_nodes became 3 and then our deployment finished successfully.

That's why I think we should keep this ticket and implement liveness and readiness probes as well.

bitnik avatar Nov 09 '21 08:11 bitnik

@bitnik I agree with you. Readiness is a must for production use cases, it can be a basic one but it is must. It already exists in opensearch chart but it is not properly refactored so is not usable. I will try to fix it.

TheAlgo avatar Nov 09 '21 10:11 TheAlgo

Any news? :)

sandervandegeijn avatar May 30 '22 21:05 sandervandegeijn

Guys? :)

sandervandegeijn avatar Mar 16 '23 22:03 sandervandegeijn