[Enhancement][opensearch][opensearch-dashboards] Lifecycle Probes
Is your feature request related to a problem? Please describe.
I think there ought to be liveness/readiness/startup probes for the opensearch statefulset and the opensearch-dashboards deployment.
Currently the pods are marked ready immediately on startup, which can leave more than one node unavailable at the same time and lead to instability, especially with a lot of data.
Describe the solution you'd like
I am happy to implement something in a PR to resolve this, if that is something the project would want and my general ideas below sound good.
Opensearch:
- Readiness: query the health endpoint (may require whitelisting or auth); could potentially look for green cluster status
- Liveness: check for a TCP connection to 9200 or 9300
- Startup: check the health endpoint, regardless of status?
Opensearch-dashboards:
- Readiness: health is at least yellow in /api/status
- Liveness: 5601 is open to TCP
- Startup: same as readiness
Describe alternatives you've considered
N/A
Additional context
I am happy to implement if the ideas sound good; a rough sketch is included below.
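As an illustration of the proposals above, the probes on the opensearch statefulset could look roughly like this. The ports, paths, and thresholds are assumptions rather than chart defaults, and a cluster with the security plugin enabled would also need scheme: HTTPS and credentials (or an exec probe using curl):

startupProbe:
  httpGet:
    path: /_cluster/health   # any response means the HTTP layer is up, regardless of cluster status
    port: 9200
  periodSeconds: 10
  failureThreshold: 30        # up to 30 * 10s = 5 minutes for a node to come up
readinessProbe:
  httpGet:
    path: /_cluster/health    # a green-status check would need an exec probe that inspects the body
    port: 9200
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 9200                # or 9300 for the transport layer
  periodSeconds: 20

Whether readiness should wait for green cluster status is debatable, since a cluster-wide condition can mark every pod unready at once.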
I completely agree with this; it is more or less compulsory. I also wanted to have something like this included in readiness for opensearch, in addition to the checks mentioned above.
That would be clever. Do we want to include a sane default, or would we want to leave it disabled until the user chooses to enable it?
What if we had options to choose between different probes? As in, have a basic one and a more advanced one as well. I'm just thinking of how to make the chart as robust as possible.
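One hypothetical way to express both ideas in values.yaml is sketched below; the key names and defaults are purely illustrative and are not something the chart currently exposes:

# Hypothetical values.yaml layout; names and defaults are illustrative only
probes:
  enabled: false              # off until the user opts in, or true if a sane default is preferred
  liveness:                   # "basic": just check that the port is open
    tcpSocket:
      port: 9200
  readiness:                  # "advanced": hit the health endpoint over HTTP
    httpGet:
      path: /_cluster/health
      port: 9200

The templates would then render the probes onto the statefulset only when probes.enabled is true.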
Transfer to Helm-Charts repo for new PR #43
Regarding values.yaml and examples of lifecycle.postStart: there is an example for opensearch, but none for opensearch-dashboards.
pr now.
For opensearch-dashboards I've implemented this, and it works.
Dashboards takes over 60 seconds to start (depending on your hardware, of course), so leave plenty of leeway in the TIMEOUT setting.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "echo $(date +%F-%T) This is the end."]
  postStart:
    exec:
      command:
        - bash
        - -c
        - |
          #!/bin/bash
          # Wait for the application to start
          # NB! Update http / https if you have / don't have https.
          # Set a contextroot if you add one in opensearch_dashboards.yml / server.basePath
          URL=http://127.0.0.1:5601/<context-root-if-defined>
          TIMEOUT=180
          #
          # Wait until the server responds
          #
          timeout ${TIMEOUT} bash -c 'while ! [[ "$(curl -s -o /dev/null -w %{http_code}\\n '$URL')" =~ ^[23][0-9][0-9]$ ]]; do sleep 1; done'
(edited comment for clarity)
@sastorsl As far as I know, the probes that were added are not present in the statefulset. I will take a look and fix it.
@TheAlgo I see my previous comment was unclear.
There is no example in values.yaml for opensearch-dashboards, but the above example addition to values.yaml works just fine, and the postStart is added and works quite well.
Ref https://github.com/opensearch-project/helm-charts/blob/main/charts/opensearch-dashboards/templates/deployment.yaml#L130
Liveness and readiness probes would be a nice addition, though.
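As a sketch of what that could look like on the dashboards deployment (the paths, port, and thresholds are assumptions, and the startupProbe allowance reflects the 60+ second startup time mentioned above):

startupProbe:
  httpGet:
    path: /api/status
    port: 5601
  periodSeconds: 10
  failureThreshold: 20        # 20 * 10s = 200s of leeway for a slow start
readinessProbe:
  httpGet:
    path: /api/status         # checking for at least "yellow" would need an exec probe that parses the body
    port: 5601
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 5601
  periodSeconds: 20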
@sastorsl do you mind if I link this issue to #117 and close this issue when that PR is merged?
Thanks.
Fine by me, however I'm not the OP...
You guys have similar profile identicons... @hgoscenski-imanage please check ^^
No worries, I think we cleared this one up nicely :-)
What about the opensearch chart and liveness and readiness probes? That PR doesn't cover those parts.
Here I want to briefly explain the issue we experienced today with the opensearch chart:
After deploying the opensearch chart via ansible we request /_cluster/health to check whether number_of_nodes == 3 and status == 'green', and it failed because number_of_nodes was actually 2. When we checked the pods, at first everything looked fine:
kubectl get pods
NAME READY STATUS RESTARTS AGE
opensearch-cluster-master-0 1/1 Running 0 7h43m
opensearch-cluster-master-1 1/1 Running 0 7h42m
opensearch-cluster-master-2 1/1 Running 0 7h42m
opensearch-dashboard-opensearch-dashboards-6c4777c7f6-ltn6q 1/1 Running 0 7h40m
But when we looked at the logs of the master pods, we found that master-2 was not healthy:
kubectl logs opensearch-cluster-master-2 -f
Killing opensearch process 9
Killing performance analyzer process 10
We restarted the master-2 pod, then number_of_nodes became 3 and our deployment finished successfully.
That's why I think we should keep this ticket and implement liveness and readiness probes as well.
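For what it's worth, even a minimal liveness probe along these lines (a TCP check on 9200 is an assumption about a sensible default, not current chart behaviour) would have failed once the opensearch process on master-2 died, and Kubernetes would have restarted the container without manual intervention:

livenessProbe:
  tcpSocket:
    port: 9200                # fails as soon as the opensearch process stops listening
  initialDelaySeconds: 60
  periodSeconds: 20
  failureThreshold: 3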
@bitnik I agree with you. Readiness is a must for production use cases; it can be a basic one, but it is a must. It already exists in the opensearch chart, but it is not properly refactored, so it is not usable. I will try to fix it.
Any news? :)
Guys? :)