core
relax ES checks in /api/v1/probes/startup
Problem Statement
/api/v1/probes/startup returns 503 in situations where it should not. We are not sure of the details, but it seems to occur when there is no Active index, and possibly when the Active index is empty.
We use this endpoint for the startupProbe in Kubernetes to determine when a new pod is healthy. Until it succeeds, the new pod is not fully in service. If the server is running a long-running reindex, Kubernetes will terminate the pod while the reindex is in progress, because the reindex time will exceed the default startup timeout we set for this probe. In a clustered environment, the first pod needs to be healthy before k8s will start the other pods in the cluster.
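To make the timeout concern concrete, here is a rough sketch of the arithmetic. Kubernetes gives a container approximately `initialDelaySeconds + failureThreshold * periodSeconds` to pass its startupProbe before restarting it. The probe settings and the reindex duration below are illustrative assumptions, not the values used in any dotCMS deployment, and the function names are hypothetical.

```python
def startup_budget_seconds(failure_threshold, period_seconds, initial_delay_seconds=0):
    """Approximate time a pod has to pass its startupProbe before it is restarted."""
    return initial_delay_seconds + failure_threshold * period_seconds


def reindex_outlives_probe(reindex_seconds, budget_seconds):
    """True if a probe that keeps failing for the whole reindex gets the pod killed."""
    return reindex_seconds > budget_seconds


# Hypothetical values, for illustration only.
budget = startup_budget_seconds(failure_threshold=30, period_seconds=10)
killed = reindex_outlives_probe(reindex_seconds=2 * 60 * 60, budget_seconds=budget)
print(budget, killed)
```

With these assumed numbers the budget is five minutes, so a two-hour reindex would outlast it if the probe keeps returning 503 while the reindex runs.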
Steps to Reproduce
Delete all indexes, restart dotCMS. /api/v1/probes/startup will return 503 until there is an Active index. Something like that anyway.
Acceptance Criteria
Not sure. Perhaps /api/v1/probes/startup can verify that Elasticsearch is reachable, without checking on the health of the Active index.
dotCMS Version
LTS and current
Proposed Objective
Reliability
Proposed Priority
Priority 2 - Important
External Links... Slack Conversations, Support Tickets, Figma Designs, etc.
https://dotcms.slack.com/archives/C072VF6R2JC/p1724294214346659
Assumptions & Initiation Needs
No response
Quality Assurance Notes & Workarounds
No response
Sub-Tasks & Estimates
No response
Discussion of this with @wezell https://dotcms.slack.com/archives/C06TM536N9J/p1724357695544169
NOTE TO QA/DEVS:
The updated JSON response coming from the probe is:
{
  "backendHealthy": true,
  "dotCMSHealthy": true,
  "frontendHealthy": true,
  "subsystems": {
    "assetFSHealthy": true,
    "cacheHealthy": true,
    "dbSelectHealthy": true,
    "esHealthy": true,
    "localFSHealthy": true
  }
}
Notice how there's only one ES-related check now: esHealthy
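For anyone scripting a check against this response, a minimal sketch follows. It parses the body quoted above and lists any flag that is not `true`, plus the ES-related subsystem flags. The helper names `failed_checks` and `es_checks` are made up for this example and are not part of dotCMS.

```python
import json

# The response body quoted in the note above.
SAMPLE_BODY = """
{
  "backendHealthy": true,
  "dotCMSHealthy": true,
  "frontendHealthy": true,
  "subsystems": {
    "assetFSHealthy": true,
    "cacheHealthy": true,
    "dbSelectHealthy": true,
    "esHealthy": true,
    "localFSHealthy": true
  }
}
"""


def failed_checks(body):
    """Return the names of all flags in the probe response that are not true."""
    payload = json.loads(body)
    subsystems = payload.get("subsystems", {})
    failed = [name for name, ok in payload.items()
              if name != "subsystems" and ok is not True]
    failed += ["subsystems." + name
               for name, ok in subsystems.items() if ok is not True]
    return failed


def es_checks(body):
    """Return the subsystem flags whose names start with 'es'."""
    subsystems = json.loads(body).get("subsystems", {})
    return [name for name in subsystems if name.startswith("es")]


print(failed_checks(SAMPLE_BODY))
print(es_checks(SAMPLE_BODY))
```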
Passed IQA:
- Bringing down OpenSearch/ES makes the endpoint {{serverURL}}/api/v1/probes/startup return 503 without any response body
- Bringing OpenSearch/ES back up makes the endpoint {{serverURL}}/api/v1/probes/startup return 200 with a full JSON response
Approved: Tested on trunk_ab4b1df, Docker, macOS 14.5, FF v126.0.1
Internal QA: PASSED
- Docker Image: trunk_49902f3
The ES Health Check must return true under these circumstances:
- The index is present and active.
- The index is de-activated.
- The index doesn't exist at all.
It will return the expected 503 when the ES server is down completely.
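The expected behaviour listed above can be written down as a small truth table. This is only a sketch of the rule as described in this issue, not the actual dotCMS implementation: `IndexState`, `es_healthy`, and `startup_status` are hypothetical names, and the sketch assumes every other subsystem is healthy.

```python
from enum import Enum


class IndexState(Enum):
    ACTIVE = "active"            # index is present and active
    DEACTIVATED = "deactivated"  # index exists but is de-activated
    MISSING = "missing"          # index does not exist at all


def es_healthy(server_reachable, index_state):
    """Relaxed check: only the reachability of the ES server matters.

    index_state is accepted but deliberately ignored, which is the point
    of the change described in this issue.
    """
    return server_reachable


def startup_status(server_reachable, index_state):
    """HTTP status the probe is expected to return (other subsystems assumed healthy)."""
    return 200 if es_healthy(server_reachable, index_state) else 503


for state in IndexState:
    print(state.value, startup_status(True, state), startup_status(False, state))
```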
Approved: Tested on trunk_702df61, Docker, macOS 14.5, FF v126.0.1