
relax ES checks in /api/v1/probes/startup

Open yolabingo opened this issue 1 year ago • 1 comments

Problem Statement

/api/v1/probes/startup returns 503 in situations when it should not. We are not sure of the details. It seems to occur if there is not an Active index, and possibly if the Active index is empty.

We use this endpoint for the startupProbe in Kubernetes to determine when a new pod is healthy. Until it succeeds, the new pod is not fully in service. If the server is running a long-running reindex, Kubernetes will terminate the pod while the reindex is in progress, because the reindex time exceeds the default startup timeout we set for this probe. In a clustered environment, the first pod needs to be healthy before k8s will start other pods in the cluster.
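To make the timing problem concrete, here is a rough sketch of the arithmetic. Kubernetes restarts a container once its startup probe has failed `failureThreshold` times in a row, with attempts spaced `periodSeconds` apart, so the startup budget is roughly `initialDelaySeconds + failureThreshold * periodSeconds`. The function names and the numbers below are hypothetical illustrations, not the values used in our deployments.

```python
def startup_budget_seconds(failure_threshold: int,
                           period_seconds: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a pod has to pass its startup probe before it is killed."""
    return initial_delay_seconds + failure_threshold * period_seconds


def reindex_outlives_probe(reindex_seconds: int,
                           failure_threshold: int,
                           period_seconds: int,
                           initial_delay_seconds: int = 0) -> bool:
    """True if a reindex of the given length would outlast the startup budget."""
    budget = startup_budget_seconds(failure_threshold, period_seconds,
                                    initial_delay_seconds)
    return reindex_seconds > budget


# Hypothetical settings: 30 attempts, 10 seconds apart, and a one-hour reindex.
print(startup_budget_seconds(30, 10))
print(reindex_outlives_probe(3600, 30, 10))
```

If the probe stays at 503 for the whole reindex, any reindex longer than that budget gets the pod killed, which is the failure described above.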

Steps to Reproduce

Delete all indexes, restart dotCMS. /api/v1/probes/startup will return 503 until there is an Active index. Something like that anyway.

Acceptance Criteria

Not sure. Perhaps /api/v1/probes/startup can verify that Elasticsearch is reachable, without checking on the health of the Active index.

dotCMS Version

LTS and current

Proposed Objective

Reliability

Proposed Priority

Priority 2 - Important

External Links... Slack Conversations, Support Tickets, Figma Designs, etc.

https://dotcms.slack.com/archives/C072VF6R2JC/p1724294214346659

Assumptions & Initiation Needs

No response

Quality Assurance Notes & Workarounds

No response

Sub-Tasks & Estimates

No response

yolabingo avatar Aug 22 '24 19:08 yolabingo

Discussion of this with @wezell https://dotcms.slack.com/archives/C06TM536N9J/p1724357695544169

yolabingo avatar Aug 23 '24 14:08 yolabingo

NOTE TO QA/DEVS:

The updated JSON response coming from the probe is:

{
    "backendHealthy": true,
    "dotCMSHealthy": true,
    "frontendHealthy": true,
    "subsystems": {
        "assetFSHealthy": true,
        "cacheHealthy": true,
        "dbSelectHealthy": true,
        "esHealthy": true,
        "localFSHealthy": true
    }
}

Notice how there's only one ES-related check now: esHealthy
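For anyone scripting against this response, a minimal sketch of how the flags could be read is below. It assumes the payload has exactly the shape shown above (top-level booleans plus a nested `subsystems` object); `failing_checks` is a hypothetical helper name, not something that exists in dotCMS.

```python
import json

SAMPLE = """
{
    "backendHealthy": true,
    "dotCMSHealthy": true,
    "frontendHealthy": true,
    "subsystems": {
        "assetFSHealthy": true,
        "cacheHealthy": true,
        "dbSelectHealthy": true,
        "esHealthy": true,
        "localFSHealthy": true
    }
}
"""


def failing_checks(payload: dict) -> list:
    """Return the names of every flag in the probe payload that is not true."""
    failing = []
    for name, value in payload.items():
        if name == "subsystems":
            # Nested flags are reported with a "subsystems." prefix.
            failing += [f"subsystems.{k}" for k, v in value.items() if v is not True]
        elif value is not True:
            failing.append(name)
    return sorted(failing)


print(failing_checks(json.loads(SAMPLE)))
```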

jcastro-dotcms avatar Sep 10 '24 21:09 jcastro-dotcms

Passed IQA:

  • Bringing down Opensearch/ES makes the endpoint {{serverURL}}/api/v1/probes/startup return 503 without any response body
  • Bringing Opensearch/ES back up makes the endpoint {{serverURL}}/api/v1/probes/startup return 200 with a full JSON response
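The two cases above can be checked from a script with something like the following sketch. It uses only the Python standard library; `interpret_probe` and `check_startup` are hypothetical helper names, and treating anything other than a 200 with a JSON object as "not ready" is an assumption based on these QA notes rather than a documented contract.

```python
import json
import urllib.error
import urllib.request
from typing import Optional, Tuple


def interpret_probe(status: int, body: bytes) -> Tuple[bool, Optional[dict]]:
    """Map an HTTP status and body to (ready, payload)."""
    if status != 200:
        # Per the notes above, the 503 case carries no response body.
        return False, None
    try:
        payload = json.loads(body)
    except ValueError:
        return False, None
    if not isinstance(payload, dict):
        return False, None
    return True, payload


def check_startup(server_url: str, timeout: float = 5.0) -> Tuple[bool, Optional[dict]]:
    """Call the startup probe once and interpret the result."""
    url = server_url.rstrip("/") + "/api/v1/probes/startup"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_probe(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx statuses such as 503.
        return interpret_probe(err.code, err.read())
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return False, None
```

Calling `check_startup` with the value of {{serverURL}} before and after stopping Opensearch/ES should flip the first element of the result between true and false.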

dsilvam avatar Sep 11 '24 16:09 dsilvam

Approved: Tested on trunk_ab4b1df, Docker, macOS 14.5, FF v126.0.1

josemejias11 avatar Sep 11 '24 21:09 josemejias11

Internal QA: PASSED

  • Docker Image : trunk_49902f3

The ES Health Check must return true under these circumstances:

  • The index is present and active.
  • The index is de-activated.
  • The index doesn't exist at all.

It will return the expected 503 when the ES server is down completely.
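The expected behavior can be summarized as a small truth table. The sketch below is only a model of the rule described in these notes, not the actual dotCMS implementation; the names are hypothetical, and it assumes every other subsystem check passes.

```python
from enum import Enum


class IndexState(Enum):
    ACTIVE = "present and active"
    DEACTIVATED = "de-activated"
    MISSING = "does not exist"


def es_healthy(server_reachable: bool, index_state: IndexState) -> bool:
    """Model of the relaxed check: only ES reachability matters."""
    # index_state is deliberately ignored; that is the point of the change.
    return server_reachable


def startup_status(server_reachable: bool, index_state: IndexState) -> int:
    """HTTP status the probe is expected to return, other checks being healthy."""
    return 200 if es_healthy(server_reachable, index_state) else 503


for state in IndexState:
    print(state.value, startup_status(True, state), startup_status(False, state))
```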

jcastro-dotcms avatar Sep 17 '24 22:09 jcastro-dotcms

Approved: Tested on trunk_702df61, Docker, macOS 14.5, FF v126.0.1

josemejias11 avatar Sep 20 '24 17:09 josemejias11