Neo4J Helm Chart - StatefulSet Readiness/Startup Probe Reports Healthy Before All Databases Are Fully Started - Support Customizing the Neo4J Readiness/Startup Probes or Update Their Implementation to Address This
Is your feature request related to a problem? Please describe.
At JupiterOne, we are an enterprise customer of Neo4J and manage the deployment of Neo4J clusters on Kubernetes using the upstream Neo4J Helm chart. During rolling deployments/updates to our clusters (driven by the ArgoCD App of Apps pattern and ArgoCD sync waves), we are experiencing an issue that can lead to service outages for customers with large databases: the readiness probe for a Neo4J cluster member Kubernetes pod reports healthy before all databases are fully started and available. This premature healthy status causes the ArgoCD sync wave to progress. When multiple cluster members are updated in quick succession like this, a database can be left in a "starting" (rather than "online") state on multiple members at once, causing a loss of write quorum and a service outage (loss of writes and severely degraded read performance).
Additional Context
We are using the v5.20.0 version of the Neo4J helm chart.
We use Terraform + ArgoCD to deploy and manage our Neo4J clusters.
During an automated rolling deployment with ArgoCD and ArgoCD sync waves, the readiness probe will report a Neo4J cluster member k8s pod as healthy before all databases are fully started and available.
The readiness probe status is propagated up to the parent ArgoCD application, which reports the Neo4J cluster member as healthy and allows ArgoCD to progress to syncing the next child application.
For larger databases that can take 10-15 minutes to start, the readiness probe reports a cluster member pod as healthy while those databases are still starting.
In a 3-node cluster, we have seen all 3 nodes reported as healthy by the readiness probe while a (non-neo4j/system) database was still starting on all 3 nodes, which leads to a loss of quorum for that database and a complete service outage for it.
Further supporting details:
We use the ArgoCD app of apps pattern to deploy a root neo4j-cluster ArgoCD application, whose children are the members of a Neo4j cluster. Each child cluster member has an ArgoCD sync wave (https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/) specified, which lets us perform rolling deployments in order, waiting for each Neo4J cluster member to report healthy before moving on.
Our ArgoCD app of apps pattern is working well for deploying and managing our Neo4J clusters, but we are seeing an issue with the readiness probe for a Neo4j cluster member k8s pod.
Upon investigation, we found that the Neo4J helm chart does not currently support customizing or overriding the readiness, startup, or liveness probes (https://github.com/neo4j/helm-charts/blob/dev/neo4j/templates/neo4j-statefulset.yaml#L158); the probes are hardcoded to a TCP check on the bolt port of the Neo4j cluster member pod. A TCP check on the bolt endpoint does not take into consideration whether all non-neo4j/system databases have started.
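For reference, the rendered probes amount to something like the following. This is an illustrative sketch based on the linked template; the thresholds and timings are approximate, not copied verbatim:

```yaml
# Illustrative sketch of the probes the chart renders today (approximate values).
readinessProbe:
  tcpSocket:
    port: 7687          # bolt port - passes as soon as the port accepts TCP connections
  periodSeconds: 5
  timeoutSeconds: 10
  failureThreshold: 20
startupProbe:
  tcpSocket:
    port: 7687          # same TCP check, so "started" does not mean "all databases online"
  periodSeconds: 5
  failureThreshold: 1000
```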
As a result, a Neo4j cluster member pod is reported healthy by the readiness probe before all of its databases are fully started and available. The bolt port evidently becomes available before the databases finish starting, which is why the probe passes early. We consider the current readiness probe dangerous: once it reports a pod as healthy, the ArgoCD sync wave proceeds to the next pod in the deployment, even though databases on the previous pod may still be starting.
An example of our ArgoCD app of apps pattern for managing a Neo4j cluster, neo4j-cluster-001-dev, looks like the following:
neo4j-cluster-001-dev - neo4j-cluster ArgoCD application (internal) - ArgoCD Neo4J app of apps
- neo4j-cluster-001-p-001-dev - Neo4J cluster member - upstream helm chart: https://github.com/neo4j/helm-charts/blob/dev/neo4j
- neo4j-cluster-001-p-002-dev - Neo4J cluster member - upstream helm chart: https://github.com/neo4j/helm-charts/blob/dev/neo4j
- neo4j-cluster-001-p-003-dev - Neo4J cluster member - upstream helm chart: https://github.com/neo4j/helm-charts/blob/dev/neo4j
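For concreteness, each child in this tree is an ArgoCD Application carrying a sync-wave annotation, along the lines of the trimmed, hypothetical manifest below (the names, namespaces, and release name are placeholders for our setup):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: neo4j-cluster-001-p-001-dev
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "1"       # p-002 uses "2", p-003 uses "3"
spec:
  project: default
  source:
    repoURL: https://helm.neo4j.com/neo4j   # upstream Neo4J chart repository
    chart: neo4j
    targetRevision: 5.20.0
    helm:
      releaseName: neo4j-cluster-001-p-001-dev
  destination:
    server: https://kubernetes.default.svc
    namespace: neo4j
  syncPolicy:
    automated: {}
```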
For example, assume we have updated Neo4j configuration settings for this cluster:
We make an update to the neo4j-cluster root ArgoCD application, which triggers a rolling deployment of the cluster members by the ArgoCD application controller.
Rolling deployment process:
- ArgoCD syncs the update for neo4j-cluster-001-p-001-dev and waits for the Neo4j container readiness probe to report healthy.
The readiness probe reports the pod as healthy before all databases are fully started and available, as indicated by SHOW DATABASES.
The application health status is propagated up to the parent ArgoCD application, which reports the neo4j-cluster-001-p-001-dev member as healthy and allows ArgoCD to progress to syncing the next child application: neo4j-cluster-001-p-002-dev.
SHOW DATABASES output for neo4j-cluster-001-p-001-dev after the readiness probe reports the pod as healthy:
Databases:
neo4j:
requestedStatus: online
currentStatus: online
system:
requestedStatus: online
currentStatus: online
some-large-database:
requestedStatus: online
currentStatus: starting
- ArgoCD syncs the update for neo4j-cluster-001-p-002-dev and waits for the Neo4j container readiness probe to report healthy (meanwhile, some-large-database is still starting on cluster member neo4j-cluster-001-p-001-dev).
The neo4j-cluster-001-p-002-dev k8s pod is restarted to apply the update.
Kubernetes evaluates the readiness probe for the neo4j-cluster-001-p-002-dev pod and reports it as healthy before all databases are fully started and available, as indicated by SHOW DATABASES.
At this point, the neo4j-cluster-001-p-002-dev pod is considered healthy by the readiness probe, but not all of its databases are fully started and available.
SHOW DATABASES output for neo4j-cluster-001-p-002-dev after the readiness probe reports the pod as healthy:
Databases:
neo4j:
requestedStatus: online
currentStatus: online
system:
requestedStatus: online
currentStatus: online
some-large-database:
requestedStatus: online
currentStatus: starting
Reads available: yes (degraded - high latency - instances overloaded)
Writes available: no
The application health status is propagated up to the parent ArgoCD application, which reports the neo4j-cluster-001-p-002-dev member as healthy as well.
Lost quorum: 2/3 instances - neo4j-cluster-001-p-001-dev and neo4j-cluster-001-p-002-dev are not available for database some-large-database.
some-large-database cluster members available: 1/3 (neo4j-cluster-001-p-003-dev is still available).
Because the readiness probe reports the neo4j-cluster-001-p-002-dev pod as healthy, ArgoCD progresses to syncing the next child application: neo4j-cluster-001-p-003-dev.
- ArgoCD syncs the update for neo4j-cluster-001-p-003-dev and waits for the Neo4j container readiness probe to report healthy (meanwhile, some-large-database is still starting on cluster members neo4j-cluster-001-p-001-dev and neo4j-cluster-001-p-002-dev, and will soon be starting on neo4j-cluster-001-p-003-dev as well).
The neo4j-cluster-001-p-003-dev k8s pod is restarted to apply the update.
Kubernetes evaluates the readiness probe for the neo4j-cluster-001-p-003-dev pod and reports it as healthy before all databases are fully started and available, as indicated by SHOW DATABASES.
At this point, the neo4j-cluster-001-p-003-dev pod is considered healthy by the readiness probe, but not all of its databases are fully started and available.
SHOW DATABASES output for neo4j-cluster-001-p-003-dev after the readiness probe reports the pod as healthy:
Databases:
neo4j:
requestedStatus: online
currentStatus: online
system:
requestedStatus: online
currentStatus: online
some-large-database:
requestedStatus: online
currentStatus: starting
The application health status is propagated up to the parent ArgoCD application, which reports the neo4j-cluster-001-p-003-dev member as healthy as well.
Lost quorum: 3/3 - neo4j-cluster-001-p-001-dev, neo4j-cluster-001-p-002-dev, and neo4j-cluster-001-p-003-dev are not available for database some-large-database.
some-large-database cluster members available: 0/3
Reads available: no
Writes available: no
Implications:
- Complete service outage for the customer with database some-large-database.
- A Sev1 is created for the outage.
- Customer is unhappy.
- Reputational damage.
- Increased risk of customer churn.
Describe the solution you'd like
Option 1:
Customizable Health Checks (Readiness, Liveness, Startup Probes):
Summary: Provide support for customizing health checks/overrides for the readiness, liveness and startup probes in the Neo4J helm chart.
Details:
- This would allow us and other customers to customize/override the readiness, liveness, and startup probes in the Neo4J helm chart.
- The current behavior of the readiness, liveness, and startup probes can be preserved as the default by moving the current probe settings into values.yaml instead of hardcoding them in the StatefulSet template: https://github.com/neo4j/helm-charts/blob/dev/neo4j/templates/neo4j-statefulset.yaml#L158.
Use cases: Wait for all databases to be fully started and available before the readiness/startup probe reports a neo4j cluster member k8s pod as healthy.
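A hypothetical values.yaml shape for such an override is sketched below. The key names are invented for illustration (the chart accepts nothing like this today, which is the point of the request), and the exec command assumes credentials are available to the container via a NEO4J_PASSWORD environment variable or similar:

```yaml
# Hypothetical values.yaml override - key names are invented for illustration.
readinessProbe:
  custom:                 # if set, rendered verbatim instead of the hardcoded tcpSocket check
    exec:
      command:
        - /bin/bash
        - -c
        - |
          # Fail until every database reports currentStatus = 'online'.
          # (In a cluster you may also want to filter on the address column
          # so that only the local member's allocations are considered.)
          notOnline=$(cypher-shell -a neo4j://localhost:7687 \
            -u neo4j -p "${NEO4J_PASSWORD}" --format plain \
            "SHOW DATABASES YIELD name, currentStatus WHERE currentStatus <> 'online' RETURN count(*)" \
            | tail -n 1)
          [ "${notOnline}" = "0" ]
    periodSeconds: 10
    failureThreshold: 90  # tolerate databases that take 10-15 minutes to start
```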
Pros:
- Low risk of breaking changes for existing Neo4J customers.
- Backward compatibility with current behavior.
- Allows customization of health checks to ensure all databases are fully started before marking a pod as healthy.
- Reduces the risk of service outages.
- Requires less development effort.
Cons:
- Not all customers may want to customize health checks.
- Optional customization means not all customers will benefit.
Related PR/Implementation: https://github.com/neo4j/helm-charts/pull/338
Option 2:
Update Readiness/Startup Probes Implementation to Wait for All Databases to be Fully Started and Available Before Reporting a Neo4J Cluster Member K8s Pod as Healthy.
Summary: Update the readiness/startup probes default behavior to wait for all databases to be fully started and available before the readiness probe reports a Neo4j cluster member k8s pod as healthy.
Use cases: Wait for all databases to be fully started and available before the readiness/startup probe reports a Neo4j cluster member k8s pod as healthy.
Details:
- Update the default readiness/startup probe behavior to wait for all databases to be fully started and available before reporting a Neo4J cluster member k8s pod as healthy (see the sketch after this list).
- The liveness probe can continue to check the health of the bolt port (a database that has not yet started should not trigger a pod restart), while the readiness probe is held back until all databases are fully started and available.
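A rough sketch of what this split could render to (illustrative only, not a patch; the helper script path is hypothetical):

```yaml
# Illustrative rendering of Option 2's proposed defaults.
livenessProbe:
  tcpSocket:
    port: 7687            # bolt reachable - a still-starting database must not restart the pod
  periodSeconds: 5
  failureThreshold: 20
readinessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      # Hypothetical helper baked into the chart/image, running the same
      # SHOW DATABASES currentStatus check sketched under Option 1.
      - /startup/check-all-databases-online.sh
  periodSeconds: 10
  failureThreshold: 90    # tolerate databases that take 10-15 minutes to start
```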
Pros:
- Ensures safe rolling deployments by waiting for all databases to start before marking a pod as healthy.
- Benefits all customers by preventing a pod from receiving traffic before a database is guaranteed to be available.
Cons:
- Not backward compatible with current behavior.
- Higher risk of breaking changes for existing customers.
- Some customers may not want the readiness probe to wait for all databases to be online first.
- More development effort required.
Implementing either of these solutions would address the issues observed with the current readiness probe, improving the stability and reliability of Neo4J cluster deployments managed with ArgoCD.
Describe alternatives you've considered
- Forking the Neo4J helm chart to implement the suggested solution or vendoring the Neo4J helm chart into our Neo4j configuration repository.
Thank you for your attention to this matter. We look forward to your feedback on the reported issue and proposed solutions.