[Features] Support Customizable Probe Intervals at Cluster Level for StarRocks FE
Description
We use KubeBlocks to manage our StarRocks clusters. After startup or recovery, the Frontend (FE) nodes must replay their Write-Ahead Logs (WAL), which can take a significant amount of time. During this replay the default liveness and readiness probes fail, so the pod is terminated before the replay can complete.
I considered using a startupProbe to cover this window. However, there is no reliable way during startup to distinguish an engine that is healthily replaying its WAL from one that is stuck. It would therefore be best if the KubeBlocks Cluster (KB Cluster) supported custom probe strategies at the cluster level, or provided guidance on how to handle such scenarios.
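For reference, the startupProbe workaround I considered would look roughly like the sketch below (the port and thresholds are illustrative). It only buys a longer startup window; the TCP check cannot tell healthy replay from a hang, which is why it is not a real solution:

```yaml
# Illustrative only. A startupProbe with a large failureThreshold gives the
# pod up to periodSeconds * failureThreshold (30 * 40 = 1200 s, i.e. ~20 min)
# before liveness/readiness checks take over. The TCP check cannot
# distinguish "healthy WAL replay" from "stuck process".
startupProbe:
  tcpSocket:
    port: 9030          # FE query port (StarRocks default); adjust as needed
  periodSeconds: 30
  failureThreshold: 40  # tolerate up to ~20 minutes of startup/replay
```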
Use Case
- Scenario: When a StarRocks FE node starts up or recovers, it needs to replay WAL logs, which can take a long time depending on the volume of logs.
- Problem: The default probe settings from the ComponentDefinition (cmpd) do not allow enough time for this extended replay window, so the probes fail and the pod is restarted prematurely.
- Need: A way to temporarily adjust the liveness and readiness probe strategies for a specific cluster, increasing the total probe timeout so that the replay process completes without interruption.
Proposed Solution
Add support for customizing liveness and readiness probe intervals at the cluster level. Specifically:
- Cluster-level Configuration: Allow users to specify custom probe intervals directly in the cluster configuration YAML file.
- Temporary Override Mechanism: Provide a mechanism to temporarily override the default probe settings for a specific cluster, ensuring that critical operations like WAL replay can complete without interference from failed health checks.
- Grace Period: Optionally, introduce a grace period setting that applies only during specific operations (e.g., startup, recovery).
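One possible shape for the temporary override, purely to illustrate the idea (the annotation key and its semantics are hypothetical; nothing like this exists in KubeBlocks today): an annotation on the Cluster that the controller honors for a bounded time, then reverts to the defaults.

```yaml
# Hypothetical annotation-based override (not an existing KubeBlocks API).
# Sketches a bounded, cluster-scoped relaxation of probe settings that
# expires on its own, so operators cannot forget to restore the defaults.
metadata:
  annotations:
    apps.kubeblocks.io/probe-override: |
      {"failureThreshold": 60, "expiresAfter": "2h"}
```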
Example Configuration
```yaml
apiVersion: apps.kubeblocks.io/v1alpha1
kind: Cluster
metadata:
  name: starrocks-cluster
spec:
  # ... other specifications ...
  componentSpecs:
    - componentDef: starrocks-fe-sd
      name: fe
      replicas: 1
      livenessProbe:
        initialDelaySeconds: 600  # Increased delay to accommodate WAL replay
        periodSeconds: 30
        timeoutSeconds: 10
        successThreshold: 1
        failureThreshold: 5
      readinessProbe:
        initialDelaySeconds: 600  # Increased delay to accommodate WAL replay
        periodSeconds: 30
        timeoutSeconds: 10
        successThreshold: 1
        failureThreshold: 3
```
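With the values above, the liveness probe tolerates roughly initialDelaySeconds + periodSeconds × failureThreshold = 600 + 30 × 5 = 750 s before the first restart. If the full probe spec were exposed, an exec-based check could go further by asking the FE itself whether replay has finished (the endpoint below is an assumption; verify it against your StarRocks version):

```yaml
# Sketch of an exec-based readiness check against the FE HTTP port (8030 by
# default). /api/bootstrap is assumed to report success only after metadata
# replay completes; confirm this behavior for your StarRocks version.
readinessProbe:
  exec:
    command: ["sh", "-c", "curl -sf http://localhost:8030/api/bootstrap"]
  periodSeconds: 30
  failureThreshold: 3
```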
Benefits
- Improved Stability: Ensures that critical processes like WAL replay can complete without unnecessary pod restarts.
- Flexibility: Allows operators to fine-tune health checks based on specific operational requirements.
- Operational Efficiency: Reduces downtime and ensures smoother operation of StarRocks clusters during maintenance or recovery periods.
Additional Context
- Current Behavior: Default probe settings cause premature termination of pods during lengthy operations like WAL replay.
- Desired Behavior: Pods should remain running until critical operations are completed, as specified by customized probe settings.