
operator stuck in scale down loop

Open a2k47l opened this issue 3 months ago • 4 comments

Expected Behavior

When CPU load is below scaleDownCPUBoundary, the index replica count should be reduced, and the node count should go down as a result.

Actual Behavior

When CPU load is below scaleDownCPUBoundary, index replica count is not reduced. Thus number of nodes does not go down.

Logs:

    time="2024-03-19T06:27:48Z" level=info msg="Waiting for operation to stop" eds=es-mci-data namespace=mci
    time="2024-03-19T06:27:49Z" level=info msg="Terminating operator loop." eds=es-mci-data namespace=mci
    time="2024-03-19T06:27:50Z" level=info msg="Waiting for operation to stop" eds=es-mci-data namespace=mci
    time="2024-03-19T06:27:50Z" level=error msg="Failed to operate resource: failed to update status: Put "https://10.10.0.1:443/apis/zalando.org/v1/namespaces/mci/elasticsearchdatasets/es-mci-data/status?timeout=30s": context canceled"
    time="2024-03-19T06:27:50Z" level=info msg="Terminating operator loop." eds=es-mci-data namespace=mci
    time="2024-03-19T06:28:19Z" level=info msg="Scaling hint: DOWN" eds=es-mci-data namespace=mci
    time="2024-03-19T06:28:49Z" level=info msg="Scaling hint: DOWN" eds=es-mci-data namespace=mci
    time="2024-03-19T06:29:19Z" level=info msg="Scaling hint: DOWN" eds=es-mci-data namespace=mci
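To compare how often the operator logs a scaling hint against how often it logs an error, the output can be tallied with a small script. This is a minimal sketch that assumes the key=value log format shown above; `summarize` and `SAMPLE` are hypothetical names for illustration, not part of es-operator.

```python
import re
from collections import Counter

# Patterns for the key=value fields seen in the log excerpt above.
LEVEL_RE = re.compile(r'level=(\w+)')
HINT_RE = re.compile(r'msg="Scaling hint: (\w+)"')


def summarize(log_text):
    """Return (levels, hints): counts of log levels and of scaling hints."""
    levels = Counter(LEVEL_RE.findall(log_text))
    hints = Counter(HINT_RE.findall(log_text))
    return levels, hints


# A few entries copied from the excerpt in this issue.
SAMPLE = '''
time="2024-03-19T06:27:50Z" level=error msg="Failed to operate resource: failed to update status: Put "https://10.10.0.1:443/apis/zalando.org/v1/namespaces/mci/elasticsearchdatasets/es-mci-data/status?timeout=30s": context canceled"
time="2024-03-19T06:27:50Z" level=info msg="Terminating operator loop." eds=es-mci-data namespace=mci
time="2024-03-19T06:28:19Z" level=info msg="Scaling hint: DOWN" eds=es-mci-data namespace=mci
time="2024-03-19T06:28:49Z" level=info msg="Scaling hint: DOWN" eds=es-mci-data namespace=mci
time="2024-03-19T06:29:19Z" level=info msg="Scaling hint: DOWN" eds=es-mci-data namespace=mci
'''

if __name__ == "__main__":
    levels, hints = summarize(SAMPLE)
    print("levels:", dict(levels))
    print("hints:", dict(hints))
```

Repeated DOWN hints with no later message about a changed replica count would match the behavior described here: the operator detects that a scale-down is warranted but does not act on it.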

Steps to Reproduce the Problem

  1. I have a simple setup: 1 ES cluster with 1 master and 1 EDS managed by es-operator. I have a single index with 2 shards.

  2. Scaling options:

       enabled: true
       minReplicas: 1
       maxReplicas: 6
       minShardsPerNode: 1
       maxShardsPerNode: 1
       minIndexReplicas: 0
       maxIndexReplicas: 5
       scaleUpCPUBoundary: 50
       scaleUpCooldownSeconds: 60
       scaleUpThresholdDurationSeconds: 30
       scaleDownCPUBoundary: 40
       scaleDownCooldownSeconds: 60
       scaleDownThresholdDurationSeconds: 30
       diskUsagePercentScaledownWatermark: 75

  3. When I start a basic busybox load generator, CPU usage increases and es-operator scales up by increasing the replica count of the index. But when I stop the load generator, CPU usage goes down and the replica count is not updated, so the number of nodes remains high.
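For reference, the node counts implied by this configuration can be worked out with simple capacity arithmetic. The sketch below only assumes that an index with P primary shards and R replicas has P × (1 + R) shard copies, and that minShardsPerNode and maxShardsPerNode both being 1 pins each data node to one shard copy. It is not the operator's actual scaling algorithm, and `nodes_needed` is a hypothetical helper.

```python
import math


def total_shard_copies(primaries, index_replicas):
    """Primaries plus all replica copies of one index."""
    return primaries * (1 + index_replicas)


def nodes_needed(primaries, index_replicas, shards_per_node):
    """Data nodes required to hold every shard copy at the given density."""
    copies = total_shard_copies(primaries, index_replicas)
    return math.ceil(copies / shards_per_node)


PRIMARIES = 2            # "single index with 2 shards"
SHARDS_PER_NODE = 1      # minShardsPerNode == maxShardsPerNode == 1
MAX_NODES = 6            # maxReplicas of the EDS
MIN_INDEX_REPLICAS = 0   # minIndexReplicas
MAX_INDEX_REPLICAS = 5   # maxIndexReplicas

if __name__ == "__main__":
    for r in range(MIN_INDEX_REPLICAS, MAX_INDEX_REPLICAS + 1):
        n = nodes_needed(PRIMARIES, r, SHARDS_PER_NODE)
        fits = "fits" if n <= MAX_NODES else "exceeds maxReplicas"
        print(f"index replicas={r}: {n} nodes ({fits})")
```

Under these assumptions, a successful scale-down to minIndexReplicas: 0 would leave 2 data nodes, and the largest index replica count that fits within maxReplicas: 6 is 2.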

Specifications

  • Version: ES 8.12.2, es-operator: latest (should be 0.1.4)
  • Platform: Gcloud k8s cluster

a2k47l • Mar 19 '24 06:03