
HPA and NGF Controller Conflicting

Open cmbankester opened this issue 1 month ago • 5 comments

Describe the bug

When autoscaling.enable: true is configured in the Helm chart, the NGF controller updates the deployment and modifies the spec.replicas field in conflict with the HPA. This causes the deployment to scale up and down in the same second, resulting in constant pod churn and preventing the HPA from scaling up or down consistently.

To Reproduce

  1. Deploy NGF with autoscaling enabled using these Helm values:
nginx:
  autoscaling:
    enable: true
    metrics:
      - external:
          metric:
            name: <some-external-metric-providing-connection-count-across-all-replicas>
          target:
            type: Value
            value: 20000
        type: External
    minReplicas: 1
    maxReplicas: 10
  2. Wait for the HPA to trigger a scale-down event

  3. Observe scale events:

kubectl get events -n ngf --sort-by='.lastTimestamp' -o custom-columns='when:lastTimestamp,msg:message,reason:reason,obj:involvedObject.name,cmp:source.component' | grep -E "SuccessfulRescale|ScalingReplicaSet"
  4. Check which manager last updated the deployment's replicas:
kubectl get deployment nginx-public-gateway-nginx -n ngf --show-managed-fields -o json | \
  jq '.metadata.managedFields[] | select(.fieldsV1."f:spec"."f:replicas") | {manager: .manager, operation: .operation, time: .time}'

Expected behavior

When autoscaling.enable: true, the NGF controller should:

  1. Create the HPA resource
  2. Not change the spec.replicas field after HPA is created
  3. Allow HPA to be the sole controller managing replica count
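A controller-side fix could follow the pattern Kubernetes recommends for HPA-managed workloads: once the HPA exists, leave `spec.replicas` out of the applied configuration entirely. A minimal sketch of that decision logic (hypothetical names, not the actual NGF code):

```go
package main

import "fmt"

// desiredReplicas is a hypothetical sketch of the choice the controller
// should make: when autoscaling is enabled, return nil so spec.replicas is
// omitted from the server-side apply patch and the HPA remains the sole
// owner of the field; otherwise return the statically configured count.
func desiredReplicas(autoscalingEnabled bool, configuredReplicas int32) *int32 {
	if autoscalingEnabled {
		// An omitted field is not reset by server-side apply, so the
		// HPA's last-written replica count is preserved.
		return nil
	}
	return &configuredReplicas
}

func main() {
	if r := desiredReplicas(true, 3); r == nil {
		fmt.Println("autoscaling on: spec.replicas omitted from apply")
	}
	if r := desiredReplicas(false, 3); r != nil {
		fmt.Printf("autoscaling off: spec.replicas = %d\n", *r)
	}
}
```

With server-side apply, omitting the field also relinquishes the "gateway" manager's ownership of `f:spec.f:replicas`, so the conflict visible in `managedFields` would disappear.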

Your environment

  • Version of NGINX Gateway Fabric: 2.1.2 (commit: 877c415d596ebb86b61f20ed77c7db8847a10f6c, date: 2025-09-25T19:31:07Z)
  • Kubernetes Version: v1.32.6
  • Platform: Azure Kubernetes Service (AKS)
  • Exposure method: Service type LoadBalancer
  • Helm Chart Version: nginx-gateway-fabric-2.1.2

Observed behavior

Events show deployment scaling up and down in the same second:

> kubectl get events -n immy-routing --sort-by='.lastTimestamp' -o custom-columns='when:lastTimestamp,msg:message,reason:reason,obj:involvedObject.name,cmp:source.component' | grep -E "SuccessfulRescale|ScalingReplicaSet"
2025-10-02T18:17:53Z   New size: 10; reason: external metric datadogmetric@immy-routing:nginx-connection-count-connections(nil) above target      SuccessfulRescale   nginx-public-gateway-nginx                      horizontal-pod-autoscaler
2025-10-02T18:19:38Z   Scaled down replica set nginx-public-gateway-nginx-57b699c549 from 10 to 8                                                 ScalingReplicaSet   nginx-public-gateway-nginx                      deployment-controller
2025-10-02T18:19:38Z   Scaled up replica set nginx-public-gateway-nginx-57b699c549 from 8 to 10                                                   ScalingReplicaSet   nginx-public-gateway-nginx                      deployment-controller
2025-10-02T18:21:38Z   Scaled down replica set nginx-public-gateway-nginx-57b699c549 from 10 to 9                                                 ScalingReplicaSet   nginx-public-gateway-nginx                      deployment-controller
2025-10-02T18:21:38Z   Scaled up replica set nginx-public-gateway-nginx-57b699c549 from 9 to 10                                                   ScalingReplicaSet   nginx-public-gateway-nginx                      deployment-controller
2025-10-02T18:25:23Z   Scaled up replica set ngf-nginx-gateway-fabric-74db69c968 from 0 to 1                                                      ScalingReplicaSet   ngf-nginx-gateway-fabric                        deployment-controller
2025-10-02T18:25:26Z   Scaled down replica set ngf-nginx-gateway-fabric-7b99997d79 from 1 to 0                                                    ScalingReplicaSet   ngf-nginx-gateway-fabric                        deployment-controller
2025-10-02T18:25:39Z   New size: 9; reason: All metrics below target                                                                              SuccessfulRescale   nginx-public-gateway-nginx                      horizontal-pod-autoscaler
2025-10-02T18:51:42Z   Scaled down replica set nginx-public-gateway-nginx-57b699c549 from 9 to 8                                                  ScalingReplicaSet   nginx-public-gateway-nginx                      deployment-controller
2025-10-02T18:51:42Z   New size: 8; reason: All metrics below target                                                                              SuccessfulRescale   nginx-public-gateway-nginx                      horizontal-pod-autoscaler

Checking managed fields confirms the NGF controller (the "gateway" field manager) modified spec.replicas in the same second as the HPA:

> kubectl get deployment nginx-public-gateway-nginx -n immy-routing --show-managed-fields -o json | \
  jq '.metadata.managedFields[] | select(.fieldsV1."f:spec"."f:replicas") | {manager: .manager, operation: .operation, time: .time}'
{
  "manager": "gateway",
  "operation": "Update",
  "time": "2025-10-02T18:51:42Z"
}

And the replica count set by the HPA has been overwritten back to the old value:

> kubectl get deployment nginx-public-gateway-nginx -n immy-routing -o json | jq '.spec.replicas'     
9

Additional context

Suspected root cause: The NGF controller is updating the deployment, including the spec.replicas field, even when the HPA is enabled, resulting in a race condition:

  1. The HPA decides to scale down (e.g., 10 → 8 replicas)
  2. The HPA updates the deployment's .spec.replicas to 8
  3. The deployment controller terminates the surplus pods
  4. The NGF controller reconciles and resets .spec.replicas back to the old value (e.g., 10)
  5. The deployment controller spins the pods back up
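The sequence above can be condensed into a toy simulation (illustrative only; real reconciles are asynchronous) showing why the HPA's write never sticks:

```go
package main

import "fmt"

// finalReplicas models one HPA scale-down followed by an NGF reconcile that
// still includes spec.replicas in its desired state. Toy model, not real
// controller code: each assignment stands in for one write to the Deployment.
func finalReplicas(controllerSpec, hpaTarget int32) int32 {
	replicas := controllerSpec // steady state, e.g. 10
	replicas = hpaTarget       // steps 1-3: HPA writes 8, surplus pods terminate
	replicas = controllerSpec  // step 4: reconcile resets the field to 10
	return replicas            // step 5: pods spin back up
}

func main() {
	// The HPA's target of 8 is lost; the controller's value wins.
	fmt.Println("final replicas:", finalReplicas(10, 8))
}
```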

Impact on production:

  • Pods restart every 2 minutes (matching the HPA scale-down period)
  • Thousands of websocket connections dropped on each restart
  • Connection storms after scale-downs cause metric spikes
  • HPA unable to effectively manage scaling due to constant interference

cmbankester avatar Oct 02 '25 19:10 cmbankester

Hi @cmbankester! Welcome to the project! 🎉

Thanks for opening this issue! Be sure to check out our Contributing Guidelines and the Issue Lifecycle while you wait for someone on the team to take a look at this.

nginx-bot[bot] avatar Oct 02 '25 19:10 nginx-bot[bot]

Hey @cmbankester, thanks for submitting this issue.

Just to clarify — the Helm values you shared configure autoscaling for the NGINX data plane deployment.

However, the logs and events you attached are related to the NGF control plane pod, not the data plane.

If your goal is to scale the NGF control plane pod, you’ll want to configure autoscaling under the nginxGateway section instead, for example:

nginxGateway:
  autoscaling:
    enable: true
    metrics:
      - external:
          metric:
            name: <some-external-metric-providing-connection-count-across-all-replicas>
          target:
            type: Value
            value: 20000
        type: External
    minReplicas: 1
    maxReplicas: 10

Once I have more information, I can dig into this further.

salonichf5 avatar Oct 03 '25 20:10 salonichf5

I am fairly certain these are the data plane resources: my data plane deployment and pod names are formatted nginx-gateway-fabric-nginx, and the events and logs above all refer to the data plane deployment and pods. The behavior I am seeing is the data plane HPA's calculated replica count being applied to the data plane deployment and then immediately overwritten.

cmbankester avatar Oct 03 '25 22:10 cmbankester

Sounds good. I’ll test this issue again today with some actual traffic. I was testing on Kind before and didn’t have enough load to trigger this behavior. I’ll keep you updated — I didn’t see an issue earlier, so I wanted to cross-check with you.

salonichf5 avatar Oct 06 '25 16:10 salonichf5

Thanks! Yes I think the behavior may be related to increased load, as I did not see this occur in our dev/qa clusters but I saw it happen in multiple prod clusters that have 6+ data plane pods. Also worth mentioning I haven't tried scaling my control plane pods, so all my clusters have 1 control plane pod. I have since reverted the hpa from the prod clusters, but I still have it enabled in two qa clusters, and I should be able to simulate some load to try to replicate the behavior.

Let me know if you'd like me to try anything out

cmbankester avatar Oct 06 '25 18:10 cmbankester

Hey @cmbankester, we will be releasing a fix for this as part of our patch release (v2.3.1) next week. Hope it improves things for you.

salonichf5 avatar Nov 07 '25 18:11 salonichf5