Reconciliation doesn't progress if it encounters errors with the vscaledobject.kb.io admission webhook

Open vytautaskubilius opened this issue 11 months ago • 6 comments

Describe the bug

When deploying applications that have a KEDA ScaledObject, if a configuration problem causes the vscaledobject.kb.io admission webhook to reject the request, Flux fails to reconcile the Kustomization even after the problem is fixed.

Steps to reproduce

Create a simple Kustomization that consists of a Deployment and a ScaledObject. The Deployment manifest intentionally has the resources section commented out to induce the initial error.

kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: default
resources:
  - test.yaml

test.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - name: test
        image: busybox
        command: ["sleep", "infinity"]
        # resources:
        #   requests:
        #     cpu: 50m
        #     memory: 50Mi

---

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: test
spec:
  maxReplicaCount: 2
  minReplicaCount: 1
  scaleTargetRef:
    name: test
  triggers:
    - metadata:
        value: "50"
      type: cpu
      metricType: Utilization

Flux Kustomization:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: test
  namespace: flux-system
spec:
  interval: 10m0s
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./kustomize/apps/test
  prune: true

When this Kustomization is applied, the reconciliation will fail with an error similar to this one:

ScaledObject/default/test dry-run failed, reason: Forbidden: admission webhook "vscaledobject.kb.io" denied the request: the scaledobject has a cpu trigger but the container test doesn't have the cpu request defined

After the configuration issue is addressed (i.e. the resources section above is uncommented) and the change is committed to Git, Flux continues to report the error, even after manually reconciling the Git source and the Kustomization itself. Applying the configuration with kubectl apply -k . succeeds, and a subsequent flux reconcile kustomization test then works.
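For reference, here is a sketch of what the corrected Deployment in test.yaml would look like once the resources section is restored. The values are the ones from the commented-out lines; the only assumption is placement: resources is nested under the container entry, since the webhook message complains about the cpu request of the container named test.

```yaml
# test.yaml (Deployment only), with the resources section uncommented.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - name: test
        image: busybox
        command: ["sleep", "infinity"]
        # The cpu request is what the ScaledObject's cpu trigger needs.
        resources:
          requests:
            cpu: 50m
            memory: 50Mi
```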

Expected behavior

Flux automatically picks up the newest changes that contain configuration fixes and applies them to the Kustomization.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

v2.1.1

Flux check

► checking prerequisites
✔ Kubernetes 1.27.4-eks-2d98532 >=1.25.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.36.1
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.36.1
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.30.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.1.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.1.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.1.1
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta2
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ imagepolicies.image.toolkit.fluxcd.io/v1beta2
✔ imagerepositories.image.toolkit.fluxcd.io/v1beta2
✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta2
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed

Git provider

GitHub

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

vytautaskubilius • Sep 26 '23 17:09

I am having this issue as well: depending on a race, the webhook can fail transiently because the Deployments that the ScaledObjects target don't exist yet. Even after the Deployments exist, the HelmRelease will not reconcile until I suspend and resume it.

23caterpie • Oct 31 '23 21:10

Same here.

ceilingfish • Nov 07 '23 16:11

Same here.

jonathan-fileread • Dec 15 '23 21:12

Might be a KEDA error though; I'm on Argo CD.

jonathan-fileread • Dec 15 '23 21:12

I've been able to get around this by manually applying the Deployment to the cluster, which then allows Flux to create the ScaledObject and continue reconciling normally. It's an annoying workaround, but at least you only have to do it once when setting up a new service.

TheEdgeOfRage • Dec 29 '23 11:12

Same here with Argo CD 2.9.5 and Custom Metric Autoscaler (KEDA) 2.11.2 on OpenShift 4.12.30: after fixing the Deployment that was missing CPU requests, the ScaledObject is still not created, and the keda-admission Pod reports:

2024-02-16T16:05:43Z ERROR scaledobject-validation-webhook validation error {"error": "the scaledobject has a cpu trigger but the container XXXXX doesn't have the cpu request defined"}

elmazzun • Feb 16 '24 16:02

We discussed this issue with the KEDA devs in the Flux Slack some months ago, and they are aware of it. If it's not fixed, please open an issue in the KEDA repo; there is nothing we can do about it in Flux.

stefanprodan • Apr 17 '24 17:04
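Editorial sketch, not part of the thread: the workarounds reported above all amount to getting the Deployment into the cluster before the ScaledObject is validated. One way to express that ordering declaratively might be to split the manifests across two Flux Kustomizations linked with dependsOn. This is untested against this issue and rests on the assumption that the webhook validates the ScaledObject against the Deployment as it currently exists in the cluster. The names test-workload and test-scaling and the two sub-paths are hypothetical.

```yaml
# Hypothetical split: the Deployment is reconciled first.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: test-workload          # hypothetical name
  namespace: flux-system
spec:
  interval: 10m0s
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./kustomize/apps/test/workload   # hypothetical path holding the Deployment
  prune: true
  wait: true                   # report Ready only once the Deployment is healthy
---
# The ScaledObject is reconciled only after test-workload is Ready.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: test-scaling           # hypothetical name
  namespace: flux-system
spec:
  dependsOn:
    - name: test-workload
  interval: 10m0s
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./kustomize/apps/test/scaling    # hypothetical path holding the ScaledObject
  prune: true
```

Whether this avoids the stuck reconciliation depends on how the KEDA webhook resolves the scale target, so treat it as a starting point rather than a confirmed fix.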