
[bug] HelmRepository blocked if the referenced secret does not exist on startup

Open genofire opened this issue 2 years ago • 16 comments

Steps:

  • install fluxcd
  • add HelmRepository resources with a secretRef
  • wait until the HelmRepository fails
  • add the Secret (which was referenced)
    • or have it unsealed by a SealedSecret ...
  • ....

Error behaviour:

  • HelmRepository does not reconcile with the new, working secret

Expected behaviour:

  • HelmRepository reconciles after the configured interval

Workaround:

  • kill / restart source-controller pod

Affected fluxcd versions:

  • 0.41.2
  • 2.0.0-rc5
  • 2.0.1

genofire avatar Jul 21 '23 08:07 genofire

Could it be caused by the error handling here?

https://github.com/fluxcd/source-controller/blob/7f40be76e90b2d4afe9f8f9d7f53ac719fe1205e/internal/controller/helmrepository_controller.go#L411-L416

In the GitRepository controller (where it works), we get a "Generic" error instead: https://github.com/fluxcd/source-controller/blob/7f40be76e90b2d4afe9f8f9d7f53ac719fe1205e/internal/controller/gitrepository_controller.go#L485-L491

Maybe the older error type causes the HelmRepository to block permanently.

genofire avatar Jul 21 '23 08:07 genofire

Hi, I just tried it but I couldn't reproduce it. I created the following helmrepo:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: podinfo
  namespace: default
spec:
  interval: 1m
  url: https://stefanprodan.github.io/podinfo
  secretRef:
    name: "example"

The secret doesn't exist yet. Got the following errors in the logs:

{"level":"error","ts":"2023-07-21T16:04:40.646+0530","msg":"Reconciler error","controller":"helmrepository","controllerGroup":"source.toolkit.fluxcd.io","controllerKind":"HelmRepository","HelmRepository":{"name":"podinfo","namespace":"default"},"namespace":"default","name":"podinfo","reconcileID":"3276abd8-8a54-4057-8bc7-ab7664327a44","error":"failed to get secret 'default/example': secrets \"example\" not found"}

The status of helmrepo shows (kubectl get helmrepository podinfo -o yaml):

status:
  conditions:
  - lastTransitionTime: "2023-07-21T10:34:45Z"
    message: building artifact
    observedGeneration: 1
    reason: ProgressingWithRetry
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2023-07-21T10:34:45Z"
    message: 'failed to get secret ''default/example'': secrets "example" not found'
    observedGeneration: 1
    reason: AuthenticationFailed
    status: "False"
    type: Ready
  - lastTransitionTime: "2023-07-21T10:34:40Z"
    message: 'failed to get secret ''default/example'': secrets "example" not found'
    observedGeneration: 1
    reason: AuthenticationFailed
    status: "True"
    type: FetchFailed
  observedGeneration: -1

After creating the secret, within a few seconds, the logs show:

{"level":"info","ts":"2023-07-21T16:06:16.387+0530","msg":"stored fetched index of size 43.13kB from 'https://stefanprodan.github.io/podinfo'","controller":"helmrepository","controllerGroup":"source.toolkit.fluxcd.io","controllerKind":"HelmRepository","HelmRepository":{"name":"podinfo","namespace":"default"},"namespace":"default","name":"podinfo","reconcileID":"96dcf686-9538-462e-b832-be6f1f873be5"}

and the helmrepo status shows:

status:
  artifact:
    digest: sha256:80b091a3a69b9ecfebde40ce2a5f19e95f8f8ea956bd5635a31701f7fad1616e
    lastUpdateTime: "2023-07-21T10:36:16Z"
    path: helmrepository/default/podinfo/index-80b091a3a69b9ecfebde40ce2a5f19e95f8f8ea956bd5635a31701f7fad1616e.yaml
    revision: sha256:80b091a3a69b9ecfebde40ce2a5f19e95f8f8ea956bd5635a31701f7fad1616e
    size: 43126
    url: http://:0/helmrepository/default/podinfo/index-80b091a3a69b9ecfebde40ce2a5f19e95f8f8ea956bd5635a31701f7fad1616e.yaml
  conditions:
  - lastTransitionTime: "2023-07-21T10:36:16Z"
    message: 'stored artifact: revision ''sha256:80b091a3a69b9ecfebde40ce2a5f19e95f8f8ea956bd5635a31701f7fad1616e'''
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready
  - lastTransitionTime: "2023-07-21T10:36:16Z"
    message: 'stored artifact: revision ''sha256:80b091a3a69b9ecfebde40ce2a5f19e95f8f8ea956bd5635a31701f7fad1616e'''
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: ArtifactInStorage
  observedGeneration: 1
  ...

An object can get blocked if it has a Stalled condition in its status, which is not the case here. Can you check the status of the blocked helmrepo and share it?
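
For anyone who wants to script that check, here is a minimal sketch in Python. It assumes the object's status has already been loaded as a dict (for example from the JSON output of kubectl get). The helper names find_condition and is_stalled are hypothetical and are not part of Flux; the sample data is the set of conditions from the failing status shown earlier in this comment.

```python
from typing import Optional


def find_condition(status: dict, cond_type: str) -> Optional[dict]:
    """Return the first condition of the given type, or None if absent."""
    for cond in status.get("conditions", []):
        if cond.get("type") == cond_type:
            return cond
    return None


def is_stalled(status: dict) -> bool:
    """True only when a Stalled condition is present with status "True"."""
    cond = find_condition(status, "Stalled")
    return cond is not None and cond.get("status") == "True"


# Conditions copied from the failing status earlier in this comment
# (the secret did not exist yet).
status = {
    "conditions": [
        {"type": "Reconciling", "status": "True", "reason": "ProgressingWithRetry"},
        {"type": "Ready", "status": "False", "reason": "AuthenticationFailed"},
        {"type": "FetchFailed", "status": "True", "reason": "AuthenticationFailed"},
    ],
    "observedGeneration": -1,
}

print(is_stalled(status))  # prints False: no Stalled condition is present
```

The object is failing but still marked as Reconciling with reason ProgressingWithRetry, so by this check it is expected to be retried rather than blocked.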

darkowlzz avatar Jul 21 '23 10:07 darkowlzz

@genofire when reporting bugs please say which version you're using by simply posting the flux check output.

stefanprodan avatar Jul 21 '23 10:07 stefanprodan

► checking prerequisites
✔ Kubernetes 1.24.6 >=1.24.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.34.1
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.0.0-rc.4
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.0.0-rc.4
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.0.0-rc.5
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta2
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta2
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed

genofire avatar Jul 21 '23 10:07 genofire

That's the CLI version, what about controllers and CRDs? flux check prints those.

stefanprodan avatar Jul 21 '23 11:07 stefanprodan

No, I mean that the namespace has the version label 2.0.0-rc5. I have edited/updated the message.

genofire avatar Jul 21 '23 11:07 genofire

Can you please upgrade to Flux v2.0.1 and see if this issue persists?

stefanprodan avatar Jul 21 '23 12:07 stefanprodan

That needs time: we have 30 clusters with staging.

genofire avatar Jul 21 '23 12:07 genofire

Not asking you to upgrade all of them, just one to rerun the test. We've tried to replicate this with 2.0.1 and the HelmRepository is not getting stuck. Also what type of repo are you using? OCI or Helm HTTP?

stefanprodan avatar Jul 21 '23 12:07 stefanprodan

It would also be helpful if you could post here the output of kubectl get helmrepository --show-managed-fields -oyaml for the one that's stuck.

stefanprodan avatar Jul 21 '23 12:07 stefanprodan

So the secret has now existed for 31 minutes:

helmrepo:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  annotations:
    meta.helm.sh/release-name: infra-infra-base
    meta.helm.sh/release-namespace: infra
  creationTimestamp: "2023-07-21T12:46:01Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    helm.toolkit.fluxcd.io/name: infra-base
    helm.toolkit.fluxcd.io/namespace: flux-system
  managedFields:
  - apiVersion: source.toolkit.fluxcd.io/v1beta2
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:meta.helm.sh/release-name: {}
          f:meta.helm.sh/release-namespace: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/managed-by: {}
          f:helm.toolkit.fluxcd.io/name: {}
          f:helm.toolkit.fluxcd.io/namespace: {}
      f:spec:
        .: {}
        f:interval: {}
        f:provider: {}
        f:secretRef:
          .: {}
          f:name: {}
        f:timeout: {}
        f:url: {}
    manager: helm-controller
    operation: Update
    time: "2023-07-21T12:46:01Z"
  - apiVersion: source.toolkit.fluxcd.io/v1beta2
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
    manager: source-controller
    operation: Update
    subresource: status
    time: "2023-07-21T12:49:23Z"
  name: opstree
  namespace: infra
  resourceVersion: "32531330664"
  uid: 3021755c-d010-454f-8b88-fecf6ded654f
spec:
  interval: 5m
  provider: generic
  secretRef:
    name: internal-artifactory-auth
  timeout: 60s
  url: https://repo-ex.internal.de/artifactory/ot-container-kit-helm-remote/
status:
  conditions:
  - lastTransitionTime: "2023-07-21T12:49:23Z"
    message: building artifact
    observedGeneration: 1
    reason: ProgressingWithRetry
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2023-07-21T12:49:23Z"
    message: 'failed to get secret ''infra/internal-artifactory-auth'': secrets "internal-artifactory-auth"
      not found'
    observedGeneration: 1
    reason: AuthenticationFailed
    status: "False"
    type: Ready
  - lastTransitionTime: "2023-07-21T12:46:02Z"
    message: 'failed to get secret ''infra/internal-artifactory-auth'': secrets "internal-artifactory-auth"
      not found'
    observedGeneration: 1
    reason: AuthenticationFailed
    status: "True"
    type: FetchFailed
  observedGeneration: -1

oci helmrepo:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  annotations:
    meta.helm.sh/release-name: infra-infra-base
    meta.helm.sh/release-namespace: infra
  creationTimestamp: "2023-07-21T12:46:01Z"
  generation: 1
  labels:
    app.kubernetes.io/managed-by: Helm
    helm.toolkit.fluxcd.io/name: infra-base
    helm.toolkit.fluxcd.io/namespace: flux-system
  managedFields:
  - apiVersion: source.toolkit.fluxcd.io/v1beta2
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:meta.helm.sh/release-name: {}
          f:meta.helm.sh/release-namespace: {}
        f:labels:
          .: {}
          f:app.kubernetes.io/managed-by: {}
          f:helm.toolkit.fluxcd.io/name: {}
          f:helm.toolkit.fluxcd.io/namespace: {}
      f:spec:
        .: {}
        f:interval: {}
        f:provider: {}
        f:secretRef:
          .: {}
          f:name: {}
        f:timeout: {}
        f:type: {}
        f:url: {}
    manager: helm-controller
    operation: Update
    time: "2023-07-21T12:46:01Z"
  - apiVersion: source.toolkit.fluxcd.io/v1beta2
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions: {}
    manager: source-controller
    operation: Update
    subresource: status
    time: "2023-07-21T12:49:23Z"
  name: weave-gitops
  namespace: infra
  resourceVersion: "32531330680"
  uid: 2a2a7e1b-a809-4992-9f76-a8c5d7650133
spec:
  interval: 60m0s
  provider: generic
  secretRef:
    name: internal-artifactory-auth
  timeout: 60s
  type: oci
  url: oci://docker-virtual.repo-ex.internal.de/weaveworks/charts
status:
  conditions:
  - lastTransitionTime: "2023-07-21T12:49:22Z"
    message: 'processing object: new generation -1 -> 1'
    observedGeneration: 1
    reason: ProgressingWithRetry
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2023-07-21T12:46:02Z"
    message: 'failed to get secret ''infra/internal-artifactory-auth'': secrets "internal-artifactory-auth"
      not found'
    observedGeneration: 1
    reason: AuthenticationFailed
    status: "False"
    type: Ready
  observedGeneration: -1


genofire avatar Jul 21 '23 13:07 genofire

If you run flux reconcile helmrepository, does it find the secret, or does the same thing happen?

stefanprodan avatar Jul 21 '23 13:07 stefanprodan

If I trigger it twice:

# flux reconcile source helm -n infra weave-gitops
► annotating HelmRepository weave-gitops in infra namespace
✔ HelmRepository annotated
◎ waiting for HelmRepository reconciliation
✗ HelmRepository reconciliation failed: 'failed to get secret 'infra/internal-artifactory-auth': secrets "internal-artifactory-auth" not found'


# flux reconcile source helm -n infra weave-gitops
► annotating HelmRepository weave-gitops in infra namespace
✔ HelmRepository annotated
◎ waiting for HelmRepository reconciliation
✔ Helm repository is ready


genofire avatar Jul 21 '23 14:07 genofire

This is really strange. Is your Kubernetes API under heavy load, or is etcd having any issues? This may be a caching issue: we have disabled the caching of Secrets in our controllers, but the API does it as well.

stefanprodan avatar Jul 21 '23 14:07 stefanprodan

It is our cloud provider, IONOS, so we have no control over etcd. My problem is that I do not see any logs about a reconciliation of this HelmRepository (I do see them for others), as if it were stalled.

We have had this problem daily for over two months (whenever we create a new cluster and install your default resources there).

genofire avatar Jul 21 '23 14:07 genofire

If you are right that the kube-api is under heavy load, maybe we should add a timeout to the requests there (maybe that is the problem). Here is my code: https://github.com/fluxcd/pkg/pull/627
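
To illustrate the general idea only (this is not the code from the linked pull request, which is written in Go, where a deadline would typically be set with context.WithTimeout), here is a minimal Python sketch of bounding a blocking call with a timeout. The names call_with_timeout and slow_get_secret are hypothetical, and slow_get_secret is just a stand-in for a slow API request.

```python
import concurrent.futures
import time


def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) in a worker thread and give up after timeout_s seconds.

    Raises concurrent.futures.TimeoutError if the call takes too long.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(fn, *args)
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on a call that is still running. The worker thread
        # itself is not cancelled; only the caller stops waiting for it.
        executor.shutdown(wait=False)


def slow_get_secret(name):
    # Hypothetical stand-in for a request to an overloaded API server.
    time.sleep(0.5)
    return {"name": name}


try:
    call_with_timeout(slow_get_secret, 0.05, "internal-artifactory-auth")
except concurrent.futures.TimeoutError:
    print("request timed out, retry later")
```

With a bound like this, a hung request would surface as an error that can be retried on the next interval instead of blocking the caller indefinitely.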

genofire avatar Aug 14 '23 13:08 genofire