
Unable to depend on cross-namespace HelmRelease when watch-all-namespaces=false

Open stgrace opened this issue 1 year ago • 6 comments

When a HelmRelease declares a dependency on a HelmRelease in a different namespace, the controller throws an error that it cannot find that HelmRelease when --watch-all-namespaces=false.

I feel like the watch-all-namespaces flag should only scope reconciliation, not dependency checks. In our case the helm-controller is given enough rights to check the status of a HelmRelease in another namespace, but it is blocked by this watch-all-namespaces flag. It's also not possible to remove this flag, since running namespaced controllers is the only way to horizontally scale the helm-controllers in our cluster. A single helm-controller would not be able to reconcile everything properly in most of our clusters.
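
For illustration, here is a minimal sketch of the kind of cross-namespace dependency that triggers the error; all names and namespaces below are made up:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: app
  namespace: apps
spec:
  interval: 5m
  chart:
    spec:
      chart: app
      sourceRef:
        kind: HelmRepository
        name: app-charts
  # Dependency on a HelmRelease living in another namespace; with
  # --watch-all-namespaces=false the controller reports it cannot find it.
  dependsOn:
    - name: cert-manager
      namespace: infra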

stgrace avatar Aug 26 '22 14:08 stgrace

A single helm-controller would not be able to reconcile everything properly in most of our clusters.

Have you tried setting --concurrent=100? This should make helm-controller perform 100 installs/upgrades in parallel. Docs here: https://fluxcd.io/flux/cheatsheets/bootstrap/#increase-the-number-of-workers
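
For reference, a rough sketch of how the flag can be appended through the flux-system kustomization laid out by flux bootstrap (the deployment and file names below are the Flux defaults; double-check against the linked cheatsheet for your version):

# flux-system/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: helm-controller
    patch: |
      # Append the flag to the manager container's args
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --concurrent=100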

stefanprodan avatar Aug 31 '22 14:08 stefanprodan

Hi Stefan,

Thanks for the reply. We have indeed upped the concurrency for the helm-controller, but it would still be nice to be able to run multiple controllers per cluster. I feel the behavior here is a little misleading: the controller is not actively watching the declarative state of its dependencies (as described here: https://github.com/cncf/tag-app-delivery/blob/eece8f7307f2970f46f100f51932db106db46968/operator-wg/whitepaper/Operator-WhitePaper_v1-0.md#operator-design-pattern), but merely checking their status for readiness. Furthermore, the already existing --no-cross-namespace-refs flag could be used by those users who still want to block cross-namespace references.
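
To make the suggestion concrete, a hypothetical sketch of how the two flags could divide responsibilities (both flags exist today, but the combined behavior shown below is the proposal, not how the controller currently works):

# helm-controller container args (proposed semantics)
args:
  - --watch-all-namespaces=false    # only scopes which namespaces are reconciled
  - --no-cross-namespace-refs=true  # opt-in guard for tenants who must not reference other namespaces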

stgrace avatar Sep 01 '22 16:09 stgrace

Hi @stefanprodan,

I'm in a similar situation, and I tried your suggestion. But what resource limits do you suggest? The docs show 2Gi for --concurrent=100, but that seems nowhere near sufficient.

The following causes the helm-controller to get OOMKilled; I think 6Gi is already a stretch.

Args:
      --events-addr=http://notification-controller/
      --concurrent=100
      --watch-all-namespaces=true
      --events-addr=http://notification-controller/
      --log-level=info
      --log-encoding=json
      --enable-leader-election
Limits:
      cpu:     2
      memory:  6Gi
Requests:
      cpu:      1
      memory:   2Gi

There are about 1000 HelmReleases in my cluster with the interval at 5 min, and installs take about 20 seconds. It doesn't crash immediately, only after 20-30 min, sometimes even longer (probably depending on what is being installed).
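
Back-of-the-envelope, using the figures above: 1000 HRs reconciled every 300 s is about 3.3 reconciliations starting per second, and at roughly 20 s per install that means about 3.3 × 20 ≈ 67 installs in flight at any moment, so --concurrent=100 is close to saturated most of the time.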

kvandenhoute avatar Sep 29 '22 14:09 kvandenhoute

@kvandenhoute what you've described sounds like a memory leak. If 1K HRs are installed in 20 seconds but the controller only runs into OOM after half an hour, that makes me think it's a GC issue.

stefanprodan avatar Sep 29 '22 15:09 stefanprodan

@kvandenhoute would you be able to collect a heap profile? Instructions for this are here: https://fluxcd.io/flux/gitops-toolkit/debugging/#collecting-a-profile
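
For anyone following along, the collection boils down to something like the commands below. This is only a sketch based on the linked guide; the 9440 port is an assumption (the controllers' default healthz address, where the pprof handlers are exposed), so follow the guide for the exact steps:

# Port-forward to the helm-controller (flux-system namespace assumed)
kubectl -n flux-system port-forward deploy/helm-controller 9440:9440 &

# Grab a heap profile from the standard Go pprof endpoint
curl -Sk -o heap.out http://localhost:9440/debug/pprof/heap

# Inspect it locally (optional)
go tool pprof heap.out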

hiddeco avatar Sep 29 '22 15:09 hiddeco

Hi @stefanprodan @hiddeco,

First, let me clarify: it's 20 seconds per HelmRelease install, not 20 seconds for 1k HRs. I deleted about 250 HRs, which were re-added a little later by our Kustomizations. I have attached the heap profile you requested, captured about 15 seconds before the controller crashed; it was using 5Gi+ at that time. I also attached the pod definition.

podandheap.zip

kvandenhoute avatar Sep 30 '22 07:09 kvandenhoute