zarf icon indicating copy to clipboard operation
zarf copied to clipboard

post-install hooks do not get ran until all chart resources are ready/healthy

Open corang opened this issue 2 years ago • 1 comments

Environment

Device and OS: All App version: 0.21.3 Kubernetes distro being used: k3d/k3s Other:

Steps to reproduce

  1. Get a helm chart that utilizes a post-install hook for a resource
  2. Modify helm chart that a pod doesn't become healthy until that post-install hook is ran
  3. Test using just helm - post-install hook runs once all resources are created and all pods end up healthy
  4. Create zarf package using modified helm chart
  5. Deploy package

Expected result

Post-install hook resources are created once all other resources have been created, pod that depends on post-install resource becomes healthy.

Actual Result

Post-install hook never runs so the zarf package never succeeds.

Severity/Priority

Medium - Blocking but has a workaround.

Additional Context

This can be worked around by adding a dataInjection since zarf doesn't wait for resources to be ready before running post-install hook resources.

corang avatar Sep 22 '22 18:09 corang

That's a clever workaround (and fairly new 😂). I actually think we should just expose this as an option in the zarf.yaml for when it's needed. Thoughts @defenseunicorns/zarf?

jeff-mccoy avatar Sep 22 '22 18:09 jeff-mccoy

It would be a relatively simple change in src/internal/helm/chart.go Is there a reason we don't flip the behavior though? The default in helm is opposite the default in Zarf currently (which may cause issues with folks trying something in Zarf and Helm and wondering why one doesn't work and the other does as in this case).

Racer159 avatar Sep 22 '22 19:09 Racer159

what line are you referring to?

jeff-mccoy avatar Sep 22 '22 19:09 jeff-mccoy

If you're referring to our use of --wait, that's because the expectation in how all zarf components works is that the next component won't run until the previous is ready. Changing that default behavior would kind of break * everything *. That doesn't matter for helm where it doesn't know about multiple helm charts chained together, it does matter for zarf. Think the easy answer is to just expose the option in zarf.yaml. This is actually just another poor helm design from older versions that seems to be carried over: https://github.com/helm/helm/issues/5118.

jeff-mccoy avatar Sep 22 '22 19:09 jeff-mccoy

Shouldn't the use of --wait in zarf still allow the post-install hook to run?

The library loads the resulting resources into Kubernetes. Note that if the --wait flag is set, the library will wait until all resources are in a ready state and will not run the post-install hook until they are ready.

link

brandtkeller avatar Sep 22 '22 21:09 brandtkeller

Not according to the linked issue above. And that's all we are doing is setting wait: true, this is helm internal behavior.

jeff-mccoy avatar Sep 22 '22 22:09 jeff-mccoy

This should be investigated further, I created/deployed this package: (helm chart created to try and reproduce the issue)

kind: ZarfPackageConfig
metadata:
  name: "init-package-test"

components:
  - name: test
    description: "testing hooks"
    images:
      - docker.io/nginx:1.16.0
      - docker.io/busybox:latest
    charts:
      - name: hookchart
        url: https://github.com/brandtkeller/testchart.git
        version: 0.1.0
        namespace: test

Zarf properly waits for both the deployment pod to be ready AND the completion of the post-install hook job.

? Deploy the test component? Yes
                                                                                       
  📦 TEST COMPONENT                                                                    
                                                                                       

  ✔  Loading the Zarf State from the Kubernetes cluster                                                                                                                                                                                               
  ✔  Creating port forwarding tunnel at http://127.0.0.1:59316                                                                                                                                                                                        
  ✔  Storing images in the zarf registry                                                                                                                                                                                                              
  ✔  Processing helm chart hookchart:0.1.0 from https://github.com/brandtkeller/testchart.git                                                                                                                                                         
  ✔  Zarf deployment complete

brandtkeller avatar Sep 22 '22 23:09 brandtkeller

output when running with a post-install job (before it finishes):

  ✔  Loading the Zarf State from the Kubernetes cluster                                                                                                                                                                                               
  ✔  Creating port forwarding tunnel at http://127.0.0.1:60538                                                                                                                                                                                        
  ✔  Storing images in the zarf registry                                                                                                                                                                                                              
  ⠸  testjob: Jobs active: 1, jobs failed: 0, jobs succeeded: 0 (19s)   

brandtkeller avatar Sep 23 '22 00:09 brandtkeller

If you're referring to our use of --wait, that's because the expectation in how all zarf components works is that the next component won't run until the previous is ready. Changing that default behavior would kind of break * everything *. That doesn't matter for helm where it doesn't know about multiple helm charts chained together, it does matter for zarf. Think the easy answer is to just expose the option in zarf.yaml. This is actually just another poor helm design from older versions that seems to be carried over: helm/helm#5118.

We could add another mechanism to Zarf to wait for the chart between components in Zarf-land instead of helm itself without breaking everything no?

Racer159 avatar Sep 23 '22 00:09 Racer159

No I don't think so, it's a really non-trivial problem. I spent weeks debugging/analyzing various deployment paradigms and helm's was the most stable for ensuring reconciliation/completion prior to moving on. Helm hooks are gross and what you use when you don't have smarter orchestration such as flux/argo/fleet/knative or some operator framework. Changing this would be a very significant change to Zarf that I am not comfortable for what is an edge case we haven't seen before.

I do support exposing this control to the user though as that seems to easily meet the needs here without changing behavior for every system. Keep in mind even raw manifests/kustomizations are actually turned into helm charts for apply.

Separately, I also see value in more advanced wait logic independent of Helm for things that say are controlled by flux & friends. Something like a wait condition in a component to wait for some arbitrary K8s resource to exist or be ready before proceeding. e.g. "wait for deployment xx to be healthy" or "wait for crd xx to exist". However, I see that as an advanced feature that is separate from this particular concern.

jeff-mccoy avatar Sep 23 '22 01:09 jeff-mccoy

Going to try to implement another field into the zarf.yaml to conditionalize the wait flag to helm

corang avatar Sep 29 '22 19:09 corang