Support helm install --wait for the chart
Hello,
We are used to running helm install / helm upgrade with the --wait option, which is nice because the command only returns once the install / upgrade is actually done, not just "committed".
Unfortunately it doesn't currently play nicely with the chart, because:
- The migrate database job is set to run "post-install" & "post-upgrade" (https://github.com/apache/airflow/blob/master/chart/templates/migrate-database-job.yaml#L35)
- The init containers actually wait for the migrations to be applied on install.
So if the database hasn't been migrated yet, an install / upgrade run with --wait fails: Helm waits for the pods to become ready, while the pods' init containers wait for migrations that only run post-install.
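For reference, a minimal reproduction of the dead-end (release, namespace and chart names are illustrative):

```shell
# Helm waits for pods to become ready, but the migration job is a
# post-install hook, so the pods never become ready and this times out.
helm repo add apache-airflow https://airflow.apache.org
helm upgrade --install airflow apache-airflow/airflow \
  --namespace airflow --create-namespace \
  --wait --timeout 10m
```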
One option would be to run the migrations with pre-install and pre-upgrade hooks instead.
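For illustration, the change on the migration job's annotations would look roughly like this (a sketch against the template linked above, not a tested patch):

```yaml
# chart/templates/migrate-database-job.yaml (sketch)
metadata:
  annotations:
    # before: "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
```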
I'd enable this behavior only when an external database is used (i.e. in a more production-friendly environment, where the --wait option is also more likely to be used). With a database deployed by the chart itself, I don't think this would play nicely, since the database wouldn't exist yet when a pre-install hook runs.
What do you think?
PS: another small question related to this: when bumping Airflow versions (with schema migrations to perform), do we expect the RollingUpdate of pods to work (i.e. are the schemas backward compatible across one Airflow version bump)?
Hi,
First of all, let me thank you for your involvement in this awesome project.
Now onto this issue. I have come to realise that this is also the problem I am encountering, and the one reported in https://github.com/apache/airflow/issues/15340 (see the use of --wait there), where the webserver fails, reporting that no migrations ran.
I could not understand why the Job resources would not get created: I could see the ServiceAccount resources but not the Jobs. I then created YAML files manually from the helm get values command, and they got created and ran successfully. After looking into it more, I realised that the Jobs are Helm hooks and indeed will not be installed when Helm is used with the --wait flag in an install/upgrade command.
I would argue that you cannot make assumptions about how helm upgrade is run - usually it's part of a CI/CD pipeline that applies to potentially many other Helm releases. I don't think you can force people to forgo the --wait flag, especially given its nature: as the OP explains, that flag is particularly critical because it waits for the upgrade to succeed, or rolls back.
I think the solution is to not make these jobs hooks - though I'm not sure why they are hooks in the first place. In the past I have switched from hooks to a normal Job to run my app migrations as part of the normal app deployment. Applying migrations multiple times as a no-op is a regular use case.
I would say that this is a serious issue and should be prioritised to allow a smooth install and upgrade. At the moment, a fresh install for people using --wait will fail.
Some more info on the problem: https://helm.sh/docs/topics/charts_hooks/ says:
> Note that if the --wait flag is set, the library will wait until all resources are in a ready state and will not run the post-install hook until they are ready.
At the same time, the webserver & scheduler wait for the migrations to have run, so we have no READY pods and it's a dead-end: eventually the Helm --wait will time out and roll back.
Thank you for your consideration
I would like to look into this @kaxil
In the airflow-helm chart, we implemented a value called helmWait to fix this; it simply removes the post-install hook from our db migration job.
However, this can result in a "field is immutable" error, as described in our issue: https://github.com/airflow-helm/charts/issues/230
I just got bit by this. I was running it as part of a CI/CD pipeline using the same arguments as my other Helm releases (with --wait). It was confusing why this was happening; I noticed that no Jobs were being deployed, started removing arguments one by one until removing --wait worked, and then found this issue :)
This also precludes the use of the --atomic option, since that internally sets the wait flag.
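For instance, the following hits the same dead-end even though --wait isn't passed explicitly (release/chart names illustrative):

```shell
# --atomic implies --wait and rolls back on failure, so the
# post-install migration hook problem applies here too
helm upgrade --install airflow apache-airflow/airflow --atomic --timeout 10m
```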
@kaxil @jedcunningham -> I do agree we have to fix it in the official chart; it's kind of unexpected behaviour and I am sure we can do better than post-install :D
Yeah, this is something we have on our list to fix soon - we hate post-install !!!
This will be either Helm Chart 1.3.0 or 2.0.0
To update everyone, the "user-community" airflow chart no longer uses a helmWait flag to address this issue. As of chart version 8.5.0, we simply have Deployments for everything that previously ran as a post-install Job (like db-migrations).
NOTE: for users who want to still run a post-install job (rather than a deployment), we added the airflow.dbMigrations.runAsJob value
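In values terms that looks something like the following (the value path is taken from the note above; the layout and default are my assumption):

```yaml
# user-community chart values (sketch)
airflow:
  dbMigrations:
    # set to true to run db migrations as a post-install Job
    # instead of a Deployment
    runAsJob: true
```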
How is this issue going? With Airflow Helm Chart 1.6.0 deployed by Argo CD, we followed the instructions from the official documents and installation is ok, but when we update the values.yaml it leads to a "field is immutable" error on the airflow-run-airflow-migrations Job. How can we work around this issue to upgrade the Helm release with Argo CD? Any hint?
Meanwhile, some secrets, including airflow-fernet-key, airflow-broker-url, airflow-redis-password and git-credentials, were deleted (in fact, after clicking the sync button they are all recreated successfully at first, then 10 seconds later they are all deleted and never created again), so no other pods can restart successfully because the git-credentials secret is missing.
I am running into the same issue. My solution was to set this value in the helm chart:
migrateDatabaseJob:
  useHelmHooks: false
This solves the problem raised in this issue, since the migrate db job no longer uses Helm hooks.
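Assuming the official chart and repo names, the same fix can be applied on the command line (a sketch; verify the value path against your chart version):

```shell
helm upgrade --install airflow apache-airflow/airflow \
  --wait \
  --set migrateDatabaseJob.useHelmHooks=false
```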
However, I now sometimes see an issue when running helm upgrade --install to patch an existing release: Helm tries to patch the Job resource, which cannot be done since Job specs are immutable, and the upgrade fails. What should happen is either the Job gets a unique name with each installation, so that Helm isn't trying to edit an existing Job, or the Job is destroyed after it completes. One of the Helm hooks that gets added to the Job is
"helm.sh/hook-delete-policy": "before-hook-creation,hook-succeeded"
which I think would handle deleting the Job upon completion, but we can't use the Helm hooks for the reasons above. Kubernetes does offer a field called .spec.ttlSecondsAfterFinished that, when specified, deletes the Job the given number of seconds after it completes (see https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/). This is a pretty good solution, though installs done within a window shorter than this value might still run into the same problem.
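A minimal sketch of that field on a migration Job (the name, image tag and command are illustrative, not the chart's actual template):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: airflow-run-airflow-migrations
spec:
  # delete this Job (and its pods) 5 minutes after it finishes
  ttlSecondsAfterFinished: 300
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: run-airflow-migrations
          image: apache/airflow:2.4.3  # illustrative tag
          args: ["airflow", "db", "upgrade"]
```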
The solution I propose here is to set a Helm pre-install hook on the database deployment as well as on the db migrations, and to give the database a lower hook weight than the migrations, so the database is created first, then the migrations run, then the webserver etc. pods come up. Then I think we could safely remove the wait-for-db-migrations initContainer from the deployments.
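As a sketch of that ordering (annotation values are hypothetical; Helm creates lower hook weights first and waits for them to be ready):

```yaml
# database resource (sketch)
metadata:
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-10"  # created and ready before the migrations
---
# migration job (sketch)
metadata:
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-weight": "-5"   # runs after the database hook
```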
Can you please open a new issue about it @calebwoofenden. I am closing this one as it is old, and disabling Helm hooks is described in the docs - https://airflow.apache.org/docs/helm-chart/stable/index.html#installing-the-chart-with-argo-cd-flux-or-terraform - however your problem likely deserves a new issue.
Done: https://github.com/apache/airflow/issues/27561