
Added basic liveness- & readinessProbe for awx-web & awx-task containers

Open · moonrail opened this issue 2 years ago · 24 comments

Adds basic livenessProbe & readinessProbe for containers web & task. See issue #926 for background on this.

Removes the env var AWX_SKIP_MIGRATIONS, as it was removed from the launch scripts in AWX 18.0.0 and is no longer used.

As discussed in this PR: adds an initContainer database-migration that applies migrations, or waits if migrations are already being applied elsewhere. This way no launch script can interfere and no AWX pod can come online before migrations have been applied. Removes the awx-operator-based db migration handling from the installer, as at that point the containers have already done the job. As a consequence, wait_timeout had to be increased in the installer for the AWX pods deployment to account for the longer initial startup time when many/all migrations have to be applied. I've also copied the CA volume mounts from the web & task containers to this new initContainer, as they may be required depending on the SSL certificate used by an (external) database host.
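
For illustration, a minimal sketch of what such basic probes could look like on the awx-web container; the /api/v2/ping/ endpoint, port 8052 and the timing values here are assumptions for the sketch, not necessarily the values used in this PR:

    containers:
      - name: awx-web
        # hypothetical probe values, for illustration only
        readinessProbe:
          httpGet:
            path: /api/v2/ping/
            port: 8052
          periodSeconds: 10
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /api/v2/ping/
            port: 8052
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3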

moonrail avatar May 19 '22 14:05 moonrail

Regarding the test failure: Task "Create the awx.ansible.com/v1alpha1.AWX": Resource creation timed out

This does not look like a failure introduced by these changes, but rather like the test K8s cluster not being reachable.

moonrail avatar May 20 '22 08:05 moonrail

@rooftopcellist I like this. I know you had expressed concern about not seeing the migration screen. But I don't care about that, and I don't think most people would. In OCP I don't even think the route will become available until the service is up, and the service won't be up until the pods are ready. So the route will become available when the application is ready and available -- which is the desired state.

kdelee avatar May 24 '22 14:05 kdelee

@moonrail I don't agree that the failure is due to the tests not reaching the k8s cluster.

the test: https://github.com/ansible/awx-operator/blob/3ac0232e89e3fc5a1aa5358ad84d54bb3957d982/molecule/default/tasks/awx_test.yml#L2-L12

the output:

 TASK [Create the awx.ansible.com/v1alpha1.AWX] *********************************
  task path: /home/runner/work/awx-operator/awx-operator/molecule/default/tasks/awx_test.yml:2
  redirecting (type: action) kubernetes.core.k8s to kubernetes.core.k8s_info
  redirecting (type: action) kubernetes.core.k8s to kubernetes.core.k8s_info
  <127.0.0.1> ESTABLISH LOCAL CONNECTION FOR USER: runner
  <127.0.0.1> EXEC /bin/sh -c 'echo ~runner && sleep 0'
  <127.0.0.1> EXEC /bin/sh -c '( umask 77 && mkdir -p "` echo /home/runner/.ansible/tmp `"&& mkdir "` echo /home/runner/.ansible/tmp/ansible-tmp-1653055913.816676-8733-20931721321062 `" && echo ansible-tmp-1653055913.816676-8733-20931721321062="` echo /home/runner/.ansible/tmp/ansible-tmp-1653055913.816676-8733-20931721321062 `" ) && sleep 0'
  Using module file /home/runner/.ansible/collections/ansible_collections/kubernetes/core/plugins/modules/k8s.py
  <127.0.0.1> PUT /home/runner/.ansible/tmp/ansible-local-8725yojx4uon/tmpqexu8m9v TO /home/runner/.ansible/tmp/ansible-tmp-1653055913.816676-8733-20931721321062/AnsiballZ_k8s.py
  <127.0.0.1> EXEC /bin/sh -c 'chmod u+x /home/runner/.ansible/tmp/ansible-tmp-1653055913.816676-8733-20931721321062/ /home/runner/.ansible/tmp/ansible-tmp-1653055913.816676-8733-20931721321062/AnsiballZ_k8s.py && sleep 0'
  <127.0.0.1> EXEC /bin/sh -c '/opt/hostedtoolcache/Python/3.8.12/x64/bin/python /home/runner/.ansible/tmp/ansible-tmp-1653055913.816676-8733-20931721321062/AnsiballZ_k8s.py && sleep 0'
  <127.0.0.1> EXEC /bin/sh -c 'rm -f -r /home/runner/.ansible/tmp/ansible-tmp-1653055913.816676-8733-20931721321062/ > /dev/null 2>&1 && sleep 0'
  fatal: [localhost]: FAILED! => {
      "changed": true,
      "duration": 902,
      "invocation": {
          "module_args": {
              "api_key": null,
              "api_version": "v1",
              "append_hash": false,
              "apply": false,
              "ca_cert": null,
              "client_cert": null,
              "client_key": null,
              "context": null,
              "delete_options": null,
              "force": false,
              "host": null,
              "kind": null,
              "kubeconfig": null,
              "merge_type": null,
              "name": null,
              "namespace": "osdk-test",
              "password": null,
              "persist_config": null,
              "proxy": null,
              "resource_definition": {
                  "apiVersion": "awx.ansible.com/v1beta1",
                  "kind": "AWX",
                  "metadata": {
                      "name": "example-awx",
                      "namespace": "osdk-test"
                  },
                  "spec": {
                      "ee_resource_requirements": {
                          "requests": {
                              "cpu": "200m",
                              "memory": "32M"
                          }
                      },
                      "ingress_annotations": "kubernetes.io/ingress.class: nginx\n",
                      "ingress_type": "ingress",
                      "postgres_init_container_resource_requirements": {},
                      "postgres_resource_requirements": {},
                      "redis_resource_requirements": {},
                      "task_resource_requirements": {
                          "requests": {
                              "cpu": "100m",
                              "memory": "32M"
                          }
                      },
                      "web_resource_requirements": {
                          "requests": {
                              "cpu": "100m",
                              "memory": "32M"
                          }
                      }
                  }
              },
              "src": null,
              "state": "present",
              "template": null,
              "username": null,
              "validate": null,
              "validate_certs": null,
              "wait": true,
              "wait_condition": {
                  "reason": "Successful",
                  "status": "True",
                  "type": "Running"
              },
              "wait_sleep": 5,
              "wait_timeout": 900
          }
      },
      "method": "create",
      "msg": "Resource creation timed out",
      "result": {
          "apiVersion": "awx.ansible.com/v1beta1",
          "kind": "AWX",
          "metadata": {
              "creationTimestamp": "2022-05-20T14:11:54Z",
              "generation": 1,
              "labels": {
                  "app.kubernetes.io/component": "awx",
                  "app.kubernetes.io/managed-by": "awx-operator",
                  "app.kubernetes.io/name": "example-awx",
                  "app.kubernetes.io/operator-version": "",
                  "app.kubernetes.io/part-of": "example-awx"
              },
              "managedFields": [
                  {
                      "apiVersion": "awx.ansible.com/v1beta1",
                      "fieldsType": "FieldsV1",
                      "fieldsV1": {
                          "f:status": {
                              ".": {},
                              "f:conditions": {}
                          }
                      },
                      "manager": "ansible-operator",
                      "operation": "Update",
                      "subresource": "status",
                      "time": "2022-05-20T14:11:54Z"
                  },
                  {
                      "apiVersion": "awx.ansible.com/v1beta1",
                      "fieldsType": "FieldsV1",
                      "fieldsV1": {
                          "f:metadata": {
                              "f:labels": {
                                  ".": {},
                                  "f:app.kubernetes.io/component": {},
                                  "f:app.kubernetes.io/managed-by": {},
                                  "f:app.kubernetes.io/name": {},
                                  "f:app.kubernetes.io/operator-version": {},
                                  "f:app.kubernetes.io/part-of": {}
                              }
                          },
                          "f:spec": {
                              ".": {},
                              "f:admin_user": {},
                              "f:create_preload_data": {},
                              "f:ee_resource_requirements": {
                                  ".": {},
                                  "f:requests": {
                                      ".": {},
                                      "f:cpu": {},
                                      "f:memory": {}
                                  }
                              },
                              "f:garbage_collect_secrets": {},
                              "f:image_pull_policy": {},
                              "f:ingress_annotations": {},
                              "f:ingress_type": {},
                              "f:loadbalancer_port": {},
                              "f:loadbalancer_protocol": {},
                              "f:nodeport_port": {},
                              "f:postgres_init_container_resource_requirements": {},
                              "f:postgres_resource_requirements": {},
                              "f:projects_persistence": {},
                              "f:projects_storage_access_mode": {},
                              "f:projects_storage_size": {},
                              "f:redis_resource_requirements": {},
                              "f:replicas": {},
                              "f:route_tls_termination_mechanism": {},
                              "f:task_privileged": {},
                              "f:task_resource_requirements": {
                                  ".": {},
                                  "f:requests": {
                                      ".": {},
                                      "f:cpu": {},
                                      "f:memory": {}
                                  }
                              },
                              "f:web_resource_requirements": {
                                  ".": {},
                                  "f:requests": {
                                      ".": {},
                                      "f:cpu": {},
                                      "f:memory": {}
                                  }
                              }
                          }
                      },
                      "manager": "OpenAPI-Generator",
                      "operation": "Update",
                      "time": "2022-05-20T14:11:57Z"
                  }
              ],
              "name": "example-awx",
              "namespace": "osdk-test",
              "resourceVersion": "2753",
              "uid": "863cc513-58ea-41e2-a4fc-2fc3cf7d7f2e"
          },
          "spec": {
              "admin_user": "admin",
              "create_preload_data": true,
              "ee_resource_requirements": {
                  "requests": {
                      "cpu": "200m",
                      "memory": "32M"
                  }
              },
              "garbage_collect_secrets": false,
              "image_pull_policy": "IfNotPresent",
              "ingress_annotations": "kubernetes.io/ingress.class: nginx\n",
              "ingress_type": "ingress",
              "loadbalancer_port": 80,
              "loadbalancer_protocol": "http",
              "nodeport_port": 30080,
              "postgres_init_container_resource_requirements": {},
              "postgres_resource_requirements": {},
              "projects_persistence": false,
              "projects_storage_access_mode": "ReadWriteMany",
              "projects_storage_size": "8Gi",
              "redis_resource_requirements": {},
              "replicas": 1,
              "route_tls_termination_mechanism": "Edge",
              "task_privileged": false,
              "task_resource_requirements": {
                  "requests": {
                      "cpu": "100m",
                      "memory": "32M"
                  }
              },
              "web_resource_requirements": {
                  "requests": {
                      "cpu": "100m",
                      "memory": "32M"
                  }
              }
          },
          "status": {
              "conditions": [
                  {
                      "lastTransitionTime": "2022-05-20T14:25:04Z",
                      "reason": "Failed",
                      "status": "False",
                      "type": "Failure"
                  },
                  {
                      "lastTransitionTime": "2022-05-20T14:25:04Z",
                      "reason": "Running",
                      "status": "True",
                      "type": "Running"
                  }
              ]
          }
      }
  }

The key part being:

          "status": {
              "conditions": [
                  {
                      "lastTransitionTime": "2022-05-20T14:25:04Z",
                      "reason": "Failed",
                      "status": "False",
                      "type": "Failure"
                  },
                  {
                      "lastTransitionTime": "2022-05-20T14:25:04Z",
                      "reason": "Running",
                      "status": "True",
                      "type": "Running"
                  }
              ]

I wonder if we need to put retries on that task. I'm not an expert on these molecule tests, so I may be wrong. My impression is that we expect it not to be running for a period of time and then start running.

kdelee avatar May 24 '22 17:05 kdelee

Yes, @kdelee is right. These 3 conditions in the molecule test linked above (which CI runs) will not be met until the readiness probe is successful. The probe will not be successful until migrations are complete, which takes significantly longer than just waiting for the pod to reach the running state (which is what the test waited for previously).

    wait_condition:
      type: Running
      reason: Successful
      status: "True"

I would suggest raising the timeout to 1500s (25 min). Retry logic isn't needed; wait_condition takes care of this for us.
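
For reference, the relevant change in the molecule task would look roughly like this, based on the module arguments visible in the log above (the definition lookup is a placeholder, not the exact task contents):

    - name: Create the awx.ansible.com/v1alpha1.AWX
      kubernetes.core.k8s:
        state: present
        namespace: osdk-test
        definition: "{{ awx_cr_definition }}"  # placeholder for the CR used by the test
        wait: true
        wait_sleep: 5
        wait_timeout: 1500  # raised from 900 (15 min) to 1500 (25 min)
        wait_condition:
          type: Running
          reason: Successful
          status: "True"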

rooftopcellist avatar May 25 '22 04:05 rooftopcellist

Ah, now I understand the molecule testing setup - I did not fully grasp it before. Increased wait_timeout to 1500s as suggested.

moonrail avatar May 25 '22 11:05 moonrail

@rooftopcellist @kdelee

Initial db migrations take about 5 min on my local minikube setup and on our K8s test environment. The duration of db migrations can vary depending on the setup, so it could be considerably longer. I tested these changes before opening this PR on our setup with an external db that had no pending migrations, so I didn't notice the amount of time they require.

The currently proposed livenessProbe would restart the container after about 95 seconds, so the deployment cannot succeed when migrations are required.

The readinessProbe, however, is not problematic as it is; I had misunderstood failureThreshold as being absolute. As per the docs, readinessProbes are executed in a loop until a container is ready, not only once: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#configure-probes

I think the current way of running and waiting for migrations on container startup cannot be kept as is for AWX running in K8s.

With the current AWX startup scripts, a livenessProbe cannot be implemented safely without waiting an unreasonably long time before even probing (what is safe: 15 min, 25 min, 40 min? This cannot be the solution).

Startup scripts:
Web: https://github.com/ansible/awx/blob/21.0.0/tools/ansible/roles/dockerfile/files/launch_awx.sh
Task: https://github.com/ansible/awx/blob/21.0.0/tools/ansible/roles/dockerfile/files/launch_awx_task.sh

Both use: https://github.com/ansible/awx/blob/21.0.0/tools/ansible/roles/dockerfile/files/wait-for-migrations

If migrations were applied only within an initContainer, the liveness- & readinessProbes would only become relevant once migrations had been applied successfully, as the actual container is not started until the initContainer succeeds.

An initContainer would have to be added to both task & web, as both can run migrations on their startup and both are started simultaneously. The initContainer's command could be awx-manage migrate || /usr/bin/wait-for-migrations. I'm currently not sure whether Django returns a non-zero return code when another migration is already running, as the docs do not mention it: https://docs.djangoproject.com/en/4.0/ref/django-admin/#django-admin-migrate The command example above assumes it returns non-zero on blocking migrations. It does return 0 when no migrations are required.

Waiting for migration completion is already done on startup of web & task, so every initContainer run after the first one could simply exit without waiting for migrations to complete. But I'd suggest using the wait-for-migrations script instead (as in the example command above), so that livenessProbes would not have to take migrations into account at all; everything would be done by the time the actual containers are started.
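
Roughly, such an initContainer on the web & task pod spec could look like the following sketch (the container name, image reference and bash wrapper are illustrative placeholders, not final contents):

    initContainers:
      - name: database-migration
        # illustrative only - same AWX image as the web/task containers
        image: "{{ awx_image }}:{{ awx_image_version }}"
        command:
          - /bin/bash
          - -c
          # apply migrations, or wait if another pod is already applying them
          - "awx-manage migrate || /usr/bin/wait-for-migrations"
        # CA volume mounts from web/task would be copied here if an external
        # database requires a custom SSL certificate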

What do you think?

moonrail avatar May 25 '22 18:05 moonrail

@moonrail your suggestion to run migrations in an init container sounds like a good idea to me. @shanemcd and @Zokormazo may be interested in this suggestion.

It sounds like you would be interested in implementing that in this PR so that the readiness/liveness probes can work sanely?

kdelee avatar May 26 '22 14:05 kdelee

A really cool thing to do would be: while the migrations init container is running, have another container serving that "migrations are running" screen that the service points to. Then once the actual app is up (readiness checks pass), we patch the service to point to the controller API. I know this contradicts what I said the other day about "not caring" if I see the migration screen... still trying to figure out the best compromise to give human users feedback that the app is still "installing" (i.e. not usable yet, but ostensibly healthy), and to give automation the right feedback that the app is not yet ready.

kdelee avatar May 26 '22 15:05 kdelee

@kdelee Yes, I could implement my suggestion, but your idea of showing a migration screen exceeds my knowledge of AWX, so I'm not sure I'd be able to implement it as desired.

moonrail avatar May 27 '22 08:05 moonrail

@moonrail I think just doing the init container + the readiness/liveness checks is a great start and the best way to begin. If it is desirable to show some "awx is starting" screen before those pass, that seems like a distinct request.

kdelee avatar May 31 '22 18:05 kdelee

@kdelee Sounds good, I'll try to get it added to this PR this week.

moonrail avatar Jun 01 '22 08:06 moonrail

@moonrail I agree with your idea to run the migrations in an initContainer; I think that is a great approach. FYI, the CI failure was resource related and was fixed here. Rebasing against devel will pull in the fix.

rooftopcellist avatar Jun 01 '22 23:06 rooftopcellist

@kdelee @rooftopcellist Pushed the discussed changes and a little cleanup of the awx-operator-based migration handling. Updated this PR's description with an explanation of these changes.

moonrail avatar Jun 03 '22 18:06 moonrail

@Zokormazo any way we can get a downstream test of this change?

kdelee avatar Jun 21 '22 18:06 kdelee

@moonrail Hello! We are in the process of working on merging this PR. This branch will need to be rebased prior to that. Would you mind running that rebase on your end? If not, we can certainly work toward an alternative resolution. Thank you for your time!

djyasin avatar Jul 01 '22 18:07 djyasin

Hi @djyasin, the branch is now rebased against devel. I did not run the tests again, but the changes look the same as before, even though some of the Ansible db-migration code this PR removes has moved from roles/installer/tasks/main.yml to roles/installer/tasks/installer.yml.

moonrail avatar Jul 04 '22 09:07 moonrail

Thank you so much @moonrail!! I will try running it through tests on our end now that this has been rebased.

djyasin avatar Jul 05 '22 13:07 djyasin

I'm a little worried about this change. What happens when scaling the deployment to more than 1? This is going to cause the migrations to run every time a new pod is created, right?

shanemcd avatar Jul 05 '22 17:07 shanemcd

@shanemcd Correct. Any pod launch will run the database-migration initContainer. Django checks for required migrations before doing anything and exits gracefully with rc 0 when none are pending. From my tests this seems to work perfectly fine in Django.

When running e.g. 3 replicas and upgrading AWX, you'll get a 4th pod that will run migrations on the database. The current pods continue to run until the new pod is ready (and therefore migrations were applied) and are then replaced as well by K8s. So yes, there will be running instances while migrations are being applied when upgrading AWX on a running cluster. I know of other Django apps, e.g. NetBox, that are handled this way as well when run in K8s. Depending on the migrations there can be errors while upgrading to a new version, but I'd argue that upgrading a database and application to a new version is fairly expected not to be 100% seamless. In the case of AWX, upgrading is even more "problematic" when jobs are running. That is why on our end an AWX upgrade includes draining (disabling control instances in AWX), waiting for all jobs in instance groups to complete, and only then upgrading AWX.

moonrail avatar Jul 05 '22 18:07 moonrail

Re: draining AWX before upgrade being problematic if jobs are running: yes! This is a problem. The introduction of a "maintenance mode" that control nodes could be placed in to drain them before an upgrade has been discussed. Right now it is left to savvy administrators to figure out (as you have).

To @shanemcd's concern: what about a fresh install with replicas: 2 or greater? Is there a race condition on which pod's init container decides there are migrations to run? Is there a way to ensure that only 1 pod gets the "lock" on running migrations when both init containers reach that stage at the same time?

kdelee avatar Jul 05 '22 20:07 kdelee

@kdelee This should be no problem. We certainly do not experience problems with a similar setup on other Django apps in K8s. Django runs migrations in transactions by default, so PostgreSQL only allows one migration to run at a time, and any potential race condition (the case where Django checks for migrations, does not see any running, and then tries to run them) is therefore safe. The initContainer's startup command awx-manage migrate || /usr/bin/wait-for-migrations will cause pods that do not hold the "migration lock" to wait for the running migrations.

moonrail avatar Jul 06 '22 08:07 moonrail

(Quoting @moonrail's comment from May 25 '22 above regarding moving migrations into an initContainer and the probe timing problem.)

We discussed this internally and think that a startup probe would do the job for that case.

Something like this for web & task:

          startupProbe:
            exec:
              # exec probes do not run in a shell, so the pipeline is wrapped in bash -c;
              # the probe succeeds once showmigrations lists no unapplied ("[ ]") migrations
              command:
                - /bin/bash
                - -c
                - '! /usr/bin/awx-manage showmigrations | grep "\[ \]"'
            initialDelaySeconds: 5
            periodSeconds: 3
            failureThreshold: 900
            successThreshold: 1
            timeoutSeconds: 2

erz4 avatar Jan 22 '23 07:01 erz4

Hello! Any news on this? It would be a nice addition. Thanks

fciava avatar Apr 04 '23 13:04 fciava

As this is still an issue and still open: is there any foreseeable activity/progress on this? I'd like to see this basic feature implemented.

Toothwitch avatar Sep 11 '23 07:09 Toothwitch

Yeah, no, this is not working out. There seems to be no real interest on the maintainers' side in implementing this, so I won't bother anymore and will close this PR, as I am no longer willing to make any changes to it.

Who knows why this basic feature is not desired, maybe to reserve it for other downstream channels, but whatever.

moonrail avatar Mar 05 '24 11:03 moonrail