
K3s issue with Operator version 0.29.0

Open antuelle78 opened this issue 2 years ago • 7 comments

Please confirm the following

  • [X] I agree to follow this project's code of conduct.
  • [X] I have checked the current issues for duplicates.
  • [X] I understand that the AWX Operator is open source software provided for free and that I might not receive a timely response.

Bug Summary

When deploying AWX 21.6.0 with the latest version of the operator (0.29.0) on K3s, the awx pod gets stuck in "Init:CrashLoopBackOff" status.

AWX 21.5.0 deployed with operator 0.28.0 works as expected.

AWX Operator version

0.29.0

AWX version

21.6.0

Kubernetes platform

other (please specify in additional information)

Kubernetes/Platform version

v1.21.9+k3s1, v1.25.0+k3s1

Modifications

no

Steps to reproduce

https://github.com/antuelle78/deploy-awx-k3s-ubuntu

Set k3s_version: v1.25.0+k3s1 (I tested multiple versions as far back as 1.21, same results), operator_version: 0.29.0, and awx_version: 21.6.0 in https://github.com/antuelle78/deploy-awx-k3s-ubuntu/blob/main/roles/deploy-awx-k3s-ubuntu/defaults/main.yml

Then run the deploy.yml playbook.
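
For reference, those overrides correspond roughly to the following excerpt of that defaults file (only the variables mentioned above are shown; the rest of the file is omitted):

```yaml
# roles/deploy-awx-k3s-ubuntu/defaults/main.yml (excerpt; only the values mentioned above)
k3s_version: v1.25.0+k3s1
operator_version: 0.29.0
awx_version: 21.6.0
```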

Expected results

AWX gets deployed and I can access the web interface at hostIP:30080

Actual results

AWX pod gets stuck in Init:CrashLoopBackOff state.

Additional information

The logs included were saved while upgrading from a working configuration (AWX 21.5.0, operator 0.28.0) to 21.6.0/0.29.0.

Operator Logs

https://dpaste.org/pMbqx

antuelle78 avatar Sep 19 '22 17:09 antuelle78

I see the same issue in RHEL8.

RHEL8 version: 8.6
K3S Versions:
Client Version: v1.24.4+k3s1
Kustomize Version: v4.5.4
Server Version: v1.24.4+k3s1

Installing RELEASE_TAG=0.28.0 works fine, but with RELEASE_TAG=0.29.0, the pod gets stuck in Init:CrashLoopBackOff state.

This happens both when doing a clean install per the process from https://computingforgeeks.com/install-and-configure-ansible-awx-on-centos/ and when doing an upgrade from 0.28.0 per the process at https://computingforgeeks.com/how-to-upgrade-ansible-awx-running-in-kubernetes/

doubletwist13 avatar Sep 19 '22 19:09 doubletwist13

@TheRealHaoLiu @shanemcd @rooftopcellist Hi, I think the root cause comes from #1012.

  • In 0.28.0 or earlier: centos:stream8 is used as the image for the init container
    • centos:stream8 runs as root, so chmod and chgrp are allowed on any directories, regardless of their perms/owners
  • In 0.29.0, awx-ee is used for the init container to allow runtime modification of the receptor config via #1012
    • awx-ee runs with UID 1000
    • chmod and chgrp in the init container won't be allowed if UID 1000 has no perms for /var/lib/awx/projects
    • In some situations, the PV will be mounted with root perms, so the init container will fail; as in this issue, the default storage provisioner for K3s (local-path-provisioner) always mounts volumes with root perms, since it just creates a hostPath-based volume and hostPath-based volumes do not respect securityContext.

In the current implementation only one init container is launched, but we could consider defining two init containers: one for certs and receptor using awx-ee, and one for chmod/chgrp using centos:stream8 (or any other image with root perms). Alternatively, making the init container run as root by appending a securityContext under initContainers would also be acceptable.
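
As a rough illustration of that second option (this is not the operator's actual template; the container name, image tag, and command here are placeholders), the rendered task pod spec could carry something like:

```yaml
# Hypothetical excerpt of the AWX task pod spec; names and command are illustrative only
spec:
  initContainers:
    - name: init
      image: quay.io/ansible/awx-ee:latest
      securityContext:
        runAsUser: 0   # run the init container as root so chmod/chgrp on the projects volume succeed
      command: ["sh", "-c", "chmod 775 /var/lib/awx/projects && chgrp 1000 /var/lib/awx/projects"]
      volumeMounts:
        - name: awx-projects
          mountPath: /var/lib/awx/projects
```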

kurokobo avatar Sep 19 '22 22:09 kurokobo

@antuelle78 Thanks for filing this issue. As a workaround, you can:

  • Use 0.28.0 instead
  • Use a pre-defined hostPath-based PV/PVC with pre-defined perms, as my guide does (see the sketch after this list)
  • Modify the perms of the actual directory for your PV under /var/lib/rancher/k3s/storage so that UID 1000 can access it
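
A minimal sketch of such a pre-defined hostPath-based PV/PVC is shown below (the names, path, and size are placeholders rather than the values from the guide; the host directory has to be created and chowned to UID 1000 beforehand):

```yaml
# Hypothetical PV/PVC pair for the projects directory; adjust names, path, and size
apiVersion: v1
kind: PersistentVolume
metadata:
  name: awx-projects-volume
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 2Gi
  storageClassName: awx-projects-volume
  hostPath:
    path: /data/projects   # chown 1000:0 this directory on the host first
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: awx-projects-claim
  namespace: awx
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: awx-projects-volume
  resources:
    requests:
      storage: 2Gi
```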

This is just for your information for the future: if you've faced an issue while using your own playbook, providing minimal reproducible steps like "apply this YAML file" is a better way to report it than just providing your playbook. If only a playbook is provided, it is difficult to determine whether the issue is with AWX/AWX Operator or with your playbook without the community reviewing and debugging your code. Of course, the community has no responsibility to review/debug your playbook, which may prevent the issue from being resolved.

kurokobo avatar Sep 19 '22 22:09 kurokobo

@kurokobo Thanks for the guidelines, I will try to respect them next time.

Your workaround works as expected. I tested with Molecule using the Vagrant platform on Ubuntu 20.04/22.04 and CentOS Stream 8.

K3s: v1.25.0+k3s1
Operator: 0.29.0
AWX: 21.6.0

I even tested upgrading from:

K3s: v1.21.9+k3s1
Operator: 0.28.0
AWX: 21.5.0

And had no issues.

The code is available here: https://github.com/antuelle78/awx-install-on-k3s

antuelle78 avatar Sep 20 '22 11:09 antuelle78

> In 0.29.0, awx-ee is used for the init container to allow runtime modification of receptor config via https://github.com/ansible/awx-operator/pull/1012 ... awx-ee runs with UID 1000 ...
>
> * Modify perms for the actual directory for your PV under `/var/lib/rancher/k3s/storage` for UID:1000

Correct me if I'm wrong, but there might be some security concerns with this workaround if there's another existing account on the host with UID 1000?

doubletwist13 avatar Sep 20 '22 14:09 doubletwist13

@doubletwist13 I only set permissions on the data directories:

    - name: Create data directory
      ansible.builtin.file:
        path: "{{ item }}"
        state: directory
        owner: 1000
        group: 0
      with_items:
        - /data/postgres-13
        - /data/projectsj

antuelle78 avatar Sep 20 '22 14:09 antuelle78

Thank you community and thank you @kurokobo for the suggested workaround!

djyasin avatar Sep 21 '22 17:09 djyasin

@nicolasbouchard-ubi Your issue is not Init:CrashLoopBackOff but ImagePullBackOff, so I think it's just a misconfiguration of imagePullSecrets/credentials and, at the very least, a completely different issue from this topic. If you think your issue is a bug in AWX Operator/AWX, you should create a new issue on the appropriate repository. If not, you may get more help with usage questions on the mailing list or IRC: https://github.com/ansible/awx#get-involved

kurokobo avatar Sep 26 '22 22:09 kurokobo

> @nicolasbouchard-ubi Your issue is not Init:CrashLoopBackOff but ImagePullBackOff, so I think it's just a misconfiguration of imagePullSecrets/credentials and, at the very least, a completely different issue from this topic. If you think your issue is a bug in AWX Operator/AWX, you should create a new issue on the appropriate repository. If not, you may get more help with usage questions on the mailing list or IRC: https://github.com/ansible/awx#get-involved

Sorry, that's entirely true; I deleted my comment to keep this thread clean.

nicolasbouchard-ubi avatar Sep 27 '22 15:09 nicolasbouchard-ubi

Thank you @kurokobo for your suggested workarounds.

There is another workaround submitted in #1054 by setting security_context_settings, but in my opinion this can cause security problems since the pods run as root.
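
For clarity, that #1054-style workaround would look roughly like the following in the AWX custom resource (a sketch only; the field name is taken from the comment above and the exact structure may differ):

```yaml
# Hypothetical AWX CR excerpt; assumes the spec field is named security_context_settings as in #1054
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  security_context_settings:
    runAsUser: 0   # forces the containers to run as root, which is the security concern noted here
```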

Unfortunately, in my case, I am not running AWX on K3s and I am not using the local storage provisioner. For security reasons, I also cannot imagine running all my AWX pods as root in production. So none of the suggested workarounds work in my case.

@TheRealHaoLiu is any fix planned for this? As @kurokobo suggests in https://github.com/ansible/awx-operator/issues/1055#issuecomment-1251613435, could we imagine a second init container dedicated to chmod, or running the init container with a specific securityContext?

FlorianLaunay avatar Sep 30 '22 10:09 FlorianLaunay

Hi, I run AWX 21.7.0 on K3s 1.23 with Longhorn, and it works fine.

chris93111 avatar Oct 06 '22 10:10 chris93111

I'm having the same issue when using 0.29.0 and above on EKS; I've had to revert to the 0.28.0 operator. Init container error:

chmod: changing permissions of '/var/lib/awx/projects': Operation not permitted
chgrp: changing group of '/var/lib/awx/projects': Operation not permitted

leetcarey avatar Oct 06 '22 12:10 leetcarey

Same problem with operator 0.30.0.

I had some hope in #1051 by @rooftopcellist, but it did not fix the problem :cry:

FlorianLaunay avatar Oct 06 '22 15:10 FlorianLaunay

Seeing this in rancher as well. Had to roll back to 0.29.0

Edit: Sorry, I meant to say I'm seeing the issue on 0.29.0 and rolled back to 0.28.0 to fix it.

f0rkz avatar Oct 06 '22 15:10 f0rkz

I think it's best to stay on 0.29.0 for now since 0.30.0 has this issue: https://github.com/ansible/awx/issues/13002

Unless inventory schedules are not needed.

antuelle78 avatar Oct 06 '22 16:10 antuelle78

Have a working fix with #1078.

Waiting to be reviewed :innocent:

FlorianLaunay avatar Oct 06 '22 16:10 FlorianLaunay

> I think it's best to stay on 0.29.0 for now since 0.30.0 has this issue: ansible/awx#13002

Hello @antuelle78, this issue has been fixed in AWX and will be included in our next release. Thank you!

marshmalien avatar Oct 06 '22 20:10 marshmalien

@marshmalien Thanks for the heads up

antuelle78 avatar Oct 08 '22 13:10 antuelle78

> Have a working fix with #1078.
>
> Waiting to be reviewed :innocent:

I just upgraded from 0.27.0 to 0.30.0 and regretted it instantly 😄 Hope this fix is merged soon.

mateuszdrab avatar Oct 19 '22 22:10 mateuszdrab

MR review in progress. Rebase needed, then lift off :rocket: :smile:

FlorianLaunay avatar Nov 03 '22 16:11 FlorianLaunay

The issue is still happening with operator 1.1.3.

eselvam avatar Jan 06 '23 05:01 eselvam

I am also facing this issue: all pods come up and I can access the web UI, but the init container (inside the task pod) is still in a terminated state. Operator version: 2.0.1. I am using the awx-ee image with the latest tag for the init container. CC: @FlorianLaunay. @eselvam did it get fixed for you, and how?

vshete93 avatar Jun 06 '23 19:06 vshete93

Looks like it was starting and then exiting once the web and task pods came up, which is the whole purpose of an init container.

vshete93 avatar Jun 27 '23 19:06 vshete93