K3s issue with Operator version 0.29.0
Please confirm the following
- [X] I agree to follow this project's code of conduct.
- [X] I have checked the current issues for duplicates.
- [X] I understand that the AWX Operator is open source software provided for free and that I might not receive a timely response.
Bug Summary
When deploying AWX 21.6.0 with the latest version of the operator, 0.29.0, on K3s, the awx pod gets stuck in `Init:CrashLoopBackOff` status.
AWX 21.5.0 deployed with operator 0.28.0 works as expected.
AWX Operator version
0.29.0
AWX version
21.6.0
Kubernetes platform
other (please specify in additional information)
Kubernetes/Platform version
v1.21.9+k3s1, v1.25.0+k3s1
Modifications
no
Steps to reproduce
https://github.com/antuelle78/deploy-awx-k3s-ubuntu
Set the following in https://github.com/antuelle78/deploy-awx-k3s-ubuntu/blob/main/roles/deploy-awx-k3s-ubuntu/defaults/main.yml:
- `k3s_version: v1.25.0+k3s1` (I tested multiple versions as far back as 1.21, with the same results)
- `operator_version: 0.29.0`
- `awx_version: 21.6.0`
Then run the deploy.yml playbook.
Expected results
AWX gets deployed and I can access the web interface at hostIP:30080
Actual results
AWX pod gets stuck in Init:CrashLoopBackOff state.
Additional information
The logs included were saved while upgrading from a working configuration (AWX 21.5.0, operator 0.28.0) to 21.6.0/0.29.0.
Operator Logs
https://dpaste.org/pMbqx
I see the same issue in RHEL8.
RHEL8 version: 8.6
K3S Versions:
Client Version: v1.24.4+k3s1
Kustomize Version: v4.5.4
Server Version: v1.24.4+k3s1
Installing with RELEASE_TAG=0.28.0 works fine, but with RELEASE_TAG=0.29.0, the pod gets stuck in Init:CrashLoopBackOff state.
This happens both when doing a clean install per the process from https://computingforgeeks.com/install-and-configure-ansible-awx-on-centos/ and when doing an upgrade from 0.28.0 per the process at https://computingforgeeks.com/how-to-upgrade-ansible-awx-running-in-kubernetes/
@TheRealHaoLiu @shanemcd @rooftopcellist Hi, I think the root cause came from #1012.
- In 0.28.0 or earlier: `centos:stream8` is used as the image for the init container. `centos:stream8` runs as root, so `chmod` and `chgrp` on directories with any perms/owners are allowed.
- In 0.29.0: `awx-ee` is used for the init container, to allow runtime modification of the receptor config via #1012. `awx-ee` runs with UID 1000, so `chmod` and `chgrp` in the init container won't be allowed if UID 1000 has no perms for `/var/lib/awx/projects`.
- In some situations the PV will be mounted with root perms, so the init container will fail. As in this issue, the default storage provisioner for K3s (local-path-provisioner) always mounts volumes with root perms, since it just creates a `hostPath` based volume, and `hostPath` based volumes do not respect `securityContext`.
In the current implementation only one init container is launched, but we could consider defining two init containers: one for certs and receptor using `awx-ee`, and one for `chmod`/`chgrp` using `centos:stream8` (or any other image with root perms). Alternatively, making the init container run as root by appending a `securityContext` under `initContainers` is also acceptable.
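As a rough illustration of that second idea, here is a minimal sketch of what such a rendered pod spec fragment could look like. The container name, image tag, command, and volume names are assumptions for illustration, not the operator's actual output:

```yaml
# Illustrative sketch only: container name, image tag, command, and volume
# names are assumptions, not the operator's actual rendered output.
initContainers:
  - name: init-projects
    image: quay.io/ansible/awx-ee:latest
    command: ["sh", "-c", "chmod 775 /var/lib/awx/projects && chgrp 1000 /var/lib/awx/projects"]
    securityContext:
      runAsUser: 0  # run as root so chmod/chgrp succeed even on root-owned PVs
    volumeMounts:
      - name: projects
        mountPath: /var/lib/awx/projects
```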
@antuelle78 Thanks for filing this issue. As a workaround:
- Use 0.28.0 instead
- Use a pre-defined `hostPath` based PV/PVC with pre-defined perms, as my guide does (see the sketch after this list)
- Modify the perms of the actual directory backing your PV under `/var/lib/rancher/k3s/storage` for UID 1000
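For the second workaround, a minimal sketch of a pre-defined `hostPath` based PV/PVC follows. The names, capacity, storage class, and host path here are assumptions; the host directory must be created beforehand and owned by UID 1000:

```yaml
# Illustrative pre-defined hostPath PV/PVC; names, capacity, and paths are
# assumptions. Pre-create /data/projects on the node, owned by UID 1000.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: awx-projects-volume
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 2Gi
  storageClassName: awx-projects-volume
  hostPath:
    path: /data/projects
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: awx-projects-claim
  namespace: awx
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: awx-projects-volume
```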
This is just for your information for the future: if you face issues in your own playbook, providing minimal reproducible steps like "apply this YAML file" is a better approach than just providing your playbook. If only the playbook is provided, it is difficult to determine whether the issue is with AWX/AWX Operator or with your playbook without the community reviewing and debugging your code. Of course, the community has no responsibility to review or debug your playbook, which may prevent the issue from being resolved.
@kurokobo Thanks for the guidelines; I will try to respect them next time.
Your workaround works as expected. I tested with Molecule using the Vagrant platform on Ubuntu 20.04/22.04 and CentOS Stream 8.
K3s: v1.25.0+k3s1 Operator: 0.29.0 AWX: 21.6.0
I even tested upgrading from:
K3s: v1.21.9+k3s1 Operator: 0.28.0 AWX: 21.5.0
And had no issues.
The code is available here: https://github.com/antuelle78/awx-install-on-k3s
> In 0.29.0, awx-ee is used for the init container to allow runtime modification of receptor config via https://github.com/ansible/awx-operator/pull/1012. awx-ee runs with UID 1000 ...
>
> * Modify perms for the actual directory for your PV under `/var/lib/rancher/k3s/storage` for UID:1000
Correct me if I'm wrong, but there might be some security concerns with this workaround if there's another existing account on the host with UID 1000?
@doubletwist13 I only set permissions on the data directories:

```yaml
# Pre-create the host directories backing the PVs so the AWX init container
# (UID 1000) can write to them.
- name: Create data directory
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    owner: 1000
    group: 0
  with_items:
    - /data/postgres-13
    - /data/projects
```
Thank you community and thank you @kurokobo for the suggested workaround!
@nicolasbouchard-ubi Your issue is not `Init:CrashLoopBackOff` but `ImagePullBackOff`, so I think it's just a misconfiguration of imagePullSecrets/credentials and, at the least, a completely different issue from this topic.
If you think your issue is a bug in AWX Operator/AWX, you should create a new issue on the appropriate repository. If not, you may get more help with usage questions on the mailing list or IRC: https://github.com/ansible/awx#get-involved
Sorry, entirely true. I deleted my comment to keep this thread clean.
Thank you @kurokobo for your suggested workarounds.
There is another workaround submitted in #1054, setting `security_context_settings`, but in my opinion this can cause security problems, with pods running as root.
Unfortunately, in my case I am not running AWX in K3s and I am not using the local storage provisioner. For security reasons, I also cannot imagine running all my AWX pods as root in production. So none of the suggested workarounds work in my case.
@TheRealHaoLiu is any fix planned for this? As @kurokobo suggests in https://github.com/ansible/awx-operator/issues/1055#issuecomment-1251613435, can we imagine a second init container dedicated to `chmod`, or running init with a specific `securityContext`?
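For reference, a sketch of what the #1054 approach might look like in the AWX resource, assuming `security_context_settings` is passed through to the containers' `securityContext`; check #1054 and your operator version for the exact field semantics:

```yaml
# Illustrative only: assumes security_context_settings is propagated to the
# containers' securityContext as proposed in #1054; verify against the
# operator version you run. Running as root is the security concern above.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  security_context_settings:
    runAsUser: 0
    runAsGroup: 0
```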
Hi, I run AWX 21.7.0 on K3s 1.23 with Longhorn, and it works fine.
I'm having the same issue when using 0.29.0 and above in EKS; I've had to revert to the 0.28.0 operator. Init container error:

```
chmod: changing permissions of '/var/lib/awx/projects': Operation not permitted
chgrp: changing group of '/var/lib/awx/projects': Operation not permitted
```
Same problem with operator 0.30.0.
Had some hope in #1051 by @rooftopcellist, but it did not fix the problem :cry:
Seeing this in Rancher as well. Had to roll back to 0.29.0.
Edit: Sorry, I meant to say I'm seeing the issue on 0.29.0, and rolled back to 0.28.0 to fix it.
I think it's best to stay on 0.29.0 for now, since 0.30.0 has this issue: https://github.com/ansible/awx/issues/13002, unless inventory schedules are not needed.
Have a working fix with #1078.
Waiting to be reviewed :innocent:
> I think it's best to stay on 0.29.0 for now since 0.30.0 has this issue: ansible/awx#13002
Hello @antuelle78, this issue has been fixed in AWX and will be included in our next release. Thank you!
@marshmalien Thanks for the heads up
> Have a working fix with #1078. Waiting to be reviewed :innocent:
I just upgraded from 0.27.0 to 0.30.0 and regretted instantly 😄 Hope this fix is merged soon
MR review in progress. Rebase needed then lift off :rocket: :smile:
The issue is still happening with operator 1.1.3.
I am also facing this issue, where all pods come up and I can access the web UI for the tower, but the init container (inside the task pod) is still in a terminated state. Operator version: 2.0.1. I am using image `awx-ee` and version `latest` for spinning up the init container. CC: @FlorianLaunay. @eselvam did it get fixed for you, and how?
Looks like it was starting and then exiting once the web and task pods came up, as that's the whole purpose of an init container.