awx-operator
NFS Issue with AWX operator on OKD 4.6 (ValueError: ZIP does not support timestamps before 1980)
Please confirm the following
- [X] I agree to follow this project's code of conduct.
- [X] I have checked the current issues for duplicates.
- [X] I understand that AWX is open source software provided for free and that I am not entitled to status updates or other assurances.
Summary
We have deployed AWX via the AWX Operator on OKD 4. That works fine, but when we define an NFS-based PVC for AWX, we run into an issue with the ansible-runner (automation) pod: the pod is killed immediately after a series of failures related to Python 3.8.
The issue is tied to the NFS-based PVC definition for AWX. It makes no sense to us that enabling the NFS backend for the AWX pod causes failures, since ansible-runner does not even need it.
Environment
- OKD version: 4.6.0-0.okd-2021-02-14-205305
- AWX Operator: 0.13.0
- AWX version: AWX 19.3.0
AWX version
AWX 19.3.0
Installation method
openshift
Modifications
no
Ansible version
2.9.25
Operating system
CoreOS 33.20210117.3.2
Web browser
No response
Steps to reproduce
- Deploy AWX Operator 0.13.0 on OKD 4
- Create an NFS-based PVC for the AWX pod and trigger the playbook from the AWX UI
Expected results
The playbook should run without any issues.
AWX supports NFS as a backend, and the automation pod should not be affected by this configuration.
Actual results
```
oc logs -f awx-58dc595755-z4ng2 -c awx-task
2021-09-08 02:40:53,377 ERROR [0bab3b33b7d543c9acf6df1351afbdcc] awx.main.tasks job 5234 (running) Exception occurred while running task
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1406, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2935, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2960, in _run_internal
    raise transmitter_thread.exc[1].with_traceback(transmitter_thread.exc[2])
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2915, in run
    super().run()
  File "/usr/lib64/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/utils/common.py", line 1094, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3012, in transmit
    ansible_runner.interface.run(streamer='transmit', _output=_socket.makefile('wb'), **self.runner_params)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/interface.py", line 257, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 53, in run
    stream_dir(self.private_data_dir, self._output)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/utils/streaming.py", line 35, in stream_dir
    archive.write(
  File "/usr/lib64/python3.8/zipfile.py", line 1741, in write
    zinfo = ZipInfo.from_file(filename, arcname,
  File "/usr/lib64/python3.8/zipfile.py", line 539, in from_file
    zinfo = cls(arcname, date_time)
  File "/usr/lib64/python3.8/zipfile.py", line 362, in __init__
    raise ValueError('ZIP does not support timestamps before 1980')
ValueError: ZIP does not support timestamps before 1980
```
AWX Configuration
spec:
  admin_email: ""
  admin_user: admin
  create_preload_data: true
  ee_resource_requirements:
    limits:
      cpu: 1500m
      memory: 8Gi
    requests:
      cpu: 200m
      memory: 1Gi
  garbage_collect_secrets: false
  image_pull_policy: IfNotPresent
  ingress_type: route
  projects_persistence: false
  projects_storage_access_mode: ReadWriteMany
  projects_storage_class: managed-nfs-storage
  projects_storage_size: 10Gi
  replicas: 1
  route_host: ansible.awx.apps.OKD.FQDN
  route_tls_termination_mechanism: Edge
  service_type: ClusterIP
  task_privileged: false
  task_resource_requirements:
    limits:
      cpu: 1000m
      memory: 8Gi
    requests:
      cpu: 200m
      memory: 1Gi
  web_resource_requirements:
    limits:
      cpu: 1000m
      memory: 6Gi
    requests:
      cpu: 200m
      memory: 1Gi
According to other issues on GitHub, the automation pod may fail because of limits and quotas. We did not define any limit/quota configuration, and we do not believe this is related to the compute resources available on the cluster (see external_execution_envs.html#kubernetes-failure-conditions).
When I checked the automation pod, it does not even mount the PVC defined for AWX. Even so, the failures occur and the pod is terminated immediately.
When we set projects_persistence to false in the AWX configuration, it works like a charm. We also defined a hostPath-backed PVC, and that works fine too.
Could there be a conflict between NFS locking and AWX's locking mechanisms? Do we need to configure a PVC definition for execution environments in the AWX UI? (automation job PVC / execution env)
Spec of the automation pod: there is no volume definition for the PVC, and the automation pod does not need one. Yet using NFS as the backend causes failures.
```yaml
spec:
  containers:
  - args:
    - ansible-runner
    - worker
    - --private-data-dir=/runner
    image: awx-ee:2.9.25
    imagePullPolicy: IfNotPresent
    name: worker
    resources: {}
    stdin: true
    stdinOnce: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-mp4sl
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: automation-53139-image-pull-secret-5
  nodeName: compute-2.okd-stg.elcld.net
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-mp4sl
    secret:
      defaultMode: 420
      secretName: default-token-mp4sl
```
What is the relationship between the AWX pod and the automation pod?
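Judging from the traceback, the failing step runs inside the `awx-task` container before anything reaches the automation pod: `ansible_runner.interface.run(streamer='transmit', ...)` calls `stream_dir(self.private_data_dir, ...)`, which zips the job's private data directory with `archive.write(...)` and streams it onward. As we understand it, the worker pod (started with `ansible-runner worker --private-data-dir=/runner`) only unpacks what it receives, which would explain why it needs no PVC. The sketch below is a simplified, hypothetical stand-in for that hand-off, not ansible-runner's actual code: `transmit` zips a directory into bytes and `receive` extracts them elsewhere.

```python
import io
import os
import tempfile
import zipfile


def transmit(src_dir: str) -> bytes:
    """Sender side (hypothetical): zip every file under src_dir into memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # Raises ValueError here if the file's mtime is before 1980,
                # so nothing is ever handed to the receiving side.
                archive.write(full, arcname=os.path.relpath(full, src_dir))
    return buf.getvalue()


def receive(payload: bytes, dest_dir: str) -> list:
    """Receiver side (hypothetical): unpack the archive and list what arrived."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        playbook = os.path.join(src, "playbook.yml")
        with open(playbook, "w") as fh:
            fh.write("- hosts: all\n")
        print(receive(transmit(src), dst))  # prints ['playbook.yml']
        os.utime(playbook, (86400, 86400))  # backdate the file to 1970
        try:
            transmit(src)
        except ValueError as exc:
            print("transmit failed, nothing sent:", exc)
```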
Additional information
No response
I get the same error when trying to sync a host inventory from a Git repository, using Longhorn as the storage backend on a K3s cluster.
I found that the problem starts when the community.windows collection is in requirements.yml. If I remove the community.windows collection, the sync works again.
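If a collection is the trigger, one way to check is to look for files whose modification time predates 1980, since those are exactly what `zipfile` rejects by default. A small read-only sketch, assuming you can run Python where the collection is installed; the directory you pass (for example an `ansible_collections/community/windows` path) is an assumption about your layout:

```python
import datetime
import os
import sys

# ZIP stores local timestamps starting at 1980-01-01; zipfile rejects earlier ones.
ZIP_EPOCH = datetime.datetime(1980, 1, 1)


def find_pre_1980_files(top: str) -> list:
    """Return (path, mtime) pairs for files under `top` that zipfile would reject."""
    hits = []
    for root, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            try:
                st = os.stat(path)  # follows symlinks, like zipfile does
            except OSError:
                continue  # e.g. a broken symlink
            mtime = datetime.datetime.fromtimestamp(st.st_mtime)  # local time
            if mtime < ZIP_EPOCH:
                hits.append((path, mtime))
    return hits


if __name__ == "__main__":
    top = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, mtime in find_pre_1980_files(top):
        print(f"{mtime:%Y-%m-%d %H:%M:%S}  {path}")
```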
I created a Kubernetes ConfigMap containing a copy of Python's internal zipfile.py with "strict_timestamps=False" and mounted it into the AWX task container. After that, the errors with Git sync and community collections no longer occur and jobs run fine.
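The parameter this workaround flips is a documented one: since Python 3.8, `zipfile.ZipFile` accepts a keyword-only `strict_timestamps` argument (default `True`). With `strict_timestamps=False`, files older than 1980 are stored with the date 1980-01-01 00:00:00 instead of raising. Patching the default in zipfile.py forces that behavior for callers, such as ansible-runner in the traceback, that do not pass the flag themselves. A short sketch of the effect:

```python
import io
import os
import tempfile
import zipfile


def stored_date_time(strict: bool) -> tuple:
    """Zip a file dated 1970 and return the date_time recorded in the archive."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "old.txt")
        with open(path, "w") as fh:
            fh.write("hello\n")
        os.utime(path, (86400, 86400))  # mtime in 1970
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", strict_timestamps=strict) as archive:
            archive.write(path, arcname="old.txt")  # raises ValueError if strict
        with zipfile.ZipFile(buf) as archive:
            return archive.getinfo("old.txt").date_time


if __name__ == "__main__":
    print(stored_date_time(strict=False))  # prints (1980, 1, 1, 0, 0, 0)
```

Note the trade-off: the archive silently records 1980-01-01 for such files, which is harmless for a job's private data directory but does lose the original timestamp.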
Does anyone have an idea what the problem may be?
Here is the mount config:
task_extra_volume_mounts: |
  - name: zipfile-py
    mountPath: /usr/lib64/python3.8/zipfile.py
    subPath: zipfile.py
extra_volumes: |
  - name: zipfile-py
    configMap:
      defaultMode: 420
      items:
        - key: zipfile.py
          path: zipfile.py
      name: awx-extra-zipfile
ConfigMap:
kind: ConfigMap
apiVersion: v1
metadata:
  name: awx-extra-zipfile
  namespace: default
data:
  zipfile.py: |-
    <content>
    strict_timestamps=False
    <content>
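Rather than editing the file by hand, the patched copy can be derived from the interpreter's own zipfile.py so that it matches that exact Python version. A sketch, assuming it is run with the same Python version as the `awx-task` container (3.8 in this report); it flips only the `ZipFile.__init__` default and prints a ConfigMap as JSON, which `oc apply -f` / `kubectl apply -f` accept just like YAML. `patch_source` and `build_configmap` are hypothetical helpers; the name `awx-extra-zipfile` and namespace `default` come from the example above.

```python
import inspect
import json
import zipfile

# Substring of the ZipFile.__init__ signature as it appears in CPython 3.8's zipfile.py.
OLD = "compresslevel=None, *, strict_timestamps=True"
NEW = "compresslevel=None, *, strict_timestamps=False"


def patch_source(source: str) -> str:
    """Flip the ZipFile default; refuse to guess if the signature is not found once."""
    if source.count(OLD) != 1:
        raise RuntimeError("unexpected zipfile.py layout; patch by hand")
    return source.replace(OLD, NEW)


def build_configmap(source: str, name: str, namespace: str) -> dict:
    """Wrap the patched file in a ConfigMap under the key 'zipfile.py'."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": {"zipfile.py": patch_source(source)},
    }


if __name__ == "__main__":
    original = inspect.getsource(zipfile)  # this interpreter's zipfile source
    manifest = build_configmap(original, "awx-extra-zipfile", "default")
    print(json.dumps(manifest, indent=2))
```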
@tomsozolins - Thank you for your workaround. It works with AWX 19.5.0, which I deployed with AWX Operator v0.16.0.
I have an OpenShift (OKD) cluster, version 4.5.0-0.okd-2020-07-29-070316, with NFS storage.
My YAML looks like this:
---
apiVersion: v1
kind: Secret
metadata:
  name: awx-postgres-configuration
  namespace: awx
stringData:
  host: "10.1.1.1"
  port: "5432"
  database: "awx_postgres"
  username: "awx_postgres_srv"
  password: "awx_password"
  sslmode: prefer
  # managed: awx operator creates the DB
  # unmanaged: awx operator won't create the db
  type: unmanaged
type: Opaque
---
apiVersion: v1
kind: Secret
metadata:
  name: awx-admin-password
  namespace: awx
stringData:
  password: "helloworld"
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  # ingress_type should be defined as route
  ingress_type: "route"
  # the service type should be set as ClusterIP
  # No need to define it as nodeport etc
  service_type: "ClusterIP"
  # Common name
  route_host: "awx.example.net"
  # TLS Termination mechanism (Edge, Passthrough)
  route_tls_termination_mechanism: "Edge"
  # Name of the admin user
  admin_user: "admin"
  # Email of the admin user
  admin_email: "[email protected]"
  # The secret resource which contains the password
  # admin_password_secret: "awx-admin-password"
  # Should the tasks run in privileged containers?
  task_privileged: false
  # Storage for keeping tasks data
  projects_persistence: true
  projects_storage_class: "managed-nfs-storage"
  projects_storage_size: 100Gi
  # Container requirements
  web_resource_requirements:
    requests:
      cpu: 200m
      memory: 1Gi
    limits:
      cpu: 500m
      memory: 3Gi
  task_resource_requirements:
    requests:
      cpu: 200m
      memory: 1Gi
    limits:
      cpu: 500m
      memory: 2Gi
  ee_resource_requirements:
    requests:
      cpu: 200m
      memory: 1Gi
    limits:
      cpu: 500m
      memory: 2Gi
  task_extra_volume_mounts: |
    - name: zipfile-py
      mountPath: /usr/lib64/python3.8/zipfile.py
      subPath: zipfile.py
      readOnly: true
  extra_volumes: |
    - name: zipfile-py
      configMap:
        defaultMode: 420
        items:
          - key: zipfile.py
            path: zipfile.py
        name: awx-extra-zipfile
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: awx-extra-zipfile
  namespace: awx
data:
  zipfile.py: |-
    ....skipped....
    ....skipped....
    class ZipFile:
        """ Class with methods to open, read, write, close, list zip files.

        z = ZipFile(file, mode="r", compression=ZIP_STORED, allowZip64=True,
                    compresslevel=None)

        file: Either the path to the file, or a file-like object.
              If it is a path, the file will be opened and closed by ZipFile.
        mode: The mode can be either read 'r', write 'w', exclusive create 'x',
              or append 'a'.
        compression: ZIP_STORED (no compression), ZIP_DEFLATED (requires zlib),
                     ZIP_BZIP2 (requires bz2) or ZIP_LZMA (requires lzma).
        allowZip64: if True ZipFile will create files with ZIP64 extensions when
                    needed, otherwise it will raise an exception when this would
                    be necessary.
        compresslevel: None (default for the given compression type) or an integer
                       specifying the level to pass to the compressor.
                       When using ZIP_STORED or ZIP_LZMA this keyword has no effect.
                       When using ZIP_DEFLATED integers 0 through 9 are accepted.
                       When using ZIP_BZIP2 integers 1 through 9 are accepted.
        """

        fp = None                   # Set here since __del__ checks it
        _windows_illegal_name_trans_table = None

        def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=True,
                     compresslevel=None, *, strict_timestamps=False):
    ....skipped....
    ....skipped....
I can confirm this bug happens on 0.11.0 and isn't related to NFS (we use Longhorn and Ceph); the newer version (0.20.0) doesn't have this problem.
We tried upgrading from 0.11.0 to 0.20.0 and everything looked good: AWX started with the new version, but the problem was still present. It seems the upgrade doesn't remove the awx-task container used by older versions, so we had to apply the workaround suggested in this issue and use a ConfigMap to patch the Python file (0.11.0 uses Python 3.9 instead of 3.8; that was the only change we made to the ConfigMap).
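Because the mountPath has to match the interpreter the container actually uses (python3.8 in the original report, python3.9 on 0.11.0 as noted above), it can help to ask that interpreter directly. A small sketch, assuming you can run Python inside the `awx-task` container (for example via `oc exec`); `zipfile_status` is a hypothetical helper that reports where `zipfile` is loaded from and whether the patched default is in effect:

```python
import inspect
import sys
import zipfile


def zipfile_status() -> dict:
    """Report the zipfile module path and ZipFile's strict_timestamps default."""
    params = inspect.signature(zipfile.ZipFile.__init__).parameters
    return {
        "python": "%d.%d" % sys.version_info[:2],
        # On Python 3.8/3.9 this is the single file the ConfigMap must replace.
        "zipfile_path": zipfile.__file__,
        # True on a stock interpreter; False once the patched file is mounted.
        "strict_timestamps_default": params["strict_timestamps"].default,
    }


if __name__ == "__main__":
    for key, value in zipfile_status().items():
        print(f"{key}: {value}")
```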
We haven't tried restoring a backup directly into PostgreSQL on a fresh AWX instance (0.20.0) without the awx-task container. That would probably also fix the issue; it just takes extra steps when upgrading to new versions.
In summary: older versions of the AWX Operator (probably < 0.14.0) have this bug, and upgrading to the latest version (0.20.0) doesn't fix it because the old awx-task container still exists (it doesn't exist in a fresh 0.20.0 deployment).