NFS Issue with AWX operator on OKD 4.6 (ValueError: ZIP does not support timestamps before 1980)

Open mrcetinel opened this issue 3 years ago • 4 comments

Please confirm the following

  • [X] I agree to follow this project's code of conduct.
  • [X] I have checked the current issues for duplicates.
  • [X] I understand that AWX is open source software provided for free and that I am not entitled to status updates or other assurances.

Summary

We deployed AWX via the AWX operator on OKD 4. That works fine, but when we define an NFS-based PVC for AWX, we run into an issue with the ansible-runner (automation) pod. The pod is killed immediately, with many failures related to Python 3.8.

The issue is tied to the NFS-based PVC definition for AWX. It does not make sense that it fails when we enable the NFS backend for the AWX pod, since ansible-runner does not even need it.

Environment

OKD Version: 4.6.0-0.okd-2021-02-14-205305
AWX Operator: 0.13.0
AWX version: AWX 19.3.0

AWX version

AWX 19.3.0

Installation method

openshift

Modifications

no

Ansible version

2.9.25

Operating system

CoreOS 33.20210117.3.2

Web browser

No response

Steps to reproduce

  • Deploy AWX Operator 0.13.0 on OKD 4
  • Create an NFS-based PVC for the AWX pod and trigger a playbook from the AWX UI

Expected results

The playbook should run without any issues.

AWX supports NFS as a backend, and the automation pod should not be affected by this configuration.

Actual results

oc logs -f awx-58dc595755-z4ng2 -c awx-task

2021-09-08 02:40:53,377 ERROR [0bab3b33b7d543c9acf6df1351afbdcc] awx.main.tasks job 5234 (running) Exception occurred while running task
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1406, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2935, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2960, in _run_internal
    raise transmitter_thread.exc[1].with_traceback(transmitter_thread.exc[2])
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2915, in run
    super().run()
  File "/usr/lib64/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/utils/common.py", line 1094, in wrapper_cleanup_new_process
    return func(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3012, in transmit
    ansible_runner.interface.run(streamer='transmit', _output=_socket.makefile('wb'), **self.runner_params)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/interface.py", line 257, in run
    r.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/streaming.py", line 53, in run
    stream_dir(self.private_data_dir, self._output)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/ansible_runner/utils/streaming.py", line 35, in stream_dir
    archive.write(
  File "/usr/lib64/python3.8/zipfile.py", line 1741, in write
    zinfo = ZipInfo.from_file(filename, arcname,
  File "/usr/lib64/python3.8/zipfile.py", line 539, in from_file
    zinfo = cls(arcname, date_time)
  File "/usr/lib64/python3.8/zipfile.py", line 362, in __init__
    raise ValueError('ZIP does not support timestamps before 1980')
ValueError: ZIP does not support timestamps before 1980
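
The last frames of the traceback point at Python's `zipfile` module rejecting a file whose modification time is earlier than 1980. The following is a minimal sketch that reproduces the same `ValueError` outside AWX, assuming only the standard library on Python 3.8 or later. The helper name `try_zip` and the file names are hypothetical and are not part of AWX or ansible-runner.

```python
import os
import tempfile
import zipfile


def try_zip(path, strict):
    """Add `path` to a temporary archive; return the stored year or the error text."""
    archive_path = path + ".zip"
    try:
        with zipfile.ZipFile(archive_path, "w", strict_timestamps=strict) as archive:
            archive.write(path, arcname="old.txt")
            # date_time is (year, month, day, hour, minute, second)
            return archive.getinfo("old.txt").date_time[0]
    except ValueError as exc:
        return str(exc)


with tempfile.TemporaryDirectory() as tmp:
    old_file = os.path.join(tmp, "old.txt")
    with open(old_file, "w") as handle:
        handle.write("data")
    # Force the modification time to the Unix epoch (1970), which predates 1980.
    os.utime(old_file, (0, 0))

    strict_result = try_zip(old_file, strict=True)    # default zipfile behaviour
    lenient_result = try_zip(old_file, strict=False)  # timestamp is clamped instead

print(strict_result)
print(lenient_result)
```

With the default `strict_timestamps=True` the write fails with the message seen in the log. With `strict_timestamps=False` the entry is stored with its date clamped to 1980, which suggests the failure depends on file timestamps in the directory being streamed rather than on the automation pod itself.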

AWX Configuration

spec:
  admin_email: ""
  admin_user: admin
  create_preload_data: true
  ee_resource_requirements:
    limits:
      cpu: 1500m
      memory: 8Gi
    requests:
      cpu: 200m
      memory: 1Gi
  garbage_collect_secrets: false
  image_pull_policy: IfNotPresent
  ingress_type: route
  projects_persistence: false
  projects_storage_access_mode: ReadWriteMany
  projects_storage_class: managed-nfs-storage
  projects_storage_size: 10Gi
  replicas: 1
  route_host: ansible.awx.apps.OKD.FQDN
  route_tls_termination_mechanism: Edge
  service_type: ClusterIP
  task_privileged: false
  task_resource_requirements:
    limits:
      cpu: 1000m
      memory: 8Gi
    requests:
      cpu: 200m
      memory: 1Gi
  web_resource_requirements:
    limits:
      cpu: 1000m
      memory: 6Gi
    requests:
      cpu: 200m
      memory: 1Gi

According to other issues on GitHub, the automation pod may fail because of limits and quotas. We did not define any limit/quota configuration, and we do not believe this is related to the compute resources available on the cluster. external_execution_envs.html#kubernetes-failure-conditions

When I checked the automation pod, it does not even mount the PVC defined for AWX. Yet the job fails and the pod is terminated immediately.

When we set projects_persistence to false in the AWX configuration, it works like a charm. We also defined hostpath for the PVC, and that works fine too.

Could there be an issue between NFS locking and AWX's locking mechanisms? Do we need to configure a PVC definition for execution environments in the AWX UI? automation job PVC/ execution env

Spec of the automation pod: there is no volume definition for the PVC, and the automation pod does not need one. But using NFS as the backend causes failures.

spec:
  containers:
  - args:
    - ansible-runner
    - worker
    - --private-data-dir=/runner
    image: awx-ee:2.9.25
    imagePullPolicy: IfNotPresent
    name: worker
    resources: {}
    stdin: true
    stdinOnce: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-mp4sl
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name: automation-53139-image-pull-secret-5
  nodeName: compute-2.okd-stg.elcld.net
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-mp4sl
    secret:
      defaultMode: 420
      secretName: default-token-mp4sl

What is the relationship between AWX and the automation pod?

Additional information

No response

mrcetinel avatar Sep 08 '21 12:09 mrcetinel

I have the same error when trying to sync a host inventory from a git repository, using Longhorn as the storage backend on a K3s cluster.

I found that the problem starts when I have the community.windows collection in requirements.yml. If I remove the community.windows collection, the sync works again.
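
To narrow down which files trigger the error, one option is to scan the project or collection directory for entries whose modification time falls before 1980, since that is what `zipfile` checks when it builds each archive entry. This is a rough sketch under that assumption. The function name `find_pre_1980` is hypothetical, and the directory to scan depends on your deployment.

```python
import os
import time


def find_pre_1980(root):
    """Return sorted paths under `root` whose local mtime year is before 1980."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                # os.stat follows symlinks, like zipfile does when reading mtime.
                mtime = os.stat(path).st_mtime
            except OSError:
                # Broken symlink or unreadable entry: skip it.
                continue
            if time.localtime(mtime).tm_year < 1980:
                hits.append(path)
    return sorted(hits)
```

Running something like `find_pre_1980("/var/lib/awx/projects")` inside the task container (the path is an assumption about where projects are stored) would list candidate files, which could then be compared with what the community.windows collection installs.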

tomsozolins avatar Sep 09 '21 20:09 tomsozolins

I made a k8s ConfigMap containing Python's internal zipfile.py with "strict_timestamps=False". I mounted it into the AWX task container, and the error with git sync and community collections no longer happens; jobs run fine.

Does anyone have an idea what the underlying problem may be?

Here is the mount config:

task_extra_volume_mounts: |
    - name: zipfile-py
      mountPath: /usr/lib64/python3.8/zipfile.py
      subPath: zipfile.py

extra_volumes: |
    - name: zipfile-py
      configMap:
        defaultMode: 420
        items:
          - key: zipfile.py
            path: zipfile.py
        name: awx-extra-zipfile

ConfigMap:

kind: ConfigMap
apiVersion: v1
metadata:
  name: awx-extra-zipfile
  namespace: default
data:
  zipfile.py: |-
    <content>
    strict_timestamps=False
    <content>

tomsozolins avatar Sep 16 '21 07:09 tomsozolins

@tomsozolins - Thank you for your workaround. It works with AWX 19.5.0, which I deployed with Operator v0.16.0.

I have an OpenShift cluster (4.5.0-0.okd-2020-07-29-070316) with NFS storage.

My YAML looks like this:

---
apiVersion: v1
kind: Secret
metadata:
  name: awx-postgres-configuration
  namespace: awx
stringData:
  host: "10.1.1.1"
  port: "5432"
  database: "awx_postgres"
  username: "awx_postgres_srv"
  password: "awx_password"
  sslmode: prefer
  # managed: awx operator creates the DB
  # unmanaged: awx operator won't create the db
  type: unmanaged
type: Opaque
---
apiVersion: v1
kind: Secret
metadata:
  name: awx-admin-password
  namespace: awx
stringData:
  password: "helloworld"
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  # ingress_type should be defined as route
  ingress_type: "route"
  # the service type should be set as ClusterIP
  # No need to define it as nodeport etc
  service_type: "ClusterIP"
  # Common name
  route_host: "awx.example.net"
  # TLS Termination mechanism (Edge, Passthrough)
  route_tls_termination_mechanism: "Edge"
  # Name of the admin user
  admin_user: "admin"
  # Email of the admin user
  admin_email: "[email protected]"
  # The secret resource which contains the password
  # admin_password_secret: "awx-admin-password"
  # Should the tasks run in privileged containers?
  task_privileged: false
  # Storage for keeping tasks data
  projects_persistence: true
  projects_storage_class: "managed-nfs-storage"
  projects_storage_size: 100Gi
  # Container requirements
  web_resource_requirements:
    requests:
      cpu: 200m
      memory: 1Gi
    limits:
      cpu: 500m
      memory: 3Gi
  task_resource_requirements:
    requests:
      cpu: 200m
      memory: 1Gi
    limits:
      cpu: 500m
      memory: 2Gi
  ee_resource_requirements:
    requests:
      cpu: 200m
      memory: 1Gi
    limits:
      cpu: 500m
      memory: 2Gi

  task_extra_volume_mounts: |
    - name: zipfile-py
      mountPath: /usr/lib64/python3.8/zipfile.py
      subPath: zipfile.py
      readOnly: true

  extra_volumes: |
    - name: zipfile-py
      configMap:
        defaultMode: 420
        items:
          - key: zipfile.py
            path: zipfile.py
        name: awx-extra-zipfile

---
kind: ConfigMap
apiVersion: v1
metadata:
  name: awx-extra-zipfile
  namespace: awx
data:
  zipfile.py: |-
    ....skipped....
    ....skipped....
    class ZipFile:
        """ Class with methods to open, read, write, close, list zip files.
    
        z = ZipFile(file, mode="r", compression=ZIP_STORED, allowZip64=True,
                    compresslevel=None)
    
        file: Either the path to the file, or a file-like object.
              If it is a path, the file will be opened and closed by ZipFile.
        mode: The mode can be either read 'r', write 'w', exclusive create 'x',
              or append 'a'.
        compression: ZIP_STORED (no compression), ZIP_DEFLATED (requires zlib),
                     ZIP_BZIP2 (requires bz2) or ZIP_LZMA (requires lzma).
        allowZip64: if True ZipFile will create files with ZIP64 extensions when
                    needed, otherwise it will raise an exception when this would
                    be necessary.
        compresslevel: None (default for the given compression type) or an integer
                       specifying the level to pass to the compressor.
                       When using ZIP_STORED or ZIP_LZMA this keyword has no effect.
                       When using ZIP_DEFLATED integers 0 through 9 are accepted.
                       When using ZIP_BZIP2 integers 1 through 9 are accepted.
    
        """
    
        fp = None                   # Set here since __del__ checks it
        _windows_illegal_name_trans_table = None
    
        def __init__(self, file, mode="r", compression=ZIP_STORED, allowZip64=True,
                     compresslevel=None, *, strict_timestamps=False):

       ....skipped....
       ....skipped....

zentavr avatar Jan 27 '22 13:01 zentavr

I can confirm this bug happens on 0.11.0. It isn't related to NFS (we use Longhorn and Ceph). The new version (0.20.0) doesn't have this problem.

We tried to upgrade from 0.11.0 to 0.20.0 and everything looked good: AWX started up with the new version, but the problem was still present. It seems the upgrade path doesn't get rid of the awx-task container that was used in old versions, so we had to do what was suggested in this issue and use a ConfigMap to patch the Python file (0.11.0 uses Python 3.9 instead of 3.8; that was the only change in the ConfigMap).

We haven't tried restoring a backup directly to PostgreSQL on a new AWX instance (0.20.0) without the awx-task container. That would probably also fix the issue; it just takes extra steps to upgrade to new versions.

In summary: old versions of AWX operator have this bug (probably <0.14.0), and upgrading to the latest version (0.20.0) doesn't fix it because the awx-task container still exists (it doesn't exist on a new 0.20.0 instance).

koshrf avatar Apr 14 '22 17:04 koshrf