Jobs are killed after 4 hours

Open benapetr opened this issue 1 year ago • 4 comments

Please confirm the following

  • [X] I agree to follow this project's code of conduct.
  • [X] I have checked the current issues for duplicates.
  • [X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • [X] I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

It seems that this bug is back - https://github.com/ansible/awx/issues/11805

Every job that runs longer than 4 hours gets killed by AWX exactly when the 4-hour "limit" is reached.

AWX version

23.7.0

Select the relevant components

  • [ ] UI
  • [ ] UI (tech preview)
  • [X] API
  • [ ] Docs
  • [ ] Collection
  • [ ] CLI
  • [X] Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

Oracle Linux

Web browser

No response

Steps to reproduce

Start a job that runs longer than 4 hours; it gets killed.

Expected results

Jobs don't get killed

Actual results

Job gets killed

Additional information

No response

benapetr • Feb 13 '24 10:02

The jobs are in error status with this information:

Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b''

benapetr • Feb 13 '24 10:02

This is due to the kube apiserver connection time limit and can be fixed by setting:

ee_extra_env: |
  - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
    value: enabled

please refer to https://github.com/ansible/receptor/pull/683 for further detail
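
For context, a minimal sketch of where that setting sits in a complete AWX custom resource. It assumes the AWX Operator's awx.ansible.com/v1beta1 API; the instance name and namespace are placeholders, not values from this thread.

```yaml
# Sketch only: metadata values below are hypothetical placeholders.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx            # placeholder instance name
  namespace: awx       # placeholder namespace
spec:
  # Extra environment variables for the execution environment container.
  # The value is a YAML block string containing a list of name/value pairs.
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
```

In a kustomize-based install, this would typically go in the AWX resource file referenced from kustomization.yaml, followed by re-applying the manifests so the operator picks up the change.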

TheRealHaoLiu • Feb 14 '24 15:02

OK, but why did it start happening only recently? Older versions of AWX didn't have this problem. I will try to add it to the kustomize manifests that install AWX, but I am surprised: why is the linked receptor issue merged and marked resolved if it still affects AWX?

benapetr • Feb 15 '24 11:02

@benapetr the feature landed, but users still need to manually enable the flag in the AWX spec file to apply the fix. Eventually we will be able to enable this flag by default, once all users/customers are on the prerequisite k8s version.

ok, but why did it start happening only recently?

Is it possible that your jobs previously did not run for 4 hours?

fosterseth • Feb 15 '24 19:02

I also observed this "new" behavior after the latest one or two updates of AWX/Operator. And the job also ran for more than 4 hours before: [screenshot]

(that was with):

ansible-playbook [core 2.15.9]
  python version = 3.9.18 (main, Jan  4 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (/usr/bin/python3)
  jinja version = 3.1.3
  libyaml = True

and kube 1.28.3

But don't ask me which AWX version that was. Some recent one.

Strange. I will also try the ee_extra_env.

Commifreak • Feb 20 '24 05:02

Yes, exactly: these jobs always ran for over 10 hours with no problems. Unfortunately, we only run them about once every month or two. Now suddenly they started having problems. We did an OS update and AWX updates in the meantime, so I can't track down /when/ it started happening, but I know for sure it worked in the past and now it doesn't by default.

The fix mentioned by @TheRealHaoLiu definitely fixes it though.

benapetr • Feb 20 '24 10:02

@Commifreak enabling RECEPTOR_KUBE_SUPPORT_RECONNECT is certainly recommended; let us know if this helps your long-running jobs.

fosterseth • Feb 21 '24 18:02

Guess what? Without setting the env var, it's working again!

[screenshot]

What changed in the meantime? => I updated (regular update) to AWX 23.8.1. I don't know if that helped. But I guess it's not bad to set this env var anyway.

Commifreak • Feb 22 '24 05:02

Hao is confused... We recently flipped the default behavior for reconnect to true, since we bumped the required kube version.

Also, there were a couple of bugs we fixed that were caused by some receptor refactoring.

closing this issue...

TheRealHaoLiu • Mar 12 '24 19:03