awx
Jobs are killed after 4 hours
Please confirm the following
- [X] I agree to follow this project's code of conduct.
- [X] I have checked the current issues for duplicates.
- [X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
- [X] I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)
Bug Summary
It seems that this bug is back: https://github.com/ansible/awx/issues/11805
Every job that runs longer than 4 hours gets killed by AWX exactly when the 4-hour "limit" is reached.
AWX version
23.7.0
Select the relevant components
- [ ] UI
- [ ] UI (tech preview)
- [X] API
- [ ] Docs
- [ ] Collection
- [ ] CLI
- [X] Other
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
Oracle Linux
Web browser
No response
Steps to reproduce
Start a job that runs longer than 4 hours. It gets killed.
Expected results
Jobs don't get killed
Actual results
Job gets killed
Additional information
No response
The jobs are in error status with this information:
Failed to JSON parse a line from worker stream. Error: Expecting value: line 1 column 1 (char 0) Line with invalid JSON data: b''
This is due to the kube apiserver connection time limit and can be fixed by setting
ee_extra_env: |
  - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
    value: enabled
Please refer to https://github.com/ansible/receptor/pull/683 for further details.
OK, but why did it start happening only recently? Older versions of AWX didn't have this problem. I will try adding it to the kustomize manifests that install AWX, but I am surprised: why is the linked receptor PR merged and marked resolved if it still affects AWX?
@benapetr the feature landed, but users still need to manually enable the flag in the AWX spec file to apply the fix. Eventually we will be able to enable this flag by default, once all users/customers are on the prerequisite k8s version.
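For anyone unsure where the flag goes, here is a minimal sketch of an AWX custom resource with it enabled. It assumes an awx-operator based install; `metadata.name` and `metadata.namespace` are placeholders, and the `apiVersion` is the one the operator normally uses, so check it against your installed CRD before applying.

```yaml
# Sketch only: name and namespace below are hypothetical placeholders.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-demo
  namespace: awx
spec:
  # ee_extra_env is a block string that holds a YAML list of env vars
  # passed to the execution environment container.
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
```

With kustomize, the same `spec.ee_extra_env` block can go into whichever AWX resource file your `kustomization.yaml` already references.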
> ok, but why did it start happening only recently?

Is it possible that your jobs did not run for 4 hours before?
I also observed this "new" behavior after the latest one or two updates of AWX/Operator. And the job also ran for more than 4 hours before:
(that was with):
ansible-playbook [core 2.15.9]
python version = 3.9.18 (main, Jan 4 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (/usr/bin/python3)
jinja version = 3.1.3
libyaml = True
and kube 1.28.3
But don't ask me which AWX version that was. Some recent one.
Strange. I will also try the ee_extra_env.
Yes, exactly. These jobs always ran for over 10 hours with no problems. Unfortunately we only run them about once every month or two. Now they have suddenly started having problems. We did OS and AWX updates in the meantime, so I can't track down /when/ it started happening, but I know for sure it worked in the past and now it doesn't by default.
The fix mentioned by @TheRealHaoLiu definitely fixes it though.
@Commifreak enabling RECEPTOR_KUBE_SUPPORT_RECONNECT is certainly recommended. Let us know if this helps your long-running jobs.
Guess what? Without setting the env var, it's working again!
What changed in the meantime? => I updated (regular update) to AWX 23.8.1. I don't know if that helped. But I guess it's not bad to set this env var anyway.
Hao is confused... We recently flipped the default behavior for reconnect to true, since we bumped the required kube version.
Also, there were a couple of bugs we fixed that were caused by some receptor refactoring.
closing this issue...