Task was marked as running but was not present in the job queue, so it has been marked as failed.
Please confirm the following
- [X] I agree to follow this project's code of conduct.
- [X] I have checked the current issues for duplicates.
- [X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
- [X] I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)
Bug Summary
One of our jobs consistently fails with this error: Task was marked as running but was not present in the job queue, so it has been marked as failed.
We haven't been able to identify any resource crunch on the k8s cluster, nor are the AWX pods running out of resources.
AWX version
21.3.0
Select the relevant components
- [ ] UI
- [ ] UI (tech preview)
- [ ] API
- [ ] Docs
- [ ] Collection
- [ ] CLI
- [X] Other
Installation method
kubernetes
Modifications
no
Ansible version
No response
Operating system
No response
Web browser
No response
Steps to reproduce
Our setup:
- AKS: 1.23.8
- AWX Operator: 0.24.0
- AWX: 21.3.0
This job connects to ~30 Linux VMs (inventory hosts) and, from each VM, contacts ~100 network devices to collect the output of 3 commands. The output is stored in a dictionary per inventory host.
The job runs okay with fewer network devices (up to ~90) but always fails with 100.
As the error message suggests, the issue does not appear to be with network or device access.
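For illustration, here is a minimal sketch of that pattern (the play, the device-cli command, and the variable names are hypothetical placeholders, not our actual playbook):

```yaml
- name: Collect command output from network devices (hypothetical sketch)
  hosts: linux_vms
  gather_facts: false
  tasks:
    - name: Run the three commands from this VM against each device
      ansible.builtin.command: "device-cli {{ item }} show-version show-run show-inventory"
      loop: "{{ network_devices }}"
      register: device_results
      changed_when: false

    - name: Accumulate the output in a per-host dictionary
      ansible.builtin.set_fact:
        device_output: "{{ device_output | default({}) | combine({item.item: item.stdout}) }}"
      loop: "{{ device_results.results }}"
      loop_control:
        label: "{{ item.item }}"
```

With ~100 devices per VM, the registered results plus the accumulated fact make each host's job events fairly large.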
Expected results
Play runs smoothly and the job finishes as expected.
Actual results
Job fails with error message: Task was marked as running but was not present in the job queue, so it has been marked as failed.
Additional information
No response
@deep7861 you may be running into the k8s max container log issue. Changing this max log size varies depending on your k8s cluster type, but here is a thread that explains it a bit https://github.com/ansible/awx/issues/11338#issuecomment-972708525
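As a rough sketch of what that tuning looks like on a cluster where you control the kubelet configuration directly (values are illustrative; on a managed cluster like AKS you would instead go through its custom node configuration feature, which I believe exposes this as containerLogMaxSizeMB in the kubelet config JSON):

```yaml
# Illustrative KubeletConfiguration snippet raising the per-container
# log rotation threshold; the kubelet default for containerLogMaxSize
# is 10Mi. Do not edit this file directly on managed clusters.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "500Mi"
containerLogMaxFiles: 5
```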
The other thing to look into is the receptor reconnect option: https://github.com/ansible/receptor/pull/683#issue-1423140057
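If you are deploying via the operator, my understanding is that the reconnect support from that PR is toggled with the RECEPTOR_KUBE_SUPPORT_RECONNECT environment variable on the container running receptor. A sketch using the operator's ee_extra_env parameter (assuming an operator version that supports it and a receptor build that includes the feature):

```yaml
# Sketch: enable receptor's kubernetes reconnect support via the AWX CR.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
```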
@fosterseth Thank you for looking into this issue.
While I try to track down the log size relation, I happened to notice some strange behavior.
In some of the posts you mentioned, I saw a suggestion to check the 'result_traceback' value from /api/v2/jobs/<job_id>/ for the failed job.
Now, when I try doing it, the page doesn't load. Here is what I get:

[screenshot]

When I try to look for that job from the usual AWX UI, it fails as well:

[screenshot]
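For reference, fetching the field directly from the API would look like the script below (host, credentials, and job ID are placeholders for my environment):

```python
import requests

# Placeholders: adjust host, credentials, and job ID to your environment.
AWX_HOST = "https://awx.example.com"
JOB_ID = 12345

resp = requests.get(
    f"{AWX_HOST}/api/v2/jobs/{JOB_ID}/",
    auth=("admin", "password"),  # basic auth; an OAuth2 token header also works
)
resp.raise_for_status()
print(resp.json().get("result_traceback"))
```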
When this error appears, I see the following log from the web container:
```
2023/08/08 15:28:49 [error] 33#33: *189 upstream prematurely closed connection while reading response header from upstream, client: 10.244.7.25, server: _, req
10.244.7.25 - - [08/Aug/2023:15:28:49 +0000] "GET /api/v2/unified_jobs/?name__icontains=ine_lm&not__launch_type=sync&order_by=-finished&page=1&page_size=20 HTT
DAMN ! worker 5 (pid: 38) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 5 (new pid: 70)
mounting awx.wsgi:application on /
WSGI app 0 (mountpoint='/') ready in 1 seconds on interpreter 0x7636d0 pid: 70 (default app)
```
Do we know why this is happening?
#9594