
Task was marked as running but was not present in the job queue, so it has been marked as failed.

deep7861 opened this issue 1 year ago • 3 comments

Please confirm the following

  • [X] I agree to follow this project's code of conduct.
  • [X] I have checked the current issues for duplicates.
  • [X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • [X] I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

One of our jobs consistently fails with this error: Task was marked as running but was not present in the job queue, so it has been marked as failed.

[screenshot of the failed job showing this error]

We haven't been able to identify any resource crunch on the k8s cluster, nor are the AWX pods running out of resources.

AWX version

21.3.0

Select the relevant components

  • [ ] UI
  • [ ] UI (tech preview)
  • [ ] API
  • [ ] Docs
  • [ ] Collection
  • [ ] CLI
  • [X] Other

Installation method

kubernetes

Modifications

no

Ansible version

No response

Operating system

No response

Web browser

No response

Steps to reproduce

Our setup:

  • AKS 1.23.8
  • AWX Operator 0.24.0
  • AWX 21.3.0

This job connects to ~30 Linux VMs (inventory hosts), and from each VM it contacts ~100 network devices to collect the output of 3 commands. The output is stored in a dictionary per inventory host.
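
For context, a hypothetical sketch of the kind of play being described (the group names, commands, and ssh-based collection are illustrative assumptions, not the reporter's actual playbook); at ~100 devices and 3 commands per inventory host, the registered results and the per-host dictionary get large quickly:

```yaml
---
# Hypothetical sketch only: group names, commands, and the ssh-based collection
# are assumptions for illustration, not the reporter's playbook.
- hosts: collector_vms
  gather_facts: false
  vars:
    network_devices: "{{ groups['network_devices'] | default([]) }}"  # ~100 per VM
    commands:
      - show version
      - show inventory
      - show interfaces
  tasks:
    - name: Run each command against each device reachable from this VM
      ansible.builtin.command: "ssh {{ item.0 }} {{ item.1 | quote }}"
      loop: "{{ network_devices | product(commands) | list }}"
      register: cli_results
      changed_when: false

    - name: Collect everything into one dictionary per inventory host
      ansible.builtin.set_fact:
        device_output: "{{ device_output | default({}) | combine({item.item.0 ~ ' / ' ~ item.item.1: item.stdout}) }}"
      loop: "{{ cli_results.results }}"
      loop_control:
        label: "{{ item.item.0 }}"
```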

The job runs okay with fewer network devices (up to ~90), but it consistently fails with 100.

Going by the error message, the issue does not appear to be with the network or device access, but with AWX itself.

Expected results

The play runs smoothly and the job finishes as expected.

Actual results

Job fails with error message: Task was marked as running but was not present in the job queue, so it has been marked as failed.

Additional information

No response

deep7861 · Jul 24 '23 05:07

@deep7861 you may be running into the k8s max container log size issue. How you change this max log size depends on your k8s cluster type, but here is a thread that explains it a bit: https://github.com/ansible/awx/issues/11338#issuecomment-972708525
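
For reference, a minimal sketch of the kubelet setting involved, assuming a cluster where you can supply a KubeletConfiguration directly (on AKS the same knob is exposed through custom node configuration instead, e.g. containerLogMaxSizeMB); the values here are illustrative:

```yaml
# Sketch: raise the per-container log rotation threshold so long-running job
# pods don't have their stdout rotated away mid-run. Values are illustrative.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: "100Mi"   # kubelet default is 10Mi
containerLogMaxFiles: 5
```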

The other thing to look into is the receptor reconnect option: https://github.com/ansible/receptor/pull/683#issue-1423140057
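
If your receptor version includes that change, one way to turn it on (my assumption based on how the operator exposes extra environment variables, not something confirmed in this thread) is via the AWX resource's ee_extra_env:

```yaml
# Sketch: enable receptor's kube log reconnect support on the control-plane EE
# container via the AWX custom resource. Assumes a receptor new enough to honor
# RECEPTOR_KUBE_SUPPORT_RECONNECT.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
```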

fosterseth · Jul 26 '23 18:07

@fosterseth Thank you for looking into this issue. While trying to find the log size relation, I happened to notice some strange behavior. In some of the posts you mentioned, there was a suggestion to check the 'result_traceback' value from /api/v2/jobs/job_id for the failed job. When I try doing that, the page doesn't load. Here is what I get: [screenshot of the API page failing to load]

When I try to look for that job in the usual AWX UI, it fails as well: [screenshot of the AWX UI error]

When this error occurs, I see the following log from the web container:

2023/08/08 15:28:49 [error] 33#33: *189 upstream prematurely closed connection while reading response header from upstream, client: 10.244.7.25, server: _, req
10.244.7.25 - - [08/Aug/2023:15:28:49 +0000] "GET /api/v2/unified_jobs/?name__icontains=ine_lm&not__launch_type=sync&order_by=-finished&page=1&page_size=20 HTT
DAMN ! worker 5 (pid: 38) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 5 (new pid: 70)
mounting awx.wsgi:application on /
WSGI app 0 (mountpoint='/') ready in 1 seconds on interpreter 0x7636d0 pid: 70 (default app)
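
The "killed by signal 9" line usually means the kernel OOM killer reaped the uWSGI worker while it was handling that request, so one thing that may be worth trying (my assumption, not something confirmed in this thread) is giving the web container more memory via the operator spec; the figures below are illustrative:

```yaml
# Sketch: raise the web container's memory allocation in the AWX custom
# resource. Field name per the awx-operator spec; values are illustrative.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  web_resource_requirements:
    requests:
      memory: 1Gi
    limits:
      memory: 4Gi
```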

Do we know why this is happening?

deep7861 · Aug 08 '23 15:08

#9594

bpedersen2 · Apr 11 '24 15:04