
Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed

Open bdoublet91 opened this issue 4 years ago • 43 comments

ISSUE TYPE
  • Bug Report
SUMMARY

Sometimes some of my scheduled jobs show as failed in AWX, with no logs, even though the playbook actually ran on the server. In this case it's a system update, something like apt update && apt upgrade.

Sometimes the same scheduled jobs succeed, with logs, and everything is fine.

I get the following explanation:

Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed.
Identity added: /tmp/awx_537_84tgb_ux/artifacts/537/ssh_key_data (***)
Using /etc/ansible/ansible.cfg as config file
Vault password: 

PLAY [Update and Upgrade all packages on Linux machines] ***********************

TASK [Gathering Facts] *********************************************************


ENVIRONMENT
  • AWX version: 15.0.1
  • AWX install method: openshift, minishift, docker on linux, docker for mac, boot2docker
  • Ansible version: 2.9.14
  • Operating System:
cat /etc/os-release 
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"
  • Web Browser: Chrome Version 89.0.4389.82
STEPS TO REPRODUCE

Don't really know.

EXPECTED RESULTS

All logs present and all jobs succeeding.

ACTUAL RESULTS

Jobs randomly fail with no logs.

I don't use the latest version of AWX because there are too many bugs that make it painful to use, so I don't know whether the latest version resolves this issue. If you want more information or tests, feel free to ask. Thank you.

bdoublet91 avatar Mar 16 '21 19:03 bdoublet91

Hey @bdoublet91,

Would you give this a shot on a newer version of AWX? Generally this type of issue means a playbook unexpectedly exited during execution, but it could be a bug in AWX itself that's been addressed since 15.0.1 was released.

ryanpetrello avatar Mar 19 '21 17:03 ryanpetrello

I should be able to test with AWX 17.0.1 next week, I think. I'm not sure about the playbook unexpectedly exiting, because 80% of the time it succeeds. I will add a task to this playbook that writes logs on my server to check what's going on, and I'll update this issue next week.

bdoublet91 avatar Mar 19 '21 19:03 bdoublet91

Some news: I have tested with AWX 17.1.0 and had no problems with failed jobs; all scheduled jobs worked. I also wanted to test 18.0.0 because some bugs were fixed, but I missed the breaking change about data migration and broke my database. Fortunately I had a backup, so I will redeploy 17.1.0, migrate to 18.0.0, and check whether it works.

bdoublet91 avatar Mar 29 '21 09:03 bdoublet91

An update:

I still have problems with my playbooks. I schedule system updates at 1:00 AM in different environments. The same code is executed, but it often fails in the integration environment without logs, whereas it succeeds in the other environments.

Last day:

Identity added: /tmp/bwrap_996_m019kc_5/awx_996_du4k_26v/artifacts/996/ssh_key_data
Using /etc/ansible/ansible.cfg as config file
Vault password:

Day before:

Identity added: /tmp/bwrap_963_obcg3d7x/awx_963_dw8t1kj2/artifacts/963/ssh_key_data
Using /etc/ansible/ansible.cfg as config file
Vault password: 
PLAY [Update and Upgrade all packages on Linux machines] ***********************
TASK [Gathering Facts] *********************************************************
Identity added: /tmp/bwrap_941_n0haer6o/awx_941_lmzmp9d3/artifacts/941/ssh_key_data
Using /etc/ansible/ansible.cfg as config file
Vault password: 
PLAY [Update and Upgrade all packages on Linux machines] ***********************
TASK [Gathering Facts] *********************************************************
ok: [int_haproxy1]
ok: [int_console1]
ok: [int_postgres2]

and sometimes, it succeeded:

PLAY RECAP *********************************************************************
int_administration1        : ok=11   changed=2    unreachable=0    failed=0    skipped=12   rescued=0    ignored=0   
int_console1               : ok=7    changed=1    unreachable=0    failed=0    skipped=12   rescued=0    ignored=0   
int_haproxy1               : ok=7    changed=1    unreachable=0    failed=0    skipped=12   rescued=0    ignored=0   
int_haproxy2               : ok=7    changed=1    unreachable=0    failed=0    skipped=12   rescued=0    ignored=0

AWX 17.1.0 and awx_task run on Docker Swarm. There are no container restarts during execution, and the update completes correctly even though I have no logs in AWX. When I relaunch the playbook manually, it succeeds.

Another point: I have set log_path=/var/log/ansible/update_system.log to keep logs in the awx_task container and check whether something is wrong. The log file is created by the playbook, but no logs are written into it, even though the file's modification date changes. If I launch the playbook from the CLI inside the awx_task container, logs are written to the file correctly.
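For reference, this is the kind of setting I mean (a minimal ansible.cfg sketch; the path is the one mentioned above):

[defaults]
# Append playbook output to this file in addition to the normal stdout callback
log_path = /var/log/ansible/update_system.log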

Do you have any ideas or tests I can run to track down these suspicious failures? Does AWX have a problem writing logs to log_path?

Thanks

PS: I can't update my AWX version because, since 18.0.0, Docker is no longer recommended (or supported) by the AWX installer.

EDIT:

This is my SSH configuration for playbooks in ansible.cfg. I once got this error:

"msg": "Failed to connect to the host via ssh: Failed to add the host to the list of known hosts (/root/.ssh/known_hosts).\r\nmuxclient: master hello exchange failed\r\nFailed to connect to new control master"

[ssh_connection]
pipelining=True
ssh_args = -o ControlMaster=auto -o ControlPersist=10m -o PreferredAuthentications=publickey
control_path = %(directory)s/ansible-ssh-%%h-%%p-%%r

bdoublet91 avatar May 06 '21 15:05 bdoublet91

We seem to have the error "Task was marked as running in Tower but was not present in the job queue, so it has been marked as failed."

This happens as soon as we run multiple jobs at the same time; with a single job it never occurs. We use AWX 19.2.2.

The only thing I could find in supervisor.log after it happened was this:

2021-07-22 19:48:15,401 INFO waiting for awx-dispatcher to stop
2021-07-22 19:48:15,417 INFO stopped: awx-dispatcher (terminated by SIGKILL)
2021-07-22 19:48:15,438 INFO waiting for awx-receiver to stop
2021-07-22 19:48:15,446 INFO stopped: awx-receiver (terminated by SIGKILL)
2021-07-22 19:48:15,455 INFO spawned: 'awx-dispatcher' with pid 31298
2021-07-22 19:48:16,616 INFO success: awx-dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-07-22 19:48:16,619 INFO spawned: 'awx-receiver' with pid 31327
2021-07-22 19:48:17,627 INFO success: awx-receiver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-07-22 19:48:18,034 INFO waiting for awx-rsyslogd to stop
2021-07-22 19:48:18,035 INFO stopped: awx-rsyslogd (exit status 0)
2021-07-22 19:48:18,039 INFO spawned: 'awx-rsyslogd' with pid 31374
2021-07-22 19:48:19,049 INFO success: awx-rsyslogd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

HOSTED-POWER avatar Jul 22 '21 19:07 HOSTED-POWER

I am seeing this as well on K8s, AWX version 19.2.2. It occurs when we have multiple jobs launched at the same time, as bdoublet91 described.

JoelBChapmanII avatar Jul 27 '21 16:07 JoelBChapmanII

I checked many logs but couldn't find anything. It's as if something is reset on certain events, and that this also cancels/kills the jobs.

HOSTED-POWER avatar Jul 27 '21 19:07 HOSTED-POWER

Check NTP for all your nodes. I faced a similar issue, and NTP for one of the server nodes of Ansible Tower was out of sync. Re-syncing them fixed the issue.
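A quick way to verify this on each node (assuming systemd's timedatectl and/or chrony are available):

timedatectl status    # reports whether the system clock is NTP-synchronized
chronyc tracking      # if chrony is in use, shows the current offset from the NTP source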

afnanqureshi avatar Aug 03 '21 21:08 afnanqureshi

I just have one box with three Docker containers; it's unlikely there is a time problem, right?

HOSTED-POWER avatar Aug 03 '21 21:08 HOSTED-POWER

Found this around the timestamp of the job failure:

awx-uwsgi stdout | 172.18.0.1 GET /api/v2/jobs/13336/ - HTTP/1.1 200
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:29,768 INFO     [4d2e0f6f] awx.api.authentication User hosted-power performed a GET to /api/v2/jobs/13336/job_events/ through the API using OAuth 2 token 43.
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:29,811 DEBUG    [4d2e0f6f] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/jobs/13336/job_events/?order_by=start_line&no_truncate=True&start_line__gte=2541'>, response_time: 0.074s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.18.0.1 GET /api/v2/jobs/13336/job_events/?order_by=start_line&no_truncate=True&start_line__gte=2541 - HTTP/1.1 200
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,115 INFO     [90ad53de] awx.api.authentication User hosted-power performed a GET to /api/v2/jobs/13336/ through the API using OAuth 2 token 43.
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,157 DEBUG    [90ad53de] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/jobs/13336/'>, response_time: 0.087s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.18.0.1 GET /api/v2/jobs/13336/ - HTTP/1.1 200
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,209 INFO     [212c08ae] awx.api.authentication User hosted-power performed a GET to /api/v2/jobs/13336/job_events/ through the API using OAuth 2 token 43.
awx-uwsgi stdout |
awx-uwsgi stdout | WSGI app 0 (mountpoint='') ready in 2 seconds on interpreter 0x2411e30 pid: 12608 (default app)
awx-uwsgi stdout |
**awx-uwsgi stdout | running "exec: supervisorctl restart tower-processes:awx-dispatcher tower-processes:awx-receiver" (accepting1)...**
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,252 DEBUG    [212c08ae] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/jobs/13336/job_events/?order_by=start_line&no_truncate=True&start_line__gte=2546'>, response_time: 0.086s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.18.0.1 GET /api/v2/jobs/13336/job_events/?order_by=start_line&no_truncate=True&start_line__gte=2546 - HTTP/1.1 200
awx-uwsgi stdout |
2021-08-04 08:04:30,412 INFO waiting for awx-dispatcher to stop
2021-08-04 08:04:30,421 INFO stopped: awx-dispatcher (terminated by SIGKILL)
2021-08-04 08:04:30,431 INFO stopped: awx-receiver (terminated by SIGKILL)
2021-08-04 08:04:30,437 INFO spawned: 'awx-dispatcher' with pid 12628
awx-dispatcher stdout | make[1]: Entering directory '/awx_devel'
awx-dispatcher stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,548 INFO     [385c06bf] awx.api.authentication User hosted-power performed a GET to /api/v2/jobs/13336/ through the API using OAuth 2 token 43.
awx-uwsgi stdout |
awx-receptor stdout | ERROR2021/08/04 08:04:30 Write error in control service: write unix /var/run/receptor/receptor.sock->@: write: broken pipe
awx-receptor stdout | INFO2021/08/04 08:04:30 Client disconnected from control service
awx-receptor stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,599 DEBUG    [385c06bf] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/jobs/13336/'>, response_time: 0.086s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.18.0.1 GET /api/v2/jobs/13336/ - HTTP/1.1 200
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,640 INFO     [dd7aedd8] awx.api.authentication User hosted-power performed a GET to /api/v2/jobs/13336/job_events/ through the API using OAuth 2 token 43.
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,668 DEBUG    [dd7aedd8] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/jobs/13336/job_events/?order_by=start_line&no_truncate=True&start_line__gte=2549'>, response_time: 0.062s
awx-uwsgi stdout |
awx-uwsgi stdout | 172.18.0.1 GET /api/v2/jobs/13336/job_events/?order_by=start_line&no_truncate=True&start_line__gte=2549 - HTTP/1.1 200
awx-uwsgi stdout |
awx-uwsgi stdout | 2021-08-04 08:04:30,963 INFO     [c835d51a] awx.api.authentication User hosted-power performed a GET to /api/v2/jobs/13336/ through the API using OAuth 2 token 43.
awx-uwsgi stdout |

awx-uwsgi stdout | running "exec: supervisorctl restart tower-processes:awx-dispatcher tower-processes:awx-receiver" (accepting1)...

Why would it do this? It happens exactly at the moment we experience our lost task.

HOSTED-POWER avatar Aug 04 '21 08:08 HOSTED-POWER

+1, also seeing this

kimbernator avatar Aug 05 '21 17:08 kimbernator

It's really severe on our side; we can hardly use AWX at the moment. As soon as we launch multiple jobs, we get in trouble :(

HOSTED-POWER avatar Aug 10 '21 07:08 HOSTED-POWER

Unfortunately the upgrade to 19.3.0 didn't fix it

HOSTED-POWER avatar Aug 14 '21 21:08 HOSTED-POWER

This isn't particularly scientific, but it appears that this was mitigated for me in some capacity by increasing the memory limit for the awx-task container to 4Gi. I'll do some more testing once I get my kubernetes cluster beefed up a little bit to handle the extra requirements.

kimbernator avatar Sep 03 '21 19:09 kimbernator

We now have 16 GB of RAM and 6 CPU cores just for this AWX instance, and our issues still aren't resolved. We still get failed tasks :(

HOSTED-POWER avatar Sep 07 '21 09:09 HOSTED-POWER

@HOSTED-POWER Are those the requests/limits for the awx-task container or the specs of the system it runs on?

kimbernator avatar Sep 07 '21 19:09 kimbernator

Those are the specs of the host machine; the containers don't have any limits configured.

HOSTED-POWER avatar Sep 07 '21 19:09 HOSTED-POWER

In a situation like that I'd consider defining SYSTEM_TASK_ABS_MEM in the operator such that it can't spin up a basically unlimited number of queue workers in the awx-task container. Each one takes 150MB+ of memory itself and spins up an EE pod which could take much more than that, eventually choking the system and preventing jobs from being re-queued.

The max worker count when that setting is defined is just the value * 5, so it may take some tweaking and will likely result in a small throughput loss, but it may end up resolving the problem.

kimbernator avatar Sep 08 '21 13:09 kimbernator

I don't fully understand; we have 16 GB of RAM on the host machine.

The container has no limits.

Where is the problem coming from exactly?

I tried this on the container:

cat /proc/meminfo
MemTotal: 16427040 kB

Looks like it reads 16 GB, so where is this problem coming from? :)

(Just double checking to make sure it applies)

HOSTED-POWER avatar Sep 08 '21 14:09 HOSTED-POWER

I'm not certain this will be the cause for everyone seeing this error; I can only say that it was what solved it for me. I would encourage you to read my extended description in https://github.com/ansible/awx/issues/11036, but here it is fitted to your scenario:

The awx-task container determines the maximum number of workers it can start simultaneously to handle the lifecycle of EE containers running jobs in the queue based on the available system memory. In your case it's reading 16GB as the system memory budgeted just for queue workers, which calculates to a max of 85 workers ((16GB + 1) * 5). Each worker takes ~150-200MB of memory upon initialization. Obviously given the chance to have enough jobs, the workers alone would consume between 12,750MB and 17,000MB of memory. But that's not budgeting the likely much higher memory consumption of each individual EE pod being spun up, so the 16GB will be fully consumed long before you hit 85 workers, assuming that the EE pods run on the same system. Once all available memory has been used, these workers start dying off and their queued jobs are lost, presumably because there isn't enough memory to process re-queuing them.
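To make the arithmetic above concrete, here is a rough sketch of that calculation (my own illustration using only the formula and per-worker figures quoted in this comment, not AWX's actual code):

# Rough capacity sketch based on the figures quoted above, not AWX source code.
def max_workers(detected_mem_gb: int) -> int:
    # Formula quoted in this comment: (memory in GB + 1) * 5
    return (detected_mem_gb + 1) * 5

workers = max_workers(16)                 # -> 85 workers on a 16 GB host
low, high = workers * 150, workers * 200  # -> 12750 to 17000 MB for the workers alone
print(f"{workers} workers, {low}-{high} MB before counting any EE pods")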

By defining SYSTEM_TASK_ABS_MEM, you override AWX's built-in method of determining system memory and limit the maximum queue workers so they can't fill up the available system memory. In the awx-operator, I define that with the following because I gave my awx-task container a limit of 3Gi of memory:

  extra_settings:
    - setting: SYSTEM_TASK_ABS_MEM
      value: "3"

Now I'm limited to 15 concurrent jobs in AWX and I see no more errors.

kimbernator avatar Sep 08 '21 14:09 kimbernator

OK, that would make sense. Just one thing I find strange: we never used more than 5 GB of the 16 GB of RAM... (we log statistics of the host machine)

So I don't think we're in a resource-exhaustion situation? Or is something else possible?

HOSTED-POWER avatar Sep 08 '21 14:09 HOSTED-POWER

It depends on the other pods in the cluster as well. If they have memory requests that are higher than their actual memory usage, they will prevent any other pod from accessing that memory without actually "using" it. If you're easily able to recreate the situation where the error described here happens, I would suggest running a kubectl describe node <node> while errors are occurring to get the requests/limits of the node to see if requests have come close to/exceeded 100%.
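For example (hypothetical node name; this just pulls out the per-node requests/limits summary mentioned above):

kubectl describe node worker-node-1 | grep -A 10 'Allocated resources'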

kimbernator avatar Sep 08 '21 15:09 kimbernator

Also, it may be silly but I'd verify that there aren't any default limits on pods in your namespace that might be applying to your pods without you realizing - would be visible in kubectl describe namespace <namespace>

kimbernator avatar Sep 08 '21 15:09 kimbernator

Thanks, but we run them in Docker at the moment, and no metric is showing signs of overload as far as I can see.

In which .py file can I set SYSTEM_TASK_ABS_MEM? I would try it inside the container as a test otherwise.

HOSTED-POWER avatar Sep 09 '21 15:09 HOSTED-POWER

I have it in /etc/tower/settings.py.

kimbernator avatar Sep 09 '21 19:09 kimbernator

OK, I set this:

cat /etc/tower/settings.py
SYSTEM_TASK_ABS_MEM = 3

Not sure how this would help... :)

It's still quite a strange bug

HOSTED-POWER avatar Sep 09 '21 19:09 HOSTED-POWER

This doesn't make any difference; it must be something (totally) different.

We keep getting tens of failed tasks every day; it's hell and unusable :(

HOSTED-POWER avatar Sep 10 '21 09:09 HOSTED-POWER

Well, we migrated to Kubernetes only to see the same problems keep happening. We just see "Error" now, without any extra information.

HOSTED-POWER avatar Oct 25 '21 09:10 HOSTED-POWER

Facing the same issue with 19.2.1 running in Docker; not cool with a 4-hour workflow :D

goldyfruit avatar Oct 27 '21 13:10 goldyfruit

The worker is killed for no reason (that I'm aware of)... as if uWSGI reached a limit or a timeout.

awx-uwsgi stdout | ...The work of process 530 is done. Seeya!
awx-uwsgi stdout | 
awx-uwsgi stdout | worker 1 killed successfully (pid: 530)
awx-uwsgi stdout | 
awx-uwsgi stdout | Respawned uWSGI worker 1 (new pid: 11323)
awx-uwsgi stdout | 
awx-uwsgi stdout | Python auto-reloader enabled
awx-uwsgi stdout | 
awx-uwsgi stdout | WSGI app 0 (mountpoint='') ready in 1 seconds on interpreter 0x1df6cf0 pid: 11323 (default app)
awx-uwsgi stdout | running "exec: supervisorctl restart tower-processes:awx-dispatcher tower-processes:awx-receiver" (accepting1)...
awx-uwsgi stdout | 
2021-10-27 17:16:47,970 INFO stopped: awx-dispatcher (terminated by SIGKILL)
2021-10-27 17:16:47,971 INFO waiting for awx-receiver to stop
2021-10-27 17:16:47,971 INFO stopped: awx-receiver (terminated by SIGKILL)
2021-10-27 17:16:47,976 INFO spawned: 'awx-dispatcher' with pid 11329
awx-dispatcher stdout | make[1]: Entering directory '/awx_devel'
awx-dispatcher stdout | 
2021-10-27 17:16:48,992 INFO success: awx-dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2021-10-27 17:16:48,994 INFO spawned: 'awx-receiver' with pid 11348
awx-receiver stdout | make[1]: Entering directory '/awx_devel'
awx-receiver stdout | 
awx-uwsgi stdout | 2021-10-27 17:16:49,926 INFO     [c35361bd] awx.api.authentication User goldyfruit performed a GET to /api/v2/metrics/ through the API using OAuth 2 token 176.
awx-uwsgi stdout | 
awx-uwsgi stdout | 2021-10-27 17:16:49,982 DEBUG    [c35361bd] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/metrics/'>, response_time: 0.103s
awx-uwsgi stdout | 
awx-uwsgi stdout | 24.225.217.83 GET /api/v2/metrics/ - HTTP/1.1 200
awx-uwsgi stdout | 
2021-10-27 17:16:50,984 INFO success: awx-receiver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
awx-uwsgi stdout | tower-processes:awx-dispatcher: stopped
awx-uwsgi stdout | tower-processes:awx-receiver: stopped
awx-uwsgi stdout | tower-processes:awx-dispatcher: started
awx-uwsgi stdout | tower-processes:awx-receiver: started
awx-uwsgi stdout | 
awx-dispatcher stdout | 2021-10-27 17:16:51,826 DEBUG    [-] awx.analytics.job_lifecycle job-1656 event processing finished
awx-dispatcher stdout | 
awx-receiver stdout | 2021-10-27 17:16:52,613 WARNING  [-] awx.main.commands.run_callback_receiver scaling up worker pid:11402
awx-receiver stdout | 
awx-receiver stdout | 2021-10-27 17:16:52,618 WARNING  [-] awx.main.commands.run_callback_receiver scaling up worker pid:11403
awx-receiver stdout | 
awx-receiver stdout | 2021-10-27 17:16:52,622 WARNING  [-] awx.main.commands.run_callback_receiver scaling up worker pid:11404
awx-receiver stdout | 
awx-receiver stdout | 2021-10-27 17:16:52,627 WARNING  [-] awx.main.commands.run_callback_receiver scaling up worker pid:11405
awx-receiver stdout | 
awx-uwsgi stdout | 2021-10-27 17:16:54,911 INFO     [9b66dac4] awx.api.authentication User goldyfruit performed a GET to /api/v2/metrics/ through the API using OAuth 2 token 176.
awx-uwsgi stdout | 
awx-uwsgi stdout | 2021-10-27 17:16:54,959 DEBUG    [9b66dac4] awx.analytics.performance request: <WSGIRequest: GET '/api/v2/metrics/'>, response_time: 0.080s
awx-uwsgi stdout | 
awx-uwsgi stdout | 24.225.217.83 GET /api/v2/metrics/ - HTTP/1.1 200
awx-uwsgi stdout | 
awx-dispatcher stdout | 2021-10-27 17:16:57,066 ERROR    [-] awx.main.dispatch job 1656 (failed) is no longer running; reaping
awx-dispatcher stdout | 
awx-receptor stdout | INFO 2021/10/27 17:16:57 Client connected to control service
awx-receptor stdout | 
awx-receptor stdout | INFO 2021/10/27 17:16:57 Control service closed
awx-receptor stdout | INFO 2021/10/27 17:16:57 Client disconnected from control service
awx-receptor stdout | 
2021-10-27 17:16:57,330 INFO waiting for awx-rsyslogd to stop
2021-10-27 17:16:57,331 INFO stopped: awx-rsyslogd (exit status 0)
2021-10-27 17:16:57,334 INFO spawned: 'awx-rsyslogd' with pid 11411
2021-10-27 17:16:58,336 INFO success: awx-rsyslogd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

goldyfruit avatar Oct 27 '21 17:10 goldyfruit