'Error opening pod stream' or 'Stream error running pod' randomly on some jobs
Please confirm the following
- [X] I agree to follow this project's code of conduct.
- [X] I have checked the current issues for duplicates.
- [X] I understand that AWX is open source software provided for free and that I might not receive a timely response.
Bug Summary
Hi, I have a k3s cluster with AWX installed on it, and everything works fine except for one thing (and it is a very important thing).
Sometimes, at random, a job will not start, and in the output I have this error:
Error opening pod stream: Get "https://ip_of_my_server:10250/containerLogs/awx/automation-job-567099-7hw6l/worker?follow=true": EOF
or more rarely
Stream error running pod: stdin: error dialing backend: EOF, stdout: http2: response body closed
But if I restart the job it will start and run normally.
I noticed that it happens when several jobs are running at the same time (10 or more), but it doesn't always happen. For example, I have a big workflow with 32 jobs starting at the same time, launched twice each morning at 7:00 am and 8:00 am; sometimes both runs succeed, and sometimes 2 or 3 jobs fail in the first workflow while the second launch is OK. It's very weird.
I looked for a way to limit the number of jobs running in parallel on AWX but couldn't find one, and the number of forks in my AWX configuration is 100. My server has 12 CPUs and 20 GB of RAM, so I don't think it's a resource issue.
My version of k3s is v1.24.4+k3s1, my version of AWX is 21.5.0, and my system is CentOS Linux release 7.9.2009.
Does anyone have a solution or a workaround for this? It's very problematic: when my workflows run, a whole workflow can fail because of one job with this error.
Thanks
AWX version
21.5.0
Select the relevant components
- [ ] UI
- [ ] API
- [ ] Docs
- [ ] Collection
- [ ] CLI
- [X] Other
Installation method
kubernetes
Modifications
no
Ansible version
2.12.5
Operating system
CentOS Linux release 7.9.2009
Web browser
Firefox, Chrome, Safari, Edge
Steps to reproduce
I think you have to run multiple jobs in parallel, and sometimes some of them will fail.
Expected results
All jobs succeed without these errors
Actual results
Random jobs fail with errors:
Error opening pod stream: Get "https://ip_of_my_server:10250/containerLogs/awx/automation-job-567099-7hw6l/worker?follow=true": EOF
or more rarely
Stream error running pod: stdin: error dialing backend: EOF, stdout: http2: response body closed
Additional information
No response
Hello, We feel like we have seen this on our mailing list a few times in the past. You can view those instances here: https://groups.google.com/g/awx-project/search?q=error%20opening%20pod%20stream
Please let us know if any of those threads resolve this issue for you.
Hi, thank you for your answer, but I have already searched the mailing list for my problem and didn't find any solution. One of the suggestions is to upgrade k3s to the latest version, which I did, but the problem is still there.
Thanks
duplicate of https://github.com/ansible/awx/issues/12549
seems users are still experiencing this bug
Having the same issue, using k3s 1.24
Adding that this does appear to be a persistent issue on most modern versions of K3s. I upgraded from 1.24 to 1.25 last evening, and the issue has persisted.
VERSION:
v1.25.0+k3s1 (26e94057)
Stream error running pod: stdin: error dialing backend: EOF, stdout: http2: response body closed
Like others, the issue appears to manifest when multiple slices, regardless of automation task, are running simultaneously despite plenty of available resources. The issue occurs immediately upon container creation.
Describing the pod yields the following, which was an inventory sync operation as the first step of a job with 3 slices. The pod has no logs when terminated.
```
Name: automation-job-429-x6pp6
Namespace: awx
Priority: 0
Service Account: default
Node: <node>/192.168.2.24
Start Time: Wed, 28 Sep 2022 08:44:20 -0400
Labels: ansible-awx=3320e152-be3c-4482-8155-7d028564aaf8
ansible-awx-job-id=429
Annotations: <none>
Status: Terminating (lasts <invalid>)
Termination Grace Period: 30s
IP: 10.42.0.68
IPs:
IP: 10.42.0.68
Containers:
worker:
Container ID: containerd://5675b037ed694db09b9ade0c2249de28c733a74e1375c8cb11730ed75ec0a62d
Image: <image>
Image ID: <image>@sha256:fbc23d802372adb8d0583c2beb0ce43324b9de5adb3d6891adeca52a3121e70e
Port: <none>
Host Port: <none>
Args:
ansible-runner
worker
--private-data-dir=/runner
State: Running
Started: Wed, 28 Sep 2022 08:44:21 -0400
Ready: True
Restart Count: 0
Requests:
cpu: 250m
memory: 100Mi
Environment: <none>
Mounts: <none>
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes: <none>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 11s default-scheduler Successfully assigned awx/automation-job-429-x6pp6 to <node>
Normal Pulled 11s kubelet Container image "<image>" already present on machine
Normal Created 11s kubelet Created container worker
Normal Started 10s kubelet Started container worker
Normal Killing 8s kubelet Stopping container worker
```
Can easily reproduce this on k3d, v1.24.4+k3s1:
- deploy AWX on k3d
- launch a handful of demo jobs at once. A good number will fail with the error
Stream error running pod: stdin: error dialing backend: EOF, stdout: http2: response body closed
The error occurs when trying to pass stdin into the pod using the SPDY stream option in Kubernetes:
https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go#L400
https://pkg.go.dev/k8s.io/client-go/tools/remotecommand#pkg-functions
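For anyone not familiar with that code path, here is a rough sketch of the client-go attach-over-SPDY pattern those links refer to; it is my own illustration with placeholder names (the `podstream` package, `streamStdinToPod`, the hard-coded `worker` container), not the actual receptor implementation:

```go
// Rough illustration only: placeholder names, error handling reduced to bare returns.
package podstream

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// streamStdinToPod attaches to a running pod's "worker" container and
// streams a payload to it as stdin over SPDY.
func streamStdinToPod(config *rest.Config, clientset *kubernetes.Clientset, namespace, podName, payload string) error {
	// Build the "attach" subresource request for the pod, asking for an
	// stdin stream on the worker container.
	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(namespace).
		Name(podName).
		SubResource("attach").
		VersionedParams(&corev1.PodAttachOptions{
			Container: "worker",
			Stdin:     true,
		}, scheme.ParameterCodec)

	// The SPDY executor upgrades the connection and dials the kubelet
	// through the API server.
	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return err
	}

	// Copy the payload into the container's stdin until EOF.
	return exec.Stream(remotecommand.StreamOptions{
		Stdin: strings.NewReader(payload),
	})
}
```

Because the stdin copy rides on that upgraded connection between the API server and the kubelet, a broken dial on that hop surfaces as the "error dialing backend: EOF" failures reported above.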
Long term I'd like a solution that gets away from passing stdin via SPDY. The receptor stdin is accessible inside the cluster already, so mounting that into the AWX job pod via some volume might be better.
However, that wouldn't work for the remote container group case. Maybe a receptor sidecar running inside the remote cluster could receive stdin and mount it into the pod.
@shanemcd this appears to be an issue for a number of users
Any update on this? It's very severe if you suffer from it... :)
I fixed it for now by downgrading the whole kubernetes to 1.21 as suggested here: https://groups.google.com/g/awx-project/c/LKPpxPpXIAk
Upgrading to 21.7.0 on 1.25.2+k3s1 fixed the error for me.
@HOSTED-POWER @nicolaibaralmueller
I have this upstream receptor PR that I am hoping addresses this issue https://github.com/ansible/receptor/pull/673
One thing I noticed when replicating this issue:
if I start my k3d services with `k3d cluster create sbf`, I can't reproduce the problem;
if I start my k3d services with `k3d cluster create sbf -p "8043:30080@agent:0" --agents 2`, I do get the problem. This second method is how I start the cluster so that I expose the nodeport that is on 30080.
I think the receptor patch I have above will help, but I think there is some environment issue going on too.
Receptor PR merged (see above) and should be fixed in `awx-ee:latest` as of today.
Redeploying AWX on your k8s clusters should trigger a download of the latest awx-ee image -- no AWX update will be needed to get the fix.
For those who were previously experiencing the bug, please let me know if this helps!
Hi @fosterseth, thank you for your work! But I'm using my own awx-ee which is built with ansible-builder, so I'm using this base image: `quay.io/ansible/ansible-runner:latest`. Is there any solution for me to get the fix with this?
Edit: In fact I just saw that awx-ee:latest also uses this image, and I think this line in the `additional_build_steps` should get the fix: `COPY --from=quay.io/ansible/receptor:devel /usr/bin/receptor`. I will test whether I still have the problem and let you know.
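For anyone else rebuilding a custom EE, a rough sketch of how that could look in an ansible-builder `execution-environment.yml`; this is my assumption of the relevant pieces (including the receptor destination path and where the step goes), not the exact awx-ee definition:

```yaml
---
version: 1
build_arg_defaults:
  # Base image mentioned above; adjust to your own EE base.
  EE_BASE_IMAGE: quay.io/ansible/ansible-runner:latest
additional_build_steps:
  append:
    # Copy a receptor binary that contains the stream fix into the image.
    # The destination path is assumed here, not taken from awx-ee itself.
    - COPY --from=quay.io/ansible/receptor:devel /usr/bin/receptor /usr/bin/receptor
```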
Thanks
@Mrmel94
Just for your information, in my understanding the container that should use the updated Receptor is not the EE for your job, but the `awx-ee` container in AWX's pod.
If you don't use a custom EE image for the control plane, I think all you need to do is make sure that Kubernetes pulls the new image for the `awx-ee` container again. The way to achieve this depends on your `imagePullPolicy`.
I am building a single image for both the control plane EE and the actual worker EE based on `quay.io/ansible/awx-ee:latest`. However, I started to face an issue building it with ansible-builder, as described in this question.
@kurokobo Yes I think you're right, I made a mistake because I thought that the awx-ee container in the awx pod was the same thing as the default EE for my job which is also called awx-ee (by the way, thanks for your tutorial on your gitlab page, it's what I used to install my k3s cluster with awx a few months ago)
So I changed the parameter `image_pull_policy` from `IfNotPresent` to `Always` in the file `awx-operator/config/crd/bases/awx.ansible.com_awxs.yaml`, redeployed AWX, and compared the sha256 of my `awx-ee` container to `quay.io/ansible/awx-ee:latest` to make sure I have the latest version.
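For reference, a minimal sketch of what setting the same pull policy on the AWX custom resource could look like, assuming an instance named `awx` in the `awx` namespace (your names and the rest of the spec will differ):

```yaml
---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  namespace: awx
spec:
  # Make Kubernetes re-pull control-plane images (including awx-ee)
  # whenever the pods are recreated, instead of reusing a cached copy.
  image_pull_policy: Always
```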
And since yesterday the problem hasn't happened; maybe your fix @fosterseth has resolved the issue (I hope). I will keep you informed over the next few days.
Thanks
Hi, just for your information: 5 days without any job failing with this error. I think your fix @fosterseth resolved the issue! Thanks a lot!!
@Mrmel94 thanks for the feedback, that is helpful. Glad this took care of it :)
I was having the same issue in my environment. After awx-ee update, it seems to have stopped. 3 days without error.
@lucasvivian thank you for confirming that it is fixed!
Closing issue, as it seems to be solved in the latest versions of AWX.
One of our customers is seeing something similar.
```
$ kubectl logs automation-controller-task-668d6ddd55-jsm6x -c automation-controller-ee -n ansible-automation-platform | grep -i proxy | tail -2
WARNING 2023/11/29 13:02:56 [oxWaTqzn] Error opening log stream for pod ansible-automation-platform/automation-job-36767-xcls2. Will retry 5 more times. Error: Get "https://10.49.253.6:10250/containerLogs/ansible-automation-platform/automation-job-36767-xcls2/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true": proxy error from localhost:9443 while dialing 10.49.253.6:10250, code 503: 503 Service Unavailable
WARNING 2023/11/29 13:02:57 [oxWaTqzn] Error opening log stream for pod ansible-automation-platform/automation-job-36767-xcls2. Will retry 4 more times. Error: Get "https://10.49.253.6:10250/containerLogs/ansible-automation-platform/automation-job-36767-xcls2/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true": proxy error from localhost:9443 while dialing 10.49.253.6:10250, code 503: 503 Service Unavailable
```
Any clue how we can help them resolve this? They are on the latest AAP.