'Error opening pod stream' or 'Stream error running pod' randomly on some jobs


Please confirm the following

  • [X] I agree to follow this project's code of conduct.
  • [X] I have checked the current issues for duplicates.
  • [X] I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

Hi, I have a k3s cluster with AWX installed on it, and everything works fine except for one thing (and it is a very important thing).

Sometimes, at random, a job will not start, and in the output I get this error:

Error opening pod stream: Get "https://ip_of_my_server:10250/containerLogs/awx/automation-job-567099-7hw6l/worker?follow=true": EOF

or more rarely

Stream error running pod: stdin: error dialing backend: EOF, stdout: http2: response body closed

But if I restart the job, it starts and runs normally.

I noticed that it happens when several jobs are running at the same time (10 or more), but not always. For example, I have a big workflow with 32 jobs starting at the same time, twice each morning at 7:00 am and 8:00 am. Sometimes both runs succeed, and sometimes 2 or 3 jobs fail in the first run while the second run is fine. It's very strange.

I looked for a way to limit the number of jobs running in parallel in AWX but couldn't find one, and the number of forks in my AWX configuration is 100. My server has 12 CPUs and 20 GB of RAM, so I don't think it's a resource issue.

My version of K3s is: v1.24.4+k3s1
My version of AWX is: 21.5.0
My system is: CentOS Linux release 7.9.2009

Does anyone have a solution or a workaround for this? It's very problematic: when I have workflows running, the whole workflow can fail because of 1 job with this error.

Thanks

AWX version

21.5.0

Select the relevant components

  • [ ] UI
  • [ ] API
  • [ ] Docs
  • [ ] Collection
  • [ ] CLI
  • [X] Other

Installation method

kubernetes

Modifications

no

Ansible version

2.12.5

Operating system

CentOS Linux release 7.9.2009

Web browser

Firefox, Chrome, Safari, Edge

Steps to reproduce

I think you have to run multiple jobs in parallel, and sometimes one of them will fail.

Expected results

All jobs succeed without these errors.

Actual results

Random jobs fail with these errors:

Error opening pod stream: Get "https://ip_of_my_server:10250/containerLogs/awx/automation-job-567099-7hw6l/worker?follow=true": EOF

or more rarely

Stream error running pod: stdin: error dialing backend: EOF, stdout: http2: response body closed

Additional information

No response

Mrmel94 avatar Sep 14 '22 10:09 Mrmel94

Hello, we feel like we have seen this on our mailing list a few times in the past. You can view those instances here: https://groups.google.com/g/awx-project/search?q=error%20opening%20pod%20stream

Please let us know if any of those threads resolve this issue for you.

djyasin avatar Sep 14 '22 15:09 djyasin

Hi, thank you for your answer, but I had already searched the mailing list for my problem and didn't find any solution. One of the suggested solutions is to upgrade k3s to the latest version, which I did, but the problem is still there.

Thanks

Mrmel94 avatar Sep 14 '22 15:09 Mrmel94

Duplicate of https://github.com/ansible/awx/issues/12549

It seems users are still experiencing this bug.

fosterseth avatar Sep 16 '22 17:09 fosterseth

Having the same issue, using k3s 1.24

HOSTED-POWER avatar Sep 27 '22 18:09 HOSTED-POWER

Adding that this does appear to be a persistent issue on most modern versions of K3s. I upgraded from 1.24 to 1.25 last evening, and the issue has persisted.

VERSION:
   v1.25.0+k3s1 (26e94057)

Stream error running pod: stdin: error dialing backend: EOF, stdout: http2: response body closed

Like others, the issue appears to manifest when multiple slices, regardless of automation task, are running simultaneously despite plenty of available resources. The issue occurs immediately upon container creation.

Describing the pod yields the following, which was an inventory sync operation as the first step of a job with 3 slices. The pod has no logs when terminated.

Name:                      automation-job-429-x6pp6
Namespace:                 awx
Priority:                  0
Service Account:           default
Node:                      <node>/192.168.2.24
Start Time:                Wed, 28 Sep 2022 08:44:20 -0400
Labels:                    ansible-awx=3320e152-be3c-4482-8155-7d028564aaf8
                           ansible-awx-job-id=429
Annotations:               <none>
Status:                    Terminating (lasts <invalid>)
Termination Grace Period:  30s
IP:                        10.42.0.68
IPs:
  IP:  10.42.0.68
Containers:
  worker:
    Container ID:  containerd://5675b037ed694db09b9ade0c2249de28c733a74e1375c8cb11730ed75ec0a62d
    Image:         <image>
    Image ID:      <image>@sha256:fbc23d802372adb8d0583c2beb0ce43324b9de5adb3d6891adeca52a3121e70e
    Port:          <none>
    Host Port:     <none>
    Args:
      ansible-runner
      worker
      --private-data-dir=/runner
    State:          Running
      Started:      Wed, 28 Sep 2022 08:44:21 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        250m
      memory:     100Mi
    Environment:  <none>
    Mounts:       <none>
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:            <none>
QoS Class:          Burstable
Node-Selectors:     <none>
Tolerations:        node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                    node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  11s   default-scheduler  Successfully assigned awx/automation-job-429-x6pp6 to <node>
  Normal  Pulled     11s   kubelet            Container image "<image>" already present on machine
  Normal  Created    11s   kubelet            Created container worker
  Normal  Started    10s   kubelet            Started container worker
  Normal  Killing    8s    kubelet            Stopping container worker

scottharwell avatar Sep 28 '22 12:09 scottharwell

I can easily reproduce this on k3d, v1.24.4+k3s1:

  1. deploy AWX on k3d
  2. launch a handful of demo jobs at once. A good number will fail with the error Stream error running pod: stdin: error dialing backend: EOF, stdout: http2: response body closed

The error occurs when trying to pass stdin into the pod using the Kubernetes SPDY stream option:

https://github.com/ansible/receptor/blob/devel/pkg/workceptor/kubernetes.go#L400

https://pkg.go.dev/k8s.io/client-go/tools/remotecommand#pkg-functions
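
For illustration, here is a minimal client-go sketch of that stdin-over-SPDY path. This is not Receptor's actual code; the attach subresource and the "worker" container name are assumptions based on how the job pods are described above.

package example

import (
    "io"
    "os"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/kubernetes/scheme"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/remotecommand"
)

// streamStdinToPod attaches to a running container over SPDY and feeds it
// stdin -- roughly the mechanism the errors in this issue come from.
// Hedged sketch only, not the Receptor implementation.
func streamStdinToPod(config *rest.Config, clientset *kubernetes.Clientset, namespace, podName string, stdin io.Reader) error {
    req := clientset.CoreV1().RESTClient().
        Post().
        Resource("pods").
        Namespace(namespace).
        Name(podName).
        SubResource("attach").
        VersionedParams(&corev1.PodAttachOptions{
            Container: "worker", // assumed container name, matching the pod describe output above
            Stdin:     true,
            Stdout:    true,
            Stderr:    true,
        }, scheme.ParameterCodec)

    exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
    if err != nil {
        return err
    }

    // Errors such as "error dialing backend: EOF" and
    // "http2: response body closed" bubble up from this call
    // when the kubelet side of the stream is closed mid-transfer.
    return exec.Stream(remotecommand.StreamOptions{
        Stdin:  stdin,
        Stdout: os.Stdout,
        Stderr: os.Stderr,
    })
}

Because stdin stays open for the duration of the job, any drop in that kubelet tunnel surfaces as these errors, which may be why they show up mainly when many jobs are streaming at once.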

Long term I'd like a solution that gets away from passing stdin via SPDY. The receptor stdin is accessible inside the cluster already, so mounting that into the AWX job pod via some volume might be better.

However, that wouldn't work for the remote container group case. Maybe a Receptor sidecar running inside the remote cluster could receive stdin and mount it into the pod.
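
Purely to make that proposal concrete (none of this exists today; the "stdin-relay" sidecar name, the images, and the mount path are invented for illustration), such a job pod might be shaped roughly like this, with a shared emptyDir volume in place of stdin over SPDY:

package example

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// sidecarJobPod sketches the proposed shape: a hypothetical "stdin-relay"
// sidecar receives the work data from Receptor and writes it into a shared
// volume that the worker container reads, instead of the kubelet proxying stdin.
func sidecarJobPod() *corev1.Pod {
    return &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "automation-job-example", Namespace: "awx"},
        Spec: corev1.PodSpec{
            Volumes: []corev1.Volume{{
                Name:         "runner-stdin",
                VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
            }},
            Containers: []corev1.Container{
                {
                    Name:         "stdin-relay", // hypothetical Receptor sidecar
                    Image:        "quay.io/ansible/receptor:devel",
                    VolumeMounts: []corev1.VolumeMount{{Name: "runner-stdin", MountPath: "/runner-stdin"}},
                },
                {
                    Name:         "worker",
                    Image:        "quay.io/ansible/awx-ee:latest",
                    Args:         []string{"ansible-runner", "worker", "--private-data-dir=/runner"},
                    VolumeMounts: []corev1.VolumeMount{{Name: "runner-stdin", MountPath: "/runner-stdin"}},
                },
            },
        },
    }
}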

@shanemcd this appears to be an issue for a number of users

fosterseth avatar Sep 29 '22 02:09 fosterseth

Any update on this? It's very severe if you suffer from it... :)

I fixed it for now by downgrading Kubernetes to 1.21, as suggested here: https://groups.google.com/g/awx-project/c/LKPpxPpXIAk

HOSTED-POWER avatar Oct 04 '22 22:10 HOSTED-POWER

Upgrading to 21.7.0 on 1.25.2+k3s1 fixed the error for me.

nicolaibaralmueller avatar Oct 05 '22 07:10 nicolaibaralmueller

@HOSTED-POWER @nicolaibaralmueller

I have this upstream Receptor PR that I am hoping addresses this issue: https://github.com/ansible/receptor/pull/673

One thing I noticed when replicating this issue:

If I start my k3d services with

k3d cluster create sbf

I can't reproduce the problem.

If I start my k3d services with

k3d cluster create sbf -p "8043:30080@agent:0" --agents 2

I do get the problem. This second method is how I start the cluster so that I expose the NodePort on 30080.

I think the receptor patch I have above will help, but I think there is some environment issue going on too.

fosterseth avatar Oct 05 '22 17:10 fosterseth

The Receptor PR has been merged (see above), and the fix should be in awx-ee:latest as of today.

Redeploying AWX on your k8s clusters should trigger a download of the latest awx-ee image -- no AWX update will be needed to get the fix.

For those who were previously experiencing the bug, please let me know if this helps!

fosterseth avatar Oct 11 '22 05:10 fosterseth

Hi @fosterseth, thank you for your work! But I'm using my own AWX EE, which is built with ansible-builder, so I'm using this base image: quay.io/ansible/ansible-runner:latest

Is there any way for me to get the fix with this setup?

Edit: In fact, I just saw that awx-ee:latest also uses this image, and I think this line in the additional_build_steps should pick up the fix: - COPY --from=quay.io/ansible/receptor:devel /usr/bin/receptor. I will test whether I still have the problem and let you know.

Thanks

Mrmel94 avatar Oct 11 '22 07:10 Mrmel94

@Mrmel94 Just for your information: in my understanding, the container that should use the updated Receptor is not the EE for your job, but the awx-ee container in AWX's pod. If you don't use a custom EE image for the control plane, I think all you need to do is make sure that Kubernetes pulls the new image for the awx-ee container again. How to achieve this depends on your imagePullPolicy.

kurokobo avatar Oct 11 '22 09:10 kurokobo

I am building a single image for both the control plane EE and the actual worker EE, based on quay.io/ansible/awx-ee:latest. However, I started to face an issue building it with ansible-builder, as described in this question.

adpavlov avatar Oct 11 '22 09:10 adpavlov

@kurokobo Yes, I think you're right. I made a mistake because I thought that the awx-ee container in the AWX pod was the same thing as the default EE for my job, which is also called awx-ee. (By the way, thanks for the tutorial on your gitlab page; it's what I used to install my k3s cluster with AWX a few months ago.)

So I changed the image_pull_policy parameter from IfNotPresent to Always in the file awx-operator/config/crd/bases/awx.ansible.com_awxs.yaml, redeployed AWX, and compared the sha256 of my awx-ee container to quay.io/ansible/awx-ee:latest to make sure I have the latest version.

And since yesterday the problem hasn't happened, so maybe your fix @fosterseth has resolved the issue (I hope). I will keep you informed over the next few days.

Thanks

Mrmel94 avatar Oct 13 '22 09:10 Mrmel94

Hi, just for your information: 5 days without any jobs hitting this error. I think your fix @fosterseth resolved the issue! Thanks a lot!!

Mrmel94 avatar Oct 17 '22 07:10 Mrmel94

@Mrmel94 thanks for the feedback, that is helpful. Glad this took care of it :)

fosterseth avatar Oct 17 '22 14:10 fosterseth

I was having the same issue in my environment. After the awx-ee update, it seems to have stopped: 3 days without errors.

lucasvivian avatar Oct 18 '22 13:10 lucasvivian

@lucasvivian thank you for confirming that it is fixed!

fosterseth avatar Oct 26 '22 03:10 fosterseth

Closing this issue, as it seems to be solved in the latest versions of AWX.

fosterseth avatar Nov 15 '22 20:11 fosterseth

One of our customers is seeing something similar.

$ kubectl logs automation-controller-task-668d6ddd55-jsm6x -c automation-controller-ee -n ansible-automation-platform | grep -i proxy | tail -2
WARNING 2023/11/29 13:02:56 [oxWaTqzn] Error opening log stream for pod ansible-automation-platform/automation-job-36767-xcls2. Will retry 5 more times. Error: Get "https://10.49.253.6:10250/containerLogs/ansible-automation-platform/automation-job-36767-xcls2/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true": proxy error from localhost:9443 while dialing 10.49.253.6:10250, code 503: 503 Service Unavailable
WARNING 2023/11/29 13:02:57 [oxWaTqzn] Error opening log stream for pod ansible-automation-platform/automation-job-36767-xcls2. Will retry 4 more times. Error: Get "https://10.49.253.6:10250/containerLogs/ansible-automation-platform/automation-job-36767-xcls2/worker?follow=true&sinceTime=0001-01-01T00%3A00%3A00Z&timestamps=true": proxy error from localhost:9443 while dialing 10.49.253.6:10250, code 503: 503 Service Unavailable
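
For reference, those warnings come from repeatedly trying to open a follow-mode log stream for the job pod. A rough client-go sketch of that kind of retrying open (hypothetical, not the actual Receptor code) looks like this:

package example

import (
    "context"
    "fmt"
    "io"
    "os"
    "time"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/client-go/kubernetes"
)

// followWorkerLogs opens a follow-mode log stream for the worker container
// and retries a few times on failure, similar in spirit to the
// "Will retry N more times" warnings above.
func followWorkerLogs(ctx context.Context, clientset *kubernetes.Clientset, namespace, podName string, retries int) error {
    opts := &corev1.PodLogOptions{
        Container:  "worker",
        Follow:     true,
        Timestamps: true,
    }
    var lastErr error
    for attempt := 0; attempt <= retries; attempt++ {
        // The URL in the warnings above (port 10250, /containerLogs/...) is the
        // kubelet endpoint this pod-logs request ends up being proxied to.
        stream, err := clientset.CoreV1().Pods(namespace).GetLogs(podName, opts).Stream(ctx)
        if err != nil {
            lastErr = err
            fmt.Fprintf(os.Stderr, "error opening log stream, will retry %d more times: %v\n", retries-attempt, err)
            time.Sleep(time.Second)
            continue
        }
        defer stream.Close()
        _, copyErr := io.Copy(os.Stdout, stream)
        return copyErr
    }
    return lastErr
}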

Any clue how we can help them resolve this? They are on the latest AAP.

bhavikbhavsar avatar Nov 30 '23 12:11 bhavikbhavsar