litmus Failed to confirm cluster - Agent status is pending on AWS EKS

What happened:

I am trying to install LitmusChaos on AWS EKS. The agent status is shown as pending. The subscriber-agent log shows the following error.

time="2022-05-20T03:21:02Z" level=info msg="Go Version: go1.16.15"
time="2022-05-20T03:21:02Z" level=info msg="Go OS/Arch: linux/amd64"
time="2022-05-20T03:21:02Z" level=info msg="All agent deployments are up"
time="2022-05-20T03:21:02Z" level=info msg="Starting the subscriber"
time="2022-05-20T03:21:32Z" level=fatal msg="Failed to confirm cluster" data= error="Post \"http://3.12.155.46:31093/query\": dial tcp 3.12.155.46:31093: i/o timeout"

What you expected to happen:

Subscriber agent should be up and running without any issues.

Where can this issue be corrected? (optional) NA

How to reproduce it (as minimally and precisely as possible):

Run kubectl apply -f https://raw.githubusercontent.com/litmuschaos/litmus/master/mkdocs/docs/2.9.0/litmus-2.9.0.yaml on AWS EKS.
Run kubectl patch svc litmusportal-frontend-service -p '{"spec": {"type": "LoadBalancer"}}' -n <LITMUS_PORTAL_NAMESPACE> to get the URL.
Open the URL and log in.
Click ChaosAgents.

Anything else we need to know?: Kubernetes is 1.22 on AWS EKS; LitmusChaos is 2.9.0.

May 20 '22 03:05 QAInsights

Same thing is happening to me, but I installed Litmus using the helm chart instructions and kept the default NodePort service type.

I also opened ALL traffic from/to my nodes in the Security Group (for troubleshooting purposes) and restarted the subscriber pod, but I'm still getting the same error.

Any other ideas?

May 24 '22 10:05 ryuzakyl

Can you check whether the URL shown in the error, http://3.12.155.46:31093, is accessible from the cluster you are deploying your agent in?
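
A rough sketch of how that check could be scripted from a shell inside a pod in the agent's cluster. It leans on curl's exit codes (7 for a failed connection, 28 for a timeout); the helper names classify_curl_exit and probe are made up for illustration, and the 5-second connect timeout is an arbitrary choice.

```shell
# classify_curl_exit CODE
# Map a curl exit status to a rough diagnosis (hypothetical helper).
classify_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;  # an HTTP response came back, whatever its status
    7)  echo "refused" ;;    # could not connect: typically nothing listens on that port
    28) echo "timeout" ;;    # no answer in time: often a firewall or security group drop
    *)  echo "other" ;;      # e.g. 1, which curl reports for "Received HTTP/0.9 when not allowed"
  esac
}

# probe URL
# POST to URL with a short connect timeout and print the diagnosis.
probe() {
  rc=0
  curl -s -o /dev/null --connect-timeout 5 -X POST "$1" || rc=$?
  classify_curl_exit "$rc"
}

# Example, to be run inside the agent's cluster:
#   probe http://3.12.155.46:31093/query
```

A timeout, as in the original report, usually means traffic is being dropped on the way. A refusal usually means the host is reachable but nothing is listening on that port.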

May 24 '22 10:05 gdsoumya

I've created a pod and tried to access the URL with no luck (in my case I get a Connection refused):

bash-5.1# curl -X POST http://10.6.x.y:31365/query
curl: (7) Failed to connect to 10.6.x.y port 31365 after 0 ms: Connection refused

I've gone further and tried reaching all the endpoints corresponding to the NodePort services created (ports :31714, :31720, :31797, :32479 and :31044):

NAME                               TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                         AGE
chaos-exporter                     ClusterIP   172.20.a.b       <none>        8080/TCP                        57m
chaos-litmus-auth-server-service   NodePort    172.20.c.d       <none>        9003:31714/TCP,3030:31720/TCP   58m
chaos-litmus-frontend-service      NodePort    172.20.e.f       <none>        9091:31797/TCP                  58m
chaos-litmus-headless-mongo        ClusterIP   172.20.g.h       <none>        27017/TCP                       58m
chaos-litmus-mongo                 ClusterIP   172.20.i.j       <none>        27017/TCP                       58m
chaos-litmus-server-service        NodePort    172.20.k.l       <none>        9002:32479/TCP,8000:31044/TCP   58m
chaos-operator-metrics             ClusterIP   172.20.m.n       <none>        8383/TCP                        57m
workflow-controller-metrics        ClusterIP   172.20.o.p       <none>        9090/TCP                        57m

and I get a response (no timeout, no connection refused) from ALL of them:

bash-5.1# curl http://10.6.x.y:31714
{"error":"unauthorized","error_description":"The user does not have requested authorization to access this resource"}

bash-5.1# curl http://10.6.x.y:31720
curl: (1) Received HTTP/0.9 when not allowed

bash-5.1# curl http://10.6.x.y:31797
<RESPONSE-CONTENT-REDACTED>

bash-5.1# curl http://10.6.x.y:32479
<RESPONSE-CONTENT-REDACTED>

bash-5.1# curl http://10.6.x.y:31044
curl: (1) Received HTTP/0.9 when not allowed

Looking at that list of ports, I'm not sure where the :31365 port comes from, as it's not one of the dynamically allocated ports for NodePort services.

Could that be a bug or am I missing something here?
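
For anyone wanting to make this comparison mechanical, here is a minimal sketch that pulls the node ports out of the kubectl get svc table and checks whether the port from the failing URL is among them. The helper names are hypothetical, the litmus namespace in the example is an assumption, and it expects a grep that supports -o and a sed that supports -E.

```shell
# list_nodeports
# Read `kubectl get svc` table output on stdin and print one node port per line.
# PORT(S) entries look like 9002:32479/TCP; the number after the colon is the node port.
list_nodeports() {
  grep -oE '[0-9]+:[0-9]+/(TCP|UDP)' | sed -E 's#^[0-9]+:([0-9]+)/.*#\1#'
}

# url_port URL
# Print the port of a URL of the form scheme://host:port/path.
url_port() {
  echo "$1" | sed -E 's#^[a-z]+://[^/:]+:([0-9]+).*#\1#'
}

# port_is_exposed PORT
# Succeed if PORT is one of the node ports in the table read from stdin.
port_is_exposed() {
  list_nodeports | grep -qx "$1"
}

# Example (namespace is an assumption):
#   kubectl get svc -n litmus \
#     | port_is_exposed "$(url_port http://10.6.x.y:31365/query)" \
#     && echo "port matches a service" || echo "port is stale"
```

With the service list above, 31365 is not in the output of list_nodeports, which matches the Connection refused result.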

May 24 '22 10:05 ryuzakyl

It seems the NodePort changed after the agent was installed. @ryuzakyl, can you edit the agent-cm ConfigMap in the agent cluster and update the server URL to use port 32479?
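
A sketch of how that edit might be done non-interactively. The ConfigMap name agent-cm comes from this comment; the key SERVER_ADDR, the namespace litmus, and the deployment name subscriber are assumptions to verify against your install (for example with kubectl get configmap -n <namespace>). The two helpers are hypothetical and do no JSON escaping, so they only suit plain URLs.

```shell
# swap_port URL NEW_PORT
# Replace the port in scheme://host:port/path with NEW_PORT.
swap_port() {
  echo "$1" | sed -E "s#^([a-z]+://[^/:]+):[0-9]+#\1:$2#"
}

# build_patch KEY VALUE
# Print a JSON merge patch that sets one entry under a ConfigMap's data.
build_patch() {
  printf '{"data":{"%s":"%s"}}' "$1" "$2"
}

# Usage sketch (not executed here; NS, CM and KEY are assumptions):
#   NS=litmus; CM=agent-cm; KEY=SERVER_ADDR
#   old=$(kubectl get configmap "$CM" -n "$NS" -o jsonpath="{.data.$KEY}")
#   new=$(swap_port "$old" 32479)
#   kubectl patch configmap "$CM" -n "$NS" --type merge -p "$(build_patch "$KEY" "$new")"
#   kubectl rollout restart deployment/subscriber -n "$NS"
```

The restart at the end is there because the subscriber may read the server URL only at startup; kubectl edit configmap followed by deleting the pod would have the same effect.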

May 24 '22 13:05 gdsoumya

@gdsoumya @ryuzakyl I changed the front-end and back-end services to ClusterIP. Now the agent is up and running.

May 24 '22 17:05 QAInsights

In ClusterIP mode you won't be able to connect external agents, so if you need external agents you will have to change it to NodePort later. I suspect that in your case it was either a firewall configuration issue or the NodePort changed for some reason after the agent was installed, as in @ryuzakyl's case.

May 25 '22 03:05 gdsoumya

@gdsoumya You were right about the NodePort being out of sync. The reason was that the Helm chart does not clean up the ConfigMaps when we run helm uninstall ...

As I don't need external agents, I decided to switch to ClusterIP services the same way @QAInsights did, and use the .yaml manifest to do the install/uninstall operations of Litmus.

Thanks for the help ;).

Jun 02 '22 09:06 ryuzakyl

@gdsoumya what if we need to configure both self and external agents? What would be the configuration?

Jun 02 '22 13:06 QAInsights

Hi @QAInsights, in v2.11.0 the endpoint for the self-agent goes through the FQDN by default. For external agents, you can change the service type to NodePort/LoadBalancer or use an Ingress.
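
To illustrate what an in-cluster FQDN endpoint looks like, here is a small sketch that builds one from a service name, namespace and service port using the standard Kubernetes service DNS form. The helper name, the litmus namespace and the default cluster.local domain are assumptions; the service name and port 9002 are taken from the service listing earlier in this thread, and whether v2.11.0 builds its address exactly this way is not confirmed here.

```shell
# service_fqdn_url SERVICE NAMESPACE PORT PATH
# Print an in-cluster URL of the form http://<service>.<namespace>.svc.cluster.local:<port><path>.
# Assumes the default cluster domain, cluster.local.
service_fqdn_url() {
  printf 'http://%s.%s.svc.cluster.local:%s%s\n' "$1" "$2" "$3" "$4"
}

# Example (namespace is an assumption):
#   service_fqdn_url chaos-litmus-server-service litmus 9002 /query
```

An address of this form uses the service port rather than the node port, so it does not go stale when a NodePort is reallocated, but it only resolves from inside the same cluster.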

Aug 08 '22 10:08 Jonsy13

Hi @QAInsights, closing this issue; I hope your problem was resolved. Feel free to reopen it if the issue persists.

Oct 12 '22 13:10 Jonsy13

Hi all, I am getting the same issue but couldn't follow the troubleshooting you did. Could you describe it more clearly? It would be very helpful.

Jan 30 '23 16:01 Shashankreddy6

Hi, I'd like to ask about the status of this thread. Could anyone help me with this issue? I am confused about the NodePort changes here.

Jan 30 '23 21:01 Shashankreddy6

I still face this error on Litmus from v3.0.0.

Dec 07 '23 15:12 abdiakhate
