Failed to confirm cluster - Agent status is pending on AWS EKS
What happened:
I am trying to install LitmusChaos on AWS EKS. The agent status is shown as pending. The subscriber agent log shows the error below.
time="2022-05-20T03:21:02Z" level=info msg="Go Version: go1.16.15"
time="2022-05-20T03:21:02Z" level=info msg="Go OS/Arch: linux/amd64"
time="2022-05-20T03:21:02Z" level=info msg="All agent deployments are up"
time="2022-05-20T03:21:02Z" level=info msg="Starting the subscriber"
time="2022-05-20T03:21:32Z" level=fatal msg="Failed to confirm cluster" data= error="Post \"http://3.12.155.46:31093/query\": dial tcp 3.12.155.46:31093: i/o timeout"
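The fatal line above contains the endpoint the subscriber is trying to reach. As a minimal sketch (the `extract_endpoint` helper is hypothetical, not part of Litmus), the address can be pulled out of a log line so it can be compared against the services that actually exist:

```shell
# Hypothetical helper: print the first http(s)://host:port found on stdin.
extract_endpoint() {
  grep -oE 'https?://[^/"[:space:]]+' | head -n 1
}

# Log line copied from the report above.
log_line='level=fatal msg="Failed to confirm cluster" data= error="Post \"http://3.12.155.46:31093/query\": dial tcp 3.12.155.46:31093: i/o timeout"'
printf '%s\n' "$log_line" | extract_endpoint
# prints http://3.12.155.46:31093
```

On a live cluster the same filter could be fed from `kubectl logs <subscriber-pod> -n <namespace>`; the pod name and namespace depend on the installation.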
What you expected to happen:
Subscriber agent should be up and running without any issues.
Where can this issue be corrected? (optional) NA
How to reproduce it (as minimally and precisely as possible):
- Apply on AWS EKS:
  kubectl apply -f https://raw.githubusercontent.com/litmuschaos/litmus/master/mkdocs/docs/2.9.0/litmus-2.9.0.yaml
- Apply the following to get the URL:
  kubectl patch svc litmusportal-frontend-service -p '{"spec": {"type": "LoadBalancer"}}' -n <LITMUS_PORTAL_NAMESPACE>
- Launch the URL and log in.
- Click ChaosAgents.
Anything else we need to know?:
k8s is 1.22 on AWS EKS
LitmusChaos is 2.9.0
The same thing is happening to me, but I installed Litmus using the Helm chart instructions and kept the default NodePort service type.
I also opened ALL traffic from/to my nodes in the Security Group (for troubleshooting purposes) and restarted the subscriber pod, but I'm still getting the same error.
Any other ideas?
Can you check if the URL shown in the error, http://3.12.155.46:31093, is accessible from the cluster you are deploying your agent in?
I've created a pod and tried to access the URL with no luck (in my case I get a Connection refused):
bash-5.1# curl -X POST http://10.6.x.y:31365/query
curl: (7) Failed to connect to 10.6.x.y port 31365 after 0 ms: Connection refused
I've gone further and tried reaching all the endpoints corresponding to the NodePort services created (ports :31714, :31720, :31797, :32479 and :31044):
NAME                               TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                         AGE
chaos-exporter                     ClusterIP   172.20.a.b   <none>        8080/TCP                        57m
chaos-litmus-auth-server-service   NodePort    172.20.c.d   <none>        9003:31714/TCP,3030:31720/TCP   58m
chaos-litmus-frontend-service      NodePort    172.20.e.f   <none>        9091:31797/TCP                  58m
chaos-litmus-headless-mongo        ClusterIP   172.20.g.h   <none>        27017/TCP                       58m
chaos-litmus-mongo                 ClusterIP   172.20.i.j   <none>        27017/TCP                       58m
chaos-litmus-server-service        NodePort    172.20.k.l   <none>        9002:32479/TCP,8000:31044/TCP   58m
chaos-operator-metrics             ClusterIP   172.20.m.n   <none>        8383/TCP                        57m
workflow-controller-metrics        ClusterIP   172.20.o.p   <none>        9090/TCP                        57m
and I get a response (no timeout, no connection refused) from ALL of them:
bash-5.1# curl http://10.6.x.y:31714
{"error":"unauthorized","error_description":"The user does not have requested authorization to access this resource"}
bash-5.1# curl http://10.6.x.y:31720
curl: (1) Received HTTP/0.9 when not allowed
bash-5.1# curl http://10.6.x.y:31797
<RESPONSE-CONTENT-REDACTED>
bash-5.1# curl http://10.6.x.y:32479
<RESPONSE-CONTENT-REDACTED>
bash-5.1# curl http://10.6.x.y:31044
curl: (1) Received HTTP/0.9 when not allowed
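The three failure modes seen so far point in different directions: an i/o timeout usually means packets are being dropped (firewall, security group, unreachable address), "Connection refused" usually means the host is reachable but nothing is listening on that port, and "Received HTTP/0.9" means something is listening but is not answering in plain HTTP/1.x. As a rough sketch (the `classify_curl_exit` helper is hypothetical), curl's documented exit codes can be mapped to those cases:

```shell
# Hypothetical helper: map a curl exit code to a likely cause.
# Codes per the curl man page: 1 unsupported protocol, 7 failed to connect,
# 28 operation timeout.
classify_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    1)  echo "listening-but-not-http" ;;
    7)  echo "refused" ;;
    28) echo "timeout" ;;
    *)  echo "other" ;;
  esac
}

classify_curl_exit 7
# prints refused
```

A possible use from inside a pod is `curl -sS -m 10 -o /dev/null -X POST http://<node-ip>:<node-port>/query; classify_curl_exit $?`, where the address and port are whatever the subscriber log reports.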
Looking at that list of ports, I'm not sure where the :31365 port comes from, as it's not one of the dynamically allocated ports for the NodePort services.
Could that be a bug or am I missing something here?
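The comparison described above (is the port from the error among the currently allocated NodePorts?) can be automated. This is a sketch only: `list_node_ports` and `port_is_allocated` are hypothetical helpers that parse the text of a `kubectl get svc` table, and the sample rows are copied from the listing above.

```shell
# Hypothetical helper: read a `kubectl get svc` table on stdin and print each NodePort.
# NodePorts appear as the number after the colon in PORT(S), e.g. 9002:32479/TCP.
list_node_ports() {
  grep -oE ':[0-9]+/(TCP|UDP|SCTP)' | tr -d ':' | cut -d/ -f1
}

# Hypothetical helper: succeed if PORT ($1) is one of the NodePorts in the table on stdin.
port_is_allocated() {
  list_node_ports | grep -qx "$1"
}

svc_table='chaos-litmus-auth-server-service NodePort 172.20.c.d <none> 9003:31714/TCP,3030:31720/TCP 58m
chaos-litmus-frontend-service NodePort 172.20.e.f <none> 9091:31797/TCP 58m
chaos-litmus-server-service NodePort 172.20.k.l <none> 9002:32479/TCP,8000:31044/TCP 58m'

for port in 31365 32479; do
  if printf '%s\n' "$svc_table" | port_is_allocated "$port"; then
    echo "$port is a current NodePort"
  else
    echo "$port is not allocated to any listed service"
  fi
done
```

Against a live cluster, the table would come from `kubectl get svc -n <namespace>` instead of the hard-coded sample.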
It seems the NodePort changed after the agent was installed. Can you edit the agent-cm ConfigMap in the agent cluster and update the server URL with port 32479? @ryuzakyl
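As an illustration of that fix, the stale port in the stored server URL only needs to be swapped for the current NodePort. The `replace_port` helper below is hypothetical, and the exact ConfigMap name and key holding the URL are assumptions that may differ between Litmus versions, so list the ConfigMaps in the agent namespace first.

```shell
# Hypothetical helper: replace the port in an http(s)://host:port/... URL.
# usage: replace_port URL NEW_PORT
replace_port() {
  printf '%s\n' "$1" | sed -E "s#^(https?://[^/:]+):[0-9]+#\1:$2#"
}

replace_port "http://10.6.x.y:31365/query" 32479
# prints http://10.6.x.y:32479/query
```

The corrected value can then be written back with `kubectl edit configmap <name> -n <agent-namespace>`. The subscriber pod likely needs a restart afterwards to pick up the new value.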
@gdsoumya @ryuzakyl I changed the front-end and back-end services to ClusterIP. Now the agent is up and running.
In ClusterIP mode you won't be able to connect external agents, so if you need external agents you will have to change it back to NodePort later. I suspect that in your case it was either a firewall configuration issue or the NodePort changed for some reason after agent installation, as in @ryuzakyl's case.
@gdsoumya You were right about the NodePort being out of sync. The reason was that the Helm chart does not clean up the ConfigMaps on uninstall (helm uninstall ...).
As I don't need external agents, I decided to switch to ClusterIP services the same way @QAInsights did, and to use the .yaml manifest for the install/uninstall operations of Litmus.
Thanks for the help ;).
@gdsoumya what if we need to configure both self and external agents? What would be the configuration?
Hi @QAInsights, in v2.11.0 the endpoint for the self-agent now goes through an FQDN by default. For external agents you can change the service type to NodePort/LoadBalancer or use an Ingress.
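For readers unfamiliar with the term, an in-cluster FQDN for a Service follows the standard Kubernetes DNS form `<service>.<namespace>.svc.cluster.local`. The sketch below builds such a URL; the `service_fqdn_url` helper is hypothetical, the `litmus` namespace and default cluster domain are assumptions, and the service name, port 9002 and `/query` path are taken from the listing and error earlier in this thread rather than from the Litmus source.

```shell
# Hypothetical helper: build an in-cluster URL for a Service.
# usage: service_fqdn_url SERVICE NAMESPACE PORT [PATH]
# Assumes the default cluster domain, cluster.local.
service_fqdn_url() {
  printf 'http://%s.%s.svc.cluster.local:%s%s\n' "$1" "$2" "$3" "${4:-}"
}

service_fqdn_url chaos-litmus-server-service litmus 9002 /query
# prints http://chaos-litmus-server-service.litmus.svc.cluster.local:9002/query
```

Such a name resolves only from inside the cluster and does not depend on NodePort allocation. This matches the note above that external agents still need NodePort, LoadBalancer or Ingress.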
Hi @QAInsights, closing this issue; I hope it was resolved. Feel free to reopen it if the issue persists.
Hi all, I am getting the same issue but couldn't follow the troubleshooting you did. Could you explain it more clearly? It would be very helpful.
Hi, I'd like to ask how this thread was resolved. Could anyone help me with this issue? I am confused about the NodePort changes here.
I still face this error in Litmus as of v3.0.0.