Subscriber Restart Disconnects Cluster External Agent

Open cryslam opened this issue 3 years ago • 5 comments

What happened: I have observed that whenever the subscriber pods in my other clusters restart, the connection to the control plane is lost. The logs show that the cluster connection was established and listening, but then there is a GraphQL error stating that the cluster is already connected. Only after I do a rollout restart on the subscriber deployment does the external agent update to Active in the control plane / chaos center.

What you expected to happen: I expect the connection to always be Active until I disconnect the agent from the control plane.

Anything else we need to know?: Logs below from subscriber pod

kubectl logs subscriber-8dcbf4885-8f27x -n litmus
time="2022-04-27T16:57:46Z" level=info msg="Go Version: go1.16.14"
time="2022-04-27T16:57:46Z" level=info msg="Go OS/Arch: linux/amd64"
time="2022-04-27T16:57:46Z" level=info msg="all deployments up"
time="2022-04-27T16:57:46Z" level=info msg="all components live...starting up subscriber"
time="2022-04-27T16:57:46Z" level=info msg="connecting to ws://${OUR _INTERNAL_LB}:9002/query"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT a1806278-fbf0-4999-ac55-ac0a1d35802b ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1649965880-3913627000\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1649965880-1500482059\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1649965880-3788775506\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1649965880-2389293678\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1649965880-1410664282\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT baaf162d-0563-4f9e-a2f0-5fbcbfde9eb5 ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"custom-chaos-workflow-1650309445-688291500\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"custom-chaos-workflow-1650309445-776107758\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT b6cbc4a2-795c-4ef6-a154-a4b666448b8f ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650310299-656459710\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650310299-3123653494\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650310299-733461431\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650310299-1824021690\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650310299-3503105988\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 7ac54e9d-2d6e-458d-9a40-04fff38ade01 ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"custom-chaos-workflow-1650406982-441984963\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 3a41a565-163f-484c-8887-1406bc0e56de ADD"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 3e415275-0269-498e-8ed7-f48f1bd784c6 ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="chaosengines.litmuschaos.io \"app-pod-network-loss\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 7efdaf1d-e4a7-4859-bb83-f00e720ded97 ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1649968443-2536634243\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1649968443-451715378\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1649968443-480145366\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1649968443-3275782938\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1649968443-255623040\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 0de9db7b-5f20-4c8d-9c84-f72db42204d1 ADD"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-network-chaos-1650466188-3569519041\" not found"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-network-chaos-1650466188-3775858327\" not found"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-network-chaos-1650466188-1910317501\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-network-chaos-1650466188-2297947959\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-network-chaos-1650466188-2906505791\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-network-chaos-1650466188-102234551\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 0dbd5a10-9c84-475a-be4f-4dc762a47488 ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="chaosengines.litmuschaos.io \"app-pod-network-loss\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 93115e89-50e9-4db4-ad43-3bd9ac8010b5 ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"kiali-app-compute-storage-chaos-1650336013-1477581973\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"kiali-app-compute-storage-chaos-1650336013-2617216295\" not found"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"kiali-app-compute-storage-chaos-1650336013-2544508318\" not found"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"kiali-app-compute-storage-chaos-1650336013-1938889127\" not found"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"kiali-app-compute-storage-chaos-1650336013-343645195\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT bfaadf81-0531-4aed-9c82-18aeb1d1041d ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650337005-2930724730\" not found"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650337005-4150785784\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650337005-502783056\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650337005-73255085\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650337005-73306492\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 918c3f55-99b6-4c8d-bcb7-1f144a4ba0b3 ADD"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"pod-mem-hog-custom-chaos-workflow-1650036933-3423601540\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 2da29389-d762-4a0f-9a63-a1fcf40d9eea ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650312716-2436886287\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650312716-3598356347\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 989b23c2-b6e1-433b-868b-4c0da9500a89 ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650335985-1439726534\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650335985-3671274228\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650335985-1894300885\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650335985-3193788743\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650335985-3887398078\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT a44553da-a46f-443d-a72a-de89c7cf03a7 ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650311061-3966184222\" not found"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650311061-2198713499\" not found"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650311061-584186354\" not found"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650311061-1386825063\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650311061-3421181727\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 51bb8ee4-11ca-492f-a304-51bdd3cd6c8a ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650334657-1937552831\" not found"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650334657-3406520715\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT e8795e6d-cb7c-40b0-bf70-89c6f67d187e ADD"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"custom-chaos-workflow-1650402505-3737902564\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 91e724a2-7c59-4bdf-a93d-29bf31eee2db ADD"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"pod-mem-hog-custom-chaos-workflow-1650307270-1255089079\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT b600b180-afcf-4c32-839a-6b16217aa316 ADD"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"podtato-head-1646421821-2392795814\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 84de4c3f-8dcf-42a1-8c58-4b97c86c0fa0 ADD"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="chaosengines.litmuschaos.io \"app-pod-network-latency\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="chaosengines.litmuschaos.io \"app-pod-network-duplication\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="chaosengines.litmuschaos.io \"app-pod-network-partition\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="chaosengines.litmuschaos.io \"app-pod-network-loss\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="chaosengines.litmuschaos.io \"app-pod-dns-error\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="chaosengines.litmuschaos.io \"app-pod-dns-spoof\" not found"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="chaosengines.litmuschaos.io \"app-pod-network-corruption\" not found"
time="2022-04-27T16:57:46Z" level=info msg="WORKFLOW EVENT 10df09b2-6725-4df5-a5fc-764f61521db5 ADD"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650307577-3642230217\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650307577-3711898857\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650307577-4233762740\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650307577-4071305457\" not found"
time="2022-04-27T16:57:46Z" level=info msg="FAILED PARSING CHAOS ENGINE CRD" error="pods \"djin-shared-app-compute-storage-chaos-1650307577-3871545623\" not found"
time="2022-04-27T16:57:46Z" level=info msg="RESPONSE {\"data\":{\"chaosWorkflowRun\":\"Workflow Run Discarded[Duplicate Event]\"}}"
time="2022-04-27T16:57:47Z" level=info msg="Cluster Connect Established, Listening...."
time="2022-04-27T16:57:47Z" level=error msg="graphql error : {\"payload\":{\"errors\":[{\"message\":\"CLUSTER ALREADY CONNECTED\",\"path\":[\"clusterConnect\"]}],\"data\":null},\"type\":\"data\"}\n"

cryslam avatar Apr 27 '22 19:04 cryslam

The subscriber is the point of contact for the portal server. If the subscriber is down, all features relating to that cluster are blocked too, which is why the status of the cluster changes to inactive every time the subscriber goes down. This makes sure that users are aware that the subscriber is down and that they cannot run any workflows or get updates on existing workflows running in the cluster until it connects back. It also gives users an action item: check on the subscriber if it is stuck in the inactive state for a long duration.

The status of the cluster should update automatically when the subscriber comes back up, though. Is that true in your case, or is the status still set to inactive even after the subscriber recovers?

gdsoumya avatar Apr 27 '22 19:04 gdsoumya

Yes, the status is still set to Inactive even after the subscriber recovers and is in a Running state. kubectl describe shows that it restarted once or a few times, and the logs have messages similar to the ones I pasted above:

time="2022-04-27T16:57:47Z" level=info msg="Cluster Connect Established, Listening...."
time="2022-04-27T16:57:47Z" level=error msg="graphql error : {\"payload\":{\"errors\":[{\"message\":\"CLUSTER ALREADY CONNECTED\",\"path\":[\"clusterConnect\"]}],\"data\":null},\"type\":\"data\"}\n"
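
For anyone monitoring for this condition, the error line above can be picked apart programmatically. The following is a minimal Go sketch, not part of Litmus: the `gqlFrame` type and the `graphqlErrors` helper are hypothetical names, and the struct mirrors only the fields visible in the logged payload. It assumes the `msg` value has already been unescaped from the log line.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// gqlFrame mirrors only the fields visible in the logged payload.
// It is an illustrative type, not one taken from the Litmus code base.
type gqlFrame struct {
	Type    string `json:"type"`
	Payload struct {
		Errors []struct {
			Message string   `json:"message"`
			Path    []string `json:"path"`
		} `json:"errors"`
	} `json:"payload"`
}

// graphqlErrors returns the error messages carried by a subscriber log
// message of the form "graphql error : {json}". msg is the unescaped
// msg="..." value, not the raw log line. It returns nil, nil when the
// message is not a graphql error.
func graphqlErrors(msg string) ([]string, error) {
	const prefix = "graphql error : "
	idx := strings.Index(msg, prefix)
	if idx < 0 {
		return nil, nil
	}
	raw := strings.TrimSpace(msg[idx+len(prefix):])
	var f gqlFrame
	if err := json.Unmarshal([]byte(raw), &f); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(f.Payload.Errors))
	for _, e := range f.Payload.Errors {
		out = append(out, e.Message)
	}
	return out, nil
}

func main() {
	msg := `graphql error : {"payload":{"errors":[{"message":"CLUSTER ALREADY CONNECTED","path":["clusterConnect"]}],"data":null},"type":"data"}` + "\n"
	msgs, err := graphqlErrors(msg)
	fmt.Println(msgs, err)
}
```

A check like this could feed an alert whenever a restarted subscriber reports `CLUSTER ALREADY CONNECTED`, instead of waiting for someone to notice the Inactive status.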

cryslam avatar Apr 27 '22 20:04 cryslam

I tried replicating this locally, but every time I delete/disconnect the subscriber pod it automatically connects back and the status is updated as well. Can you confirm whether this happens with all the clusters you have connected, by connecting a new cluster and restarting its subscriber pod? Also, can you share the version of Litmus you are running?

gdsoumya avatar Apr 28 '22 02:04 gdsoumya

I'm running v2.6.0 of litmus and Litmusctl version: v0.7.0

I disconnected the agent and removed all the microservices that get deployed when installing external agents. I then reinstalled it and restarted the subscriber pod with a rollout restart, and I can't replicate the behavior I mentioned above. Sigh.

I wonder if the behavior I was seeing before is related to the subscriber not being able to find previous chaosengines even though it's already connected?

I can also continue to monitor whether this happens again, and do a describe on the subscriber pod if it has restarted and shows as Inactive in the control plane.

cryslam avatar Apr 28 '22 03:04 cryslam

Yes please, that would be good. I don't think the issue is with the chaosengines, because the connection status depends on the uni-directional websocket connection from the server to the subscriber, while the chaosengines are reported through a separate connection altogether. It could also have been an intermittent issue caused by a network or database write failure.

gdsoumya avatar Apr 28 '22 03:04 gdsoumya

Hi, were you able to resolve this issue?

neelanjan00 avatar Oct 18 '22 13:10 neelanjan00

Closing this issue.

imrajdas avatar Mar 15 '23 11:03 imrajdas

I am having the same issue. Any fixes? Right now, I need to rotate the subscriber pod to get it fixed.

ankitjain28may avatar May 09 '23 15:05 ankitjain28may