intg test: track by workload rather than pod

Open QuiteStochastic opened this issue 2 years ago • 0 comments

problem:

in test_executive.ml, there's the line let%bind network = Engine.Network_manager.deploy net_manager. it deploys the testnet and records some metadata in network for later tracking. the deploy function calls Kubernetes_network.Workload.get_nodes to find out the ids of the pods.

the pod ids look something like this: test-block-producer-1-67b6b5765c-sfx6t

the "67b6b5765c" is the pod-template-hash that kubernetes adds for the deployment's replicaset (a hash of the pod template, not of the public key), and the "sfx6t" is a random tag that kubernetes gives to every distinct pod. if the pod goes down for whatever reason (crash, preemption, etc.), kubernetes will start a replacement. however, the replacement gets a different pod tag. it'll be, for example, test-block-producer-1-67b6b5765c-6kfhv. the network struct is still going to track the original pod id, and that can cause problems. i'm sure this is the root of some flakiness in the system. if a function tries to reach a pod that died and was replaced, it's going to fail.

i think several recurring errors come down to this problem:
- many errors that say "container not found".
- many of the times when it gets stuck in the beginning at "Waiting for pods to be assigned nodes", then times out and exits with exit code 4. that's because a pod was killed, assigned another node, and received a different pod tag.
- "cannot exec into a container in a completed pod; current phase is Failed", exiting with exit code 5 (i think).
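to illustrate why the workload name is the stable part of the id, here is a minimal sketch. it assumes pod ids always end in two extra dash-separated segments (the hash and the random tag), and workload_of_pod_id is a hypothetical helper, not something that exists in the framework:

```ocaml
(* Hypothetical helper: recover the workload name from a pod id by dropping
   the last two '-'-separated segments (the hash and the per-pod random tag).
   Returns None if the id has fewer than three segments. *)
let workload_of_pod_id (pod_id : string) : string option =
  match List.rev (String.split_on_char '-' pod_id) with
  | _tag :: _hash :: (_ :: _ as rest) ->
      Some (String.concat "-" (List.rev rest))
  | _ -> None

let () =
  let before = "test-block-producer-1-67b6b5765c-sfx6t" in
  let after = "test-block-producer-1-67b6b5765c-6kfhv" in
  (* Both the original pod and its replacement map to the same workload. *)
  assert (workload_of_pod_id before = Some "test-block-producer-1");
  assert (workload_of_pod_id before = workload_of_pod_id after)
```

the pod id changes across a restart, but the value this returns does not, which is why it is the safer key to track.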

solution:

query by workload/deployment instead of tracking the actual pod. pods in kubernetes are expected to change often, whereas the workload/deployment name does not change.
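a rough sketch of what addressing the workload could look like, assuming the framework shells out to kubectl. kubectl exec accepts a TYPE/NAME target such as deployment/NAME and picks a pod behind it; exec_args, the namespace, and the command are made up for illustration, and the real framework may also need a container flag:

```ocaml
(* Hypothetical helper: build kubectl arguments that exec into whichever pod
   currently backs a deployment, so no pod id has to be stored. *)
let exec_args ~(namespace : string) ~(workload : string) (cmd : string list) :
    string list =
  [ "exec"; "deployment/" ^ workload; "--namespace"; namespace; "--" ] @ cmd

let () =
  (* Print the argument list that would be passed to kubectl. *)
  exec_args ~namespace:"my-testnet" ~workload:"test-block-producer-1"
    [ "echo"; "ok" ]
  |> String.concat " " |> print_endline
```

since the target is resolved by kubernetes at call time, a replaced pod does not invalidate anything the test executive has stored.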

other, outdated solution: the network struct should be mutable. every time the framework tries to do something directly to a pod, it should call something like Kubernetes_network.Workload.get_nodes to make sure the network struct is up to date. in particular, before checking a node's status or sending graphql queries to a node, it should refresh the network struct first.
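for comparison, the refresh-before-use idea could be sketched like this. the node and network types are simplified stand-ins, and lookup is a plain function standing in for something like Kubernetes_network.Workload.get_nodes, whose real signature is not shown here:

```ocaml
(* Simplified, hypothetical stand-ins for the framework's types. *)
type node = { workload : string; mutable pod_id : string }
type network = { nodes : node list }

(* [lookup workload] is assumed to return the id of the pod currently backing
   the workload, if any. *)
let refresh ~(lookup : string -> string option) (net : network) : unit =
  List.iter
    (fun n ->
      match lookup n.workload with
      | Some current -> n.pod_id <- current
      | None -> () (* keep the last known id if nothing is found *))
    net.nodes

(* Refresh the whole struct, then run [f] against the node's current pod id. *)
let with_fresh_pod ~lookup (net : network) (n : node) (f : string -> 'a) : 'a =
  refresh ~lookup net;
  f n.pod_id

let () =
  let n =
    { workload = "test-block-producer-1";
      pod_id = "test-block-producer-1-67b6b5765c-sfx6t" }
  in
  let net = { nodes = [ n ] } in
  (* Pretend the pod was replaced and the cluster now reports a new id. *)
  let lookup _ = Some "test-block-producer-1-67b6b5765c-6kfhv" in
  let seen = with_fresh_pod ~lookup net n (fun id -> id) in
  assert (seen = "test-block-producer-1-67b6b5765c-6kfhv")
```

this still leaves a window between the refresh and the actual call in which the pod can be replaced, which is one reason querying by workload is the simpler fix.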

QuiteStochastic, Jul 20 '22 23:07