mina
integration test: track by workload rather than pod
problem:
in test_executive.ml, there's a line let%bind network = Engine.Network_manager.deploy net_manager. it deploys the testnet and records some metadata to track in network. the deploy function calls Kubernetes_network.Workload.get_nodes to find out the ids of the pods.
the pod ids look something like this: test-block-producer-1-67b6b5765c-sfx6t
the "67b6b5765c" is the pod-template hash of the ReplicaSet that the deployment creates (it is not derived from the public key), and the "sfx6t" is a random suffix that kubernetes gives to every distinct pod. if the pod goes down or gets preempted for whatever reason, kubernetes will start a replacement. however, the replacement gets a different suffix, so the pod id becomes, for example, test-block-producer-1-67b6b5765c-6kfhv. the network struct is still going to track the original pod id, and that can cause problems.
i'm sure this is the root of some flakiness in the system. if a function tries to reach a pod that died and was replaced, it's going to fail. i think many errors that say "container not found" are at root this problem. also, many of the times when the test gets stuck at the beginning at "Waiting for pods to be assigned nodes", then times out and exits with exit code 4, that's because a pod was killed, rescheduled onto another node, and given a different pod suffix. and if it says "cannot exec into a container in a completed pod; current phase is Failed" and exits with exit code 5 (i think), then this bug is also at the root of it.
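to make the naming concrete, here is a minimal sketch that splits a pod id into the part that stays stable and the parts that change. split_pod_name and same_workload are hypothetical helpers for illustration, not functions from the test framework, and the sketch assumes the pods come from a Deployment (names of the form workload-hash-suffix).

```ocaml
(* Hypothetical helpers, not part of the integration test framework.
   Assumes Deployment-style pod names: <workload>-<template hash>-<suffix>. *)
type pod_name_parts =
  { workload : string (* stable across pod restarts *)
  ; template_hash : string (* changes only when the pod template changes *)
  ; pod_suffix : string (* random, different for every replacement pod *)
  }

(* Split on '-' and peel the last two segments off the end. Returns None when
   the name has fewer than three segments. *)
let split_pod_name (pod_id : string) : pod_name_parts option =
  match List.rev (String.split_on_char '-' pod_id) with
  | pod_suffix :: template_hash :: (_ :: _ as rest) ->
      Some
        { workload = String.concat "-" (List.rev rest)
        ; template_hash
        ; pod_suffix
        }
  | _ -> None

(* Two pod ids refer to the same node if their workload parts match. *)
let same_workload (a : string) (b : string) : bool =
  match (split_pod_name a, split_pod_name b) with
  | Some x, Some y -> String.equal x.workload y.workload
  | _ -> false
```

with this, test-block-producer-1-67b6b5765c-sfx6t and test-block-producer-1-67b6b5765c-6kfhv both map to the workload test-block-producer-1, which is the name that survives a restart.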
solution:
query by workload/deployment instead of looking up the actual pod. pods in kubernetes are expected to change often, whereas the workload/deployment name stays the same.
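one way this could look, as a rough sketch only: keep a reference to the workload and build kubectl commands against deployment/NAME, which kubectl exec accepts in place of a pod name. the workload_ref type, exec_args, command_line, the namespace "my-testnet" and the container name "mina" are all assumptions made up for illustration; the real framework types may look different.

```ocaml
(* Hypothetical sketch: address a node by its workload, never by a pod id. *)
type workload_ref =
  { namespace : string
  ; deployment : string (* e.g. "test-block-producer-1" *)
  ; container : string (* assumed container name inside the pod *)
  }

(* Arguments for `kubectl exec`. Passing "deployment/<name>" instead of a pod
   name lets kubectl resolve a current pod of that deployment at call time. *)
let exec_args (w : workload_ref) (cmd : string list) : string list =
  [ "exec"
  ; "-n"
  ; w.namespace
  ; "deployment/" ^ w.deployment
  ; "-c"
  ; w.container
  ; "--"
  ]
  @ cmd

(* The full command line as a single string, for logging or for a shell. *)
let command_line (w : workload_ref) (cmd : string list) : string =
  String.concat " " ("kubectl" :: exec_args w cmd)
```

for a workload_ref with namespace "my-testnet", deployment "test-block-producer-1" and container "mina", command_line w ["mina"; "client"; "status"] gives kubectl exec -n my-testnet deployment/test-block-producer-1 -c mina -- mina client status. no pod suffix is stored anywhere, so a replaced pod does not invalidate the reference.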
other (outdated) solution: make the network struct mutable. every time the framework tries to do something directly to a pod, it should call something like Kubernetes_network.Workload.get_nodes and make sure the network struct is up to date. before checking on node status or sending graphql queries to a node, we should refresh the network struct first.
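for comparison, the refresh-before-use idea could be sketched roughly like this. the network record here is a simplified stand-in, and fetch is an assumed callback playing the role of something like Kubernetes_network.Workload.get_nodes, whose real signature may well differ.

```ocaml
(* Hypothetical, simplified stand-in for the tracked network metadata. *)
type network = { mutable pod_ids : string list }

(* [fetch] is an assumed callback that returns the pod ids that are live right
   now. The tracked ids are overwritten before [action] sees them, so [action]
   never runs against ids recorded at deploy time. *)
let with_fresh_network ~(fetch : unit -> string list) (net : network)
    (action : string list -> 'a) : 'a =
  net.pod_ids <- fetch ();
  action net.pod_ids
```

a pod can still be replaced in the window between the refresh and the action, which is a gap that addressing the workload directly does not have.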