
wait_for_pod_completed does not detect pod going into error state

Open • RobertKrawitz opened this issue 3 years ago • 1 comment

If a pod goes into an error state, wait_for_pod_completed does not detect it and hangs until the timeout expires:

Method name: wait_for_initialized {'label': 'app=system-metrics-collector', 'workload': 'uperf-vm'} , Start time: 2021-12-09 09:19:22 
Method name: wait_for_initialized , End time: 2021-12-09 09:19:22 , Total time: 0.12 sec
Method name: wait_for_pod_completed {'label': 'app=system-metrics-collector', 'workload': 'uperf-vm'} , Start time: 2021-12-09 09:19:22 

But the pod has gone into an error state:

# oc logs -n benchmark-operator system-metrics-collector-2d2e4517-t6dd2
time="2021-12-09 14:22:47" level=info msg="📁 Creating indexer: elastic"
time="2021-12-09 14:24:57" level=fatal msg="ES health check failed: dial tcp XXX.XXX.XXX.XXX:YYYYY: connect: connection timed out"

RobertKrawitz, Dec 09 '21 15:12

The failure in this case is external (my Elasticsearch server isn't running correctly), but wait_for_pod_completed should still detect the pod failure and error out instead of waiting for the timeout.

RobertKrawitz, Dec 09 '21 15:12
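
One possible shape for the behavior requested above is a polling loop that checks for a terminal failure phase on every iteration, not only for success. The sketch below is illustrative and is not benchmark-runner's actual implementation: `wait_for_pod_completed_sketch`, `PodFailedError`, and the `get_phase` callable are hypothetical names, and it assumes the caller can supply the pod's Kubernetes phase (`Pending`, `Running`, `Succeeded`, `Failed`, or `Unknown`) by whatever means the project already uses to query pod status.

```python
import time
from typing import Callable


class PodFailedError(RuntimeError):
    """Hypothetical exception: the pod reached the terminal 'Failed' phase."""


def wait_for_pod_completed_sketch(
    get_phase: Callable[[], str],
    timeout: float = 300.0,
    interval: float = 1.0,
) -> bool:
    """Poll get_phase() until the pod succeeds, fails, or the timeout expires.

    get_phase is an assumed callable returning the Kubernetes pod phase
    as a string; how it is obtained is outside the scope of this sketch.
    """
    deadline = time.monotonic() + timeout
    while True:
        phase = get_phase()
        if phase == "Succeeded":
            return True
        if phase == "Failed":
            # Fail fast instead of waiting out the full timeout.
            raise PodFailedError("pod entered the Failed phase")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"pod did not complete within {timeout} seconds")
        time.sleep(interval)
```

With a check like this, a pod whose container exits fatally (as in the Elasticsearch health-check log above) would be reported as soon as its phase becomes `Failed`, assuming the pod's restart policy lets it reach that phase.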