
[CI] integration: Increase pod and job wait timeout to 45s

Open • burak-ok opened this issue 10 months ago • 9 comments

integration: Increase pod and job ready timeout to 45s

We are seeing CI test failures because of timeouts, so we could wait an extra 15s (the default timeout is 30s). Failing runs (a sketch of the longer wait follows the list):

  • https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8448407680/job/23142053593#step:11:2168
  • https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8448407680/job/23142052066 (log too long for a direct link to the line):
    2024-03-27T08:52:39.6903877Z === NAME  TestTraceOOMKill
    2024-03-27T08:52:39.6904729Z     command.go:483: Command returned(WaitForTestPod):
    2024-03-27T08:52:39.6905683Z         error: timed out waiting for the condition on pods/test-pod
    
  • https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8359069409/job/22881852532#step:7:2759
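
At the kubectl level, the longer wait would look roughly like the commands below. This is only a sketch: the real change goes through the integration test helpers (WaitForTestPod / WaitForJob), and the namespace and resource names here are placeholders taken from the failing tests.

    # Wait up to 45s (instead of the 30s default) for the test pod to become Ready.
    kubectl wait pod --for condition=ready --timeout=45s -n test-namespace test-pod

    # Same idea for jobs created by the tests.
    kubectl wait job --for condition=complete --timeout=45s -n test-namespace copier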

burak-ok • Mar 27 '24 11:03

This does not tackle the root cause:

command.go:483: Command returned(RunOomkillTestPod):

pod/test-pod created

command.go:481: Run command(WaitForTestPod):
kubectl wait pod --for condition=ready -n test-trace-oomkill-8931607124889025702 test-pod
command.go:483: Command returned(WaitForTestPod):
error: timed out waiting for the condition on pods/test-pod
NAME       READY   STATUS      RESTARTS   AGE   IP           NODE                                NOMINATED NODE   READINESS GATES
test-pod   0/1     OOMKilled   0          41s   10.244.2.9   aks-nodepool1-35974623-vmss000001   <none>           <none>
...

Please see: https://github.com/inspektor-gadget/inspektor-gadget/pull/2649

eiffel-fl • Mar 27 '24 11:03

> Please see: #2649

This is not only about OOMKill. Please also see the tests from the first and last links. The issues we have with the CI will not be solved by a single PR; it's not a single issue:

=== NAME  TestRunInsecure
  command.go:483: Command returned(WaitForJob: copier):
       error: timed out waiting for the condition on jobs/copier

and

=== NAME  TestProfileCpu
    command.go:483: Command returned(WaitForTestPod):
        error: timed out waiting for the condition on pods/test-pod

burak-ok • Mar 27 '24 12:03

> > Please see: #2649
>
> This is not only about OOMKill. Please also see the tests from the first and last links. The issues we have with the CI will not be solved by a single PR; it's not a single issue:
>
> === NAME  TestRunInsecure
>   command.go:483: Command returned(WaitForJob: copier):
>        error: timed out waiting for the condition on jobs/copier
>
> and
>
> === NAME  TestProfileCpu
>     command.go:483: Command returned(WaitForTestPod):
>         error: timed out waiting for the condition on pods/test-pod

I stick to my guns: you can indeed increase the waiting time, but the best solution would be to find out why these timeouts happen. For OOMKill, we now know why; for the others, so far, I do not have any clue.

eiffel-fl • Mar 27 '24 12:03

> I stick to my guns: you can indeed increase the waiting time, but the best solution would be to find out why these timeouts happen. For OOMKill, we now know why; for the others, so far, I do not have any clue.

For the other examples (besides the OOMKill one) I have no idea what it might be, other than the network being too slow or overloaded, so the images don't get pulled "fast" enough. And then, of course, the CPU constraints.
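
One way to check the image-pull theory on a failing run would be to dump the pod's events right after the wait times out. The commands below are only a sketch; the namespace and pod name are placeholders for whatever the failing test uses.

    # List the events for the test pod, sorted by time, to see how long image pulling took.
    kubectl get events -n test-namespace --field-selector involvedObject.name=test-pod --sort-by=.lastTimestamp

    # Or inspect the container states and events in one shot.
    kubectl describe pod -n test-namespace test-pod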

burak-ok • Mar 27 '24 13:03

> I stick to my guns: you can indeed increase the waiting time, but the best solution would be to find out why these timeouts happen. For OOMKill, we now know why; for the others, so far, I do not have any clue.

> For the other examples (besides the OOMKill one) I have no idea what it might be, other than the network being too slow or overloaded, so the images don't get pulled "fast" enough. And then, of course, the CPU constraints.

We are running several tests in parallel, so we might be overloading the cluster. I remember @mauriciovasquezbernal had similar issues when working on EKS. So, we should consider limiting the number of parallel tests.

blanquicet • Mar 27 '24 20:03

> I stick to my guns: you can indeed increase the waiting time, but the best solution would be to find out why these timeouts happen. For OOMKill, we now know why; for the others, so far, I do not have any clue.

> For the other examples (besides the OOMKill one) I have no idea what it might be, other than the network being too slow or overloaded, so the images don't get pulled "fast" enough. And then, of course, the CPU constraints.

> We are running several tests in parallel, so we might be overloading the cluster. I remember @mauriciovasquezbernal had similar issues when working on EKS. So, we should consider limiting the number of parallel tests.

🤣 🤣 🤣 No, we just need to get bigger clusters.

eiffel-fl • Mar 28 '24 08:03

> I stick to my guns: you can indeed increase the waiting time, but the best solution would be to find out why these timeouts happen. For OOMKill, we now know why; for the others, so far, I do not have any clue.

> For the other examples (besides the OOMKill one) I have no idea what it might be, other than the network being too slow or overloaded, so the images don't get pulled "fast" enough. And then, of course, the CPU constraints.

> We are running several tests in parallel, so we might be overloading the cluster. I remember @mauriciovasquezbernal had similar issues when working on EKS. So, we should consider limiting the number of parallel tests.

> 🤣 🤣 🤣 No, we just need to get bigger clusters.

Yep, that's the other option 😝

blanquicet • Apr 01 '24 14:04

I think we need to limit the number of parallel tests. I remember I tried it out before when playing with EKS and it didn't work well, but I just realized I could have confused the -p and --parallel flags (https://stackoverflow.com/a/72354980), so I wasn't actually limiting the number of parallel tests at all.
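
For reference, the two flags control different things; a rough sketch of the distinction (the package path is only illustrative):

    # -p caps how many packages' test binaries the go tool runs at the same time (default: GOMAXPROCS).
    go test -p 2 ./integration/...

    # -parallel caps how many tests calling t.Parallel() run concurrently
    # within a single test binary (default: GOMAXPROCS).
    go test -parallel 2 ./integration/...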

mauriciovasquezbernal • Apr 08 '24 18:04

> I think we need to limit the number of parallel tests.

In the failing log https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8359069409/job/22881852532#step:7:2759, the failure happened on GKE, where we use t2a-standard-2, which has 2 vCPUs.

-parallel defaults to GOMAXPROCS, which means we ran with -parallel 2 on GKE.

So the only thing we could do for GKE is disable parallel tests completely, which would potentially double the CI time.
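
A minimal sketch of what that would look like, assuming the GKE job can pass its own test flags (the package path is a placeholder):

    # t2a-standard-2 has 2 vCPUs, so GOMAXPROCS and therefore the -parallel default are both 2.
    # Forcing tests that call t.Parallel() to run one at a time on that environment:
    go test -parallel 1 ./integration/...

The trade-off is exactly the one mentioned above: with -parallel 1 the tests run serially, so that job's wall-clock time could roughly double.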

burak-ok • Apr 09 '24 11:04

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] • Jun 09 '24 01:06

This pull request has been automatically closed due to inactivity. If you believe this was closed in error, please feel free to reopen it.

github-actions[bot] • Jun 24 '24 01:06