
[CI] integration: Increase pod and job wait timeout to 45s

Open • burak-ok opened this issue 10 months ago • 9 comments

integration: Increase pod and job ready timeout to 45s

We are seeing CI test failures because of timeouts, so we could wait an extra 15s (the default timeout is 30s). Failing runs (a sketch of the longer wait follows the list):

  • https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8448407680/job/23142053593#step:11:2168
  • https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8448407680/job/23142052066 (log too long for a direct link to the line):
    2024-03-27T08:52:39.6903877Z === NAME  TestTraceOOMKill
    2024-03-27T08:52:39.6904729Z     command.go:483: Command returned(WaitForTestPod):
    2024-03-27T08:52:39.6905683Z         error: timed out waiting for the condition on pods/test-pod
    
  • https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8359069409/job/22881852532#step:7:2759
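
At the kubectl level, the longer wait would look roughly like the commands below. This is only a sketch: the real change goes through the integration test helpers (WaitForTestPod / WaitForJob), and the namespace and resource names here are placeholders taken from the failing tests.

    # Wait up to 45s (instead of the 30s default) for the test pod to become Ready.
    kubectl wait pod --for condition=ready --timeout=45s -n test-namespace test-pod

    # Same idea for jobs created by the tests.
    kubectl wait job --for condition=complete --timeout=45s -n test-namespace copier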

burak-ok • Mar 27 '24 11:03

This does not tackle the root cause:

command.go:483: Command returned(RunOomkillTestPod):

pod/test-pod created

command.go:481: Run command(WaitForTestPod):
kubectl wait pod --for condition=ready -n test-trace-oomkill-8931607124889025702 test-pod
command.go:483: Command returned(WaitForTestPod):
error: timed out waiting for the condition on pods/test-pod
NAME       READY   STATUS      RESTARTS   AGE   IP           NODE                                NOMINATED NODE   READINESS GATES
test-pod   0/1     OOMKilled   0          41s   10.244.2.9   aks-nodepool1-35974623-vmss000001   <none>           <none>
...

Please see: https://github.com/inspektor-gadget/inspektor-gadget/pull/2649

eiffel-fl • Mar 27 '24 11:03

> Please see: #2649

This is not only about OOMKill. Please also see the tests from the first and last links. The issues we have with the CI will not be solved by a single PR; it's not a single issue:

=== NAME  TestRunInsecure
  command.go:483: Command returned(WaitForJob: copier):
       error: timed out waiting for the condition on jobs/copier

and

=== NAME  TestProfileCpu
    command.go:483: Command returned(WaitForTestPod):
        error: timed out waiting for the condition on pods/test-pod

burak-ok • Mar 27 '24 12:03

> > Please see: #2649
>
> This is not only about OOMKill. Please also see the tests from the first and last links. The issues we have with the CI will not be solved by a single PR; it's not a single issue:
>
> === NAME  TestRunInsecure
>   command.go:483: Command returned(WaitForJob: copier):
>        error: timed out waiting for the condition on jobs/copier
>
> and
>
> === NAME  TestProfileCpu
>     command.go:483: Command returned(WaitForTestPod):
>         error: timed out waiting for the condition on pods/test-pod

I stick to my guns: you can indeed increase the waiting time, but the best solution would be to find out why these timeouts happen. For OOMKill, we now know why; for the others, so far, I do not have any clue.

eiffel-fl • Mar 27 '24 12:03

> I stick to my guns: you can indeed increase the waiting time, but the best solution would be to find out why these timeouts happen. For OOMKill, we now know why; for the others, so far, I do not have any clue.

For the other examples (besides the OOMKill one) I have no idea what it might be, other than the network being too slow or overloaded, so the images don't get pulled "fast" enough. And then, of course, the CPU constraints.
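
One way to check the image-pull theory on a failing run would be to dump the pod's events right after the wait times out. The commands below are only a sketch; the namespace and pod name are placeholders for whatever the failing test uses.

    # List the events for the test pod, sorted by time, to see how long image pulling took.
    kubectl get events -n test-namespace --field-selector involvedObject.name=test-pod --sort-by=.lastTimestamp

    # Or inspect the container states and events in one shot.
    kubectl describe pod -n test-namespace test-pod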

burak-ok • Mar 27 '24 13:03

> I stick to my guns: you can indeed increase the waiting time, but the best solution would be to find out why these timeouts happen. For OOMKill, we now know why; for the others, so far, I do not have any clue.

> For the other examples (besides the OOMKill one) I have no idea what it might be, other than the network being too slow or overloaded, so the images don't get pulled "fast" enough. And then, of course, the CPU constraints.

We are running several tests in parallel, so we might be overloading the cluster. I remember @mauriciovasquezbernal had similar issues when working on EKS. So, we should consider limiting the number of parallel tests.

blanquicet • Mar 27 '24 20:03

> I stick to my guns: you can indeed increase the waiting time, but the best solution would be to find out why these timeouts happen. For OOMKill, we now know why; for the others, so far, I do not have any clue.

> For the other examples (besides the OOMKill one) I have no idea what it might be, other than the network being too slow or overloaded, so the images don't get pulled "fast" enough. And then, of course, the CPU constraints.

> We are running several tests in parallel, so we might be overloading the cluster. I remember @mauriciovasquezbernal had similar issues when working on EKS. So, we should consider limiting the number of parallel tests.

🤣 🤣 🤣 No, we just need to get bigger clusters.

eiffel-fl • Mar 28 '24 08:03

> I stick to my guns: you can indeed increase the waiting time, but the best solution would be to find out why these timeouts happen. For OOMKill, we now know why; for the others, so far, I do not have any clue.

> For the other examples (besides the OOMKill one) I have no idea what it might be, other than the network being too slow or overloaded, so the images don't get pulled "fast" enough. And then, of course, the CPU constraints.

> We are running several tests in parallel, so we might be overloading the cluster. I remember @mauriciovasquezbernal had similar issues when working on EKS. So, we should consider limiting the number of parallel tests.

> 🤣 🤣 🤣 No, we just need to get bigger clusters.

Yep, that's the other option 😝

blanquicet • Apr 01 '24 14:04

I think we need to limit the number of parallel tests. I remember I tried it out before when playing with EKS and it didn't work well, but I just realized I could have confused the -p and --parallel flags (https://stackoverflow.com/a/72354980), so I wasn't actually limiting the number of parallel tests at all.
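
For reference, the two flags control different things; a rough sketch of the distinction (the package path is only illustrative):

    # -p caps how many packages' test binaries the go tool runs at the same time (default: GOMAXPROCS).
    go test -p 2 ./integration/...

    # -parallel caps how many tests calling t.Parallel() run concurrently
    # within a single test binary (default: GOMAXPROCS).
    go test -parallel 2 ./integration/...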

mauriciovasquezbernal • Apr 08 '24 18:04

> I think we need to limit the number of parallel tests.

In the failing log https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8359069409/job/22881852532#step:7:2759, the failure happened on GKE, where we use t2a-standard-2, which has 2 vCPUs.

-parallel defaults to GOMAXPROCS, which means we ran with -parallel 2 on GKE.

So the only thing we could do for GKE is disable parallel tests completely, which would potentially double the CI time.
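
A minimal sketch of what that would look like, assuming the GKE job can pass its own test flags (the package path is a placeholder):

    # t2a-standard-2 has 2 vCPUs, so GOMAXPROCS and therefore the -parallel default are both 2.
    # Forcing tests that call t.Parallel() to run one at a time on that environment:
    go test -parallel 1 ./integration/...

The trade-off is exactly the one mentioned above: with -parallel 1 the tests run serially, so that job's wall-clock time could roughly double.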

burak-ok • Apr 09 '24 11:04

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

github-actions[bot] • Jun 09 '24 01:06

This pull request has been automatically closed due to inactivity. If you believe this was closed in error, please feel free to reopen it.

github-actions[bot] • Jun 24 '24 01:06