inspektor-gadget
[CI] integration: Increase pod and job wait timeout to 45s
integration: Increase pod and job ready timeout to 45s
We get CI test failures because of timeouts, so we could wait an extra 15s (the default timeout is 30s); a sketch of the bumped wait follows the links below:
- https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8448407680/job/23142053593#step:11:2168
- https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8448407680/job/23142052066 (log too long for a direct link to the line):
  2024-03-27T08:52:39.6903877Z === NAME TestTraceOOMKill
  2024-03-27T08:52:39.6904729Z command.go:483: Command returned(WaitForTestPod):
  2024-03-27T08:52:39.6905683Z error: timed out waiting for the condition on pods/test-pod
- https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8359069409/job/22881852532#step:7:2759
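For illustration, a minimal sketch of the bumped wait (this is not the actual WaitForTestPod helper in command.go, and the namespace and pod name are placeholders); kubectl wait defaults to a 30s timeout, so passing an explicit --timeout=45s buys the extra 15s:

    package main

    import (
        "fmt"
        "os/exec"
    )

    func main() {
        // Placeholder namespace/pod; the real tests generate their own names.
        out, err := exec.Command("kubectl", "wait", "pod", "test-pod",
            "--for", "condition=ready",
            "-n", "test-namespace",
            "--timeout", "45s").CombinedOutput()
        fmt.Print(string(out))
        if err != nil {
            fmt.Println("wait failed:", err)
        }
    }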
This does not tackle the root cause:
command.go:483: Command returned(RunOomkillTestPod):
pod/test-pod created
command.go:481: Run command(WaitForTestPod):
kubectl wait pod --for condition=ready -n test-trace-oomkill-8931607124889025702 test-pod
command.go:483: Command returned(WaitForTestPod):
error: timed out waiting for the condition on pods/test-pod
NAME       READY   STATUS      RESTARTS   AGE   IP           NODE                                NOMINATED NODE   READINESS GATES
test-pod   0/1     OOMKilled   0          41s   10.244.2.9   aks-nodepool1-35974623-vmss000001   <none>           <none>
...
Please, see: https://github.com/inspektor-gadget/inspektor-gadget/pull/2649
This is not only about OOMKill; please also see the tests from the first and last links. The issues we have with the CI will not be solved by a single PR. It's not a single issue:
=== NAME TestRunInsecure command.go:483: Command returned(WaitForJob: copier): error: timed out waiting for the condition on jobs/copier
and
=== NAME TestProfileCpu command.go:483: Command returned(WaitForTestPod): error: timed out waiting for the condition on pods/test-pod
I stick to my guns: you can indeed increase the waiting time, but the best solution would be to find out why these timeouts happen. For oomkill we now know the reason; for the others, so far, I do not have any clue.
For the other examples (besides the OOMKill) I have no idea what it might be, other than the network being too slow or overloaded, so the images don't get pulled fast enough. And then of course the CPU constraints.
We are running several tests in parallel so we might be overloading the cluster. I remember @mauriciovasquezbernal had similar issues when working on EKS. So, we should consider limiting the number of parallel tests.
🤣 🤣 🤣 No, we just need to get bigger clusters.
Yep, that's the other option 😝
I think we need to limit the number of parallel tests. I remember I tried it out before when playing with EKS and it didn't work well, but I just realized I could have confused the -p and --parallel flags (https://stackoverflow.com/a/72354980), so I wasn't actually limiting the number of parallel tests at all.
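For reference, go test -p limits how many package test binaries the go tool runs in parallel, while -parallel limits how many tests that call t.Parallel() run at once inside a single binary. A minimal, illustrative sketch (not taken from the repo):

    package integration_test

    import (
        "testing"
        "time"
    )

    // Run with, for example:
    //   go test -parallel 1 ./...   // the tests below run one at a time
    //   go test -p 1 ./...          // packages run one at a time, but this does
    //                               // not cap t.Parallel() tests inside a package
    func TestSleepA(t *testing.T) { t.Parallel(); time.Sleep(time.Second) }
    func TestSleepB(t *testing.T) { t.Parallel(); time.Sleep(time.Second) }
    func TestSleepC(t *testing.T) { t.Parallel(); time.Sleep(time.Second) }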
In the fail log https://github.com/inspektor-gadget/inspektor-gadget/actions/runs/8359069409/job/22881852532#step:7:2759 the failure happened on GKE, where we use t2a-standard-2, which has 2 vCPUs. -parallel defaults to GOMAXPROCS, which means we ran with -parallel 2 on GKE. So the only thing we could do for GKE is disable parallel tests completely, which would potentially double the CI time.
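As a quick check (illustrative only, not from the CI scripts), this prints the GOMAXPROCS value that -parallel inherits by default; on a 2-vCPU node it is 2:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        // go test's -parallel flag defaults to GOMAXPROCS; GOMAXPROCS defaults
        // to the number of CPUs, so this prints 2 on a 2-vCPU node.
        fmt.Println(runtime.GOMAXPROCS(0))
    }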
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
This pull request has been automatically closed due to inactivity. If you believe this was closed in error, please feel free to reopen it.