[MAINT] - Address flakiness of integration tests

Open viniciusdc opened this issue 11 months ago • 3 comments

Context

With the recent adoption of the await workflow (a welcome change, since we previously had to add the kubectl command manually ourselves), we are occasionally seeing odd failures with the image puller: in a couple of deployments the workflow got stuck waiting for it. This looks like flaky behavior and requires further validation. We may need to increase the time limit or the number of retries.
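To make the "time limit or retries" trade-off concrete, here is a minimal sketch of a poll-with-deadline helper. It is not how the await action is implemented; `wait_until` and its parameters are hypothetical names used only for illustration.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout: float,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or `timeout` seconds have elapsed.

    Hypothetical helper: `sleep` and `clock` are injectable so the loop
    can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        # Give up once the deadline has passed; otherwise wait and retry.
        if clock() >= deadline:
            return False
        sleep(interval)
```

Raising `timeout` gives a slow image pull more time to finish, while lowering `interval` only changes how often readiness is checked. If the puller is genuinely stuck, neither setting helps, which is why the failure still needs to be validated.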

Image source: https://github.com/nebari-dev/nebari/actions/runs/12994981631/job/36240642535?pr=2924

Also, during releases, we have a hard time running CI against version bumps: at that point in the release workflow the new images are normally not yet available, so the deployment fails the health check on the pods (namely jupyterhub).

Image source: https://github.com/nebari-dev/nebari/actions/runs/12952884533/job/36211476433?pr=2924

Value and/or benefit

Running/stable testing

Anything else?

No response

viniciusdc avatar Jan 27 '25 18:01 viniciusdc

@viniciusdc I think the first case is related to https://github.com/nebari-dev/nebari/issues/2947. However, I agree our tests seem to be flaky and that needs to be addressed.

marcelovilla avatar Feb 10 '25 10:02 marcelovilla

I recently noticed that there is another action that you can run with jupyterhub/action-k8s-await-workloads@v3, and it lets you inspect the affected pods (though we usually don't need it, since it generates too much data). For this specific error, it allowed me to find a problem with promtail, as seen below:

Image Image

This is a known issue when running Kind: https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files
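For reference, the Kind page linked above suggests raising `fs.inotify.max_user_watches` to 524288 and `fs.inotify.max_user_instances` to 512. A minimal sketch for checking whether a runner is below those values follows. The function names are hypothetical, and it assumes a Linux host that exposes the limits under `/proc/sys/fs/inotify`.

```python
from pathlib import Path

# Values suggested on the Kind known-issues page for this error.
RECOMMENDED = {"max_user_watches": 524288, "max_user_instances": 512}


def read_inotify_limits(proc_dir="/proc/sys/fs/inotify"):
    """Return the current inotify limits as a dict of name -> int."""
    base = Path(proc_dir)
    return {name: int((base / name).read_text().strip()) for name in RECOMMENDED}


def limits_below_recommended(limits, recommended=RECOMMENDED):
    """Return the sorted names of limits that are lower than recommended."""
    return sorted(
        name for name, value in recommended.items() if limits.get(name, 0) < value
    )
```

Calling `limits_below_recommended(read_inotify_limits())` on the CI runner before creating the cluster would show which limits, if any, still need to be raised with `sysctl`.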

I think I addressed this in the past, but with the new update to Ubuntu 24.x (#2958) the fix might have been removed.

Since this is a bit different and mostly associated with the above update, I will open a new issue:

  • [ ] Address fsnotify "too many open files" error on test-local-integration

viniciusdc avatar Feb 19 '25 16:02 viniciusdc

This is one workflow run where we can see the above error message: https://github.com/nebari-dev/nebari/actions/runs/13415146171/job/37478696321?pr=2965. And here is a second run with the inotify update: https://github.com/nebari-dev/nebari/actions/runs/13417860818/job/37483043593?pr=2965

viniciusdc avatar Feb 19 '25 17:02 viniciusdc