kustomize
kustomize copied to clipboard
Flaky integration tests
Describe the bug
Many integration tests that kustomize build
urls in remoteload_test.go are flaky. They exhibit intended behavior on my machine, but sporadically fail when run for every PR on the server.
Files that can reproduce the issue
We have observed the following flaky tests:
- TestRemoteResourceRepo: https, no ref
- TestRemoteResourceParameters: https no params
- TestRemoteResourceParameters: https master
- TestRemoteResourceParameters: https master and no submodules
- TestRemoteResourceParameters: https all params
- TestRemoteResourceGoGetter: git detector with / subdirectory separator
- TestRemoteResourceGoGetter: git detector for repo
- TestRemoteResourceGoGetter: https with / subdirectory separator
- git forced protocol with / subdirectory separator
Expected output
The expected output is written in the test cases.
Actual output
On my local machine, the output is as expected. On the server, the tests mostly pass, but occasionally fail. This logs the output of some of the flaky tests on a server run.
Kustomize version
I ran the tests on the master branch, where HEAD was at commit 22668ea.
Platform
I use macOS. The tests only fail for macOS (not Linux) on the server.
Additional context
Issue #4623 also mentions this issue.
Hey @annasong20, I saw issue, I'm learning how to write go-tests. Can I work on this?
@rajatgupta24 Sure, go for it!
/triage accepted
After a flaky run, I found that all flaky tests failed on git checkout FETCH_HEAD in cloner. Given that we run tests concurrently, I believe the flaky tests fail when repos from the different tests are cloned concurrently and FETCH_HEAD
points to the wrong HEAD.
We can fix this either by changing the line git checkout FETCH_HEAD
or running the integration tests sequentially.
FYI @annasong20 I looked into it a little bit and I think you might be able to replace https://github.com/kubernetes-sigs/kustomize/blob/master/api/internal/git/cloner.go#L33-36 with git fetch origin --depth=1
and git checkout origin/HEAD
. I'm not 100% sure (I'm not a git expert) so will need you to verify.
/assign
If there isn't a clean threadsafe way to write the git commands, we can also consider locking these critical lines of code with a mutex.
Per offline discussion, it has come to our attention that it being a concurrency issue is unlikely as all the remote tests are running in the same package and therefore not running in parallel.
My update after looking into this issue some more:
- Each flaky test fails for the same reason: it gets stuck on
git checkout FETCH_HEAD
. However, as stated above, this shouldn't be a concurrent test issue. Moreover, even if the tests ran concurrently, this shouldn't be a concurrency issue, as eachgit
command should be run in its own unique, temporary directory. - The
git checkout
may be timing out because- the
git fetch
in the line before isn't behaving correctly, though this is questionable because we check for errors. - some other process is locking the files in the git directory, and blocking
git fetch
indefinitely.
- the
- A failed sub-test will fail the parent test because the same
require
is used. This is a bug I need to fix.
I'm not sure this fully explains it, but the sheer size of our repo now that it has a docs site again is likely a contributing factor. We should consider preventing submodule initalization and raising the timeout on tests where this isn't important (Kustomize supports this already).
That said, @natasha41575 @annasong20 and I had a discussion about this suite, and I proposed that we go back to the coverage we want to have instead of necessarily fixing the tests in their current form. Here's the tentative plan for what we want:
(1) An exhaustive suite of unit tests covering the user input -> parameters -> url/params that get passed to git
piece. This probably just means enumerating the permutations and doing an audit of repospec_test.go.
(2) A suite of integration tests that use the file://
protocol and exhaustively test all the query parameters we support. We don't have this at all yet, and it would replace most of the tests in remoteload_test.go.
(3) A small number of protocol tests that are free to use any parameters that make sense (e.g. we probably want to disable submodules and extend the timeout).
@mightyguava if you're able to help us with this, we would greatly appreciate it.
/assign @mightyguava
With #4777 and #4783, can this issue be closed?
Yes, 🤞 those fixed it. We'll need some time to see enough CI runs to feel completely confident (e.g. that we don't need to add retries to the protocol tests), but we can close this for now and reopen if we discover there's more to be done.
/close
@KnVerey: Closing this issue.
In response to this:
Yes, 🤞 those fixed it. We'll need some time to see enough CI runs to feel completely confident (e.g. that we don't need to add retries to the protocol tests), but we can close this for now and reopen if we discover there's more to be done.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.