kustomize Flaky integration tests

trafficstars

Describe the bug

Many integration tests that kustomize build urls in remoteload_test.go are flaky. They exhibit intended behavior on my machine, but sporadically fail when run for every PR on the server.

Files that can reproduce the issue

We have observed the following flaky tests:

Expected output

The expected output is written in the test cases.

Actual output

On my local machine, the output is as expected. On the server, the tests mostly pass, but occasionally fail. This logs the output of some of the flaky tests on a server run.

Kustomize version

I ran the tests on the master branch, where HEAD was at commit 22668ea.

Platform

I use macOS. The tests only fail for macOS (not Linux) on the server.

Additional context

Issue #4623 also mentions this issue.

May 13 '22 17:05 annasong20

Hey @annasong20, I saw issue, I'm learning how to write go-tests. Can I work on this?

May 23 '22 02:05 rajatgupta24

@rajatgupta24 Sure, go for it!

May 23 '22 02:05 annasong20

/triage accepted

Aug 10 '22 19:08 natasha41575

After a flaky run, I found that all flaky tests failed on git checkout FETCH_HEAD in cloner. Given that we run tests concurrently, I believe the flaky tests fail when repos from the different tests are cloned concurrently and FETCH_HEAD points to the wrong HEAD.

We can fix this either by changing the line git checkout FETCH_HEAD or running the integration tests sequentially.

Aug 10 '22 21:08 annasong20

FYI @annasong20 I looked into it a little bit and I think you might be able to replace https://github.com/kubernetes-sigs/kustomize/blob/master/api/internal/git/cloner.go#L33-36 with git fetch origin --depth=1 and git checkout origin/HEAD. I'm not 100% sure (I'm not a git expert) so will need you to verify.

Aug 25 '22 18:08 natasha41575

/assign

Aug 25 '22 18:08 annasong20

If there isn't a clean threadsafe way to write the git commands, we can also consider locking these critical lines of code with a mutex.

Aug 25 '22 18:08 natasha41575

Per offline discussion, it has come to our attention that it being a concurrency issue is unlikely as all the remote tests are running in the same package and therefore not running in parallel.

Aug 26 '22 13:08 natasha41575

My update after looking into this issue some more:

Each flaky test fails for the same reason: it gets stuck on git checkout FETCH_HEAD. However, as stated above, this shouldn't be a concurrent test issue. Moreover, even if the tests ran concurrently, this shouldn't be a concurrency issue, as each git command should be run in its own unique, temporary directory.
The git checkout may be timing out because
- the git fetch in the line before isn't behaving correctly, though this is questionable because we check for errors.
- some other process is locking the files in the git directory, and blocking git fetch indefinitely.
A failed sub-test will fail the parent test because the same require is used. This is a bug I need to fix.

Aug 26 '22 18:08 annasong20

I'm not sure this fully explains it, but the sheer size of our repo now that it has a docs site again is likely a contributing factor. We should consider preventing submodule initalization and raising the timeout on tests where this isn't important (Kustomize supports this already).

That said, @natasha41575 @annasong20 and I had a discussion about this suite, and I proposed that we go back to the coverage we want to have instead of necessarily fixing the tests in their current form. Here's the tentative plan for what we want: (1) An exhaustive suite of unit tests covering the user input -> parameters -> url/params that get passed to git piece. This probably just means enumerating the permutations and doing an audit of repospec_test.go. (2) A suite of integration tests that use the file:// protocol and exhaustively test all the query parameters we support. We don't have this at all yet, and it would replace most of the tests in remoteload_test.go. (3) A small number of protocol tests that are free to use any parameters that make sense (e.g. we probably want to disable submodules and extend the timeout).

@mightyguava if you're able to help us with this, we would greatly appreciate it.

Aug 26 '22 19:08 KnVerey

/assign @mightyguava

Aug 29 '22 13:08 mightyguava

With #4777 and #4783, can this issue be closed?

Sep 20 '22 13:09 mightyguava

Yes, 🤞 those fixed it. We'll need some time to see enough CI runs to feel completely confident (e.g. that we don't need to add retries to the protocol tests), but we can close this for now and reopen if we discover there's more to be done.

/close

Sep 22 '22 22:09 KnVerey

@KnVerey: Closing this issue.

In response to this:

Yes, 🤞 those fixed it. We'll need some time to see enough CI runs to feel completely confident (e.g. that we don't need to add retries to the protocol tests), but we can close this for now and reopen if we discover there's more to be done.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sep 22 '22 22:09 k8s-ci-robot

kustomize kustomize copied to clipboard

Flaky integration tests

kustomize
kustomize copied to clipboard