flake: TestWorkspaceDeletionLeak

Open ntnn opened this issue 4 months ago • 2 comments

Describe the bug

The test TestWorkspaceDeletionLeak has been flaking.

I was aware that this could happen when I implemented the test and had planned a workaround, but that would require the goleak maintainers to merge an open PR. Details are here: https://github.com/kcp-dev/kcp/pull/3491#discussion_r2226127866

Instead, the test now uses require.EventuallyWithT (kcptestinghelpers.Eventually would always fail immediately for some reason), but it seems the 30s timeout is not enough:

 I0807 09:17:51.683088   39314 namespace_controller.go:194] "Namespace has been deleted" component="kcp" postStartHook="kcp-start-controllers" namespace="yef8oaknnwv5ohao|default"
{"level":"warn","ts":"2025-08-07T09:18:05.175445Z","caller":"fileutil/purge.go:80","msg":"failed to lock file","path":"/tmp/TestWorkspaceDeletionLeak3304689176/002/artifacts/etcd-server/member/wal/0000000000000000-0000000000000000.wal","error":"fileutil: file already locked"}
    leak_test.go:99: found leaking goroutines: ...
    leak_test.go:99: 
        	Error Trace:	/home/prow/go/src/github.com/kcp-dev/kcp/test/integration/workspace/leak_test.go:99
        	Error:      	Condition never satisfied
        	Test:       	TestWorkspaceDeletionLeak
        	Messages:   	eventually there will be no random goroutines running while checking for leaks
I0807 09:18:20.286940   39314 dynamic_serving_content.go:195] "Failed to remove file watch,

It's also not possible to just shut down the KCP server because that could hide potential leaks.

Simply ignoring any goroutines related to HTTP requests could also hide leaks, e.g. if a long-running HTTP request is sent without a context.

Steps To Reproduce

  1. Make a PR
  2. Wait for the test to fail randomly
  3. If it doesn't, retrigger until it does: https://prow.kcp.k8c.io/?job=pull-kcp-test-integration

Expected Behaviour

The test should not flake

Additional Context

No response

ntnn avatar Aug 08 '25 11:08 ntnn

/kind flake

ntnn avatar Aug 08 '25 12:08 ntnn

https://s3.eu-west-1.amazonaws.com/prow-public-data/pr-logs/pull/kcp-dev_kcp/3565/pull-kcp-test-integration/1962839491542519808/build-log.txt?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAUXHT7IH25XHMMYM5%2F20250902%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20250902T115100Z&X-Amz-Expires=600&X-Amz-SignedHeaders=host&X-Amz-Signature=cb52e9f180c03110c2ef1cab917bd84ced89f1f5028c896b92018872c861df3d

ntnn avatar Sep 02 '25 11:09 ntnn

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kcp-ci-bot avatar Dec 01 '25 20:12 kcp-ci-bot

/remove-lifecycle stale

ntnn avatar Dec 01 '25 21:12 ntnn