kcp icon indicating copy to clipboard operation
kcp copied to clipboard

Add `test/integration`

Open ntnn opened this issue 10 months ago • 13 comments

Summary

Adding a framework for integration tests.

Startup takes a long time - I am contemplating adding support for a shared embedded etcd (or an external etcd like in kube) if that significantly improves startup.

What Type of PR Is This?

/kind feature

Related Issue(s)

Release Notes

NONE

ntnn avatar Apr 14 '25 07:04 ntnn

Hi @ntnn. Thanks for your PR.

I'm waiting for a kcp-dev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

kcp-ci-bot avatar Apr 14 '25 07:04 kcp-ci-bot

Tests with race condition testing fails. I think kube also doesn't race test their integration tests.

ntnn avatar Apr 14 '25 08:04 ntnn

/ok-to-test

embik avatar Apr 14 '25 09:04 embik

/test pull-kcp-test-unit

ntnn avatar Apr 14 '25 13:04 ntnn

/test pull-kcp-test-e2e-shared

ntnn avatar Apr 14 '25 16:04 ntnn

Startup takes a long time - I am contemplating adding support for a shared embedded etcd (or an external etcd like in kube) if that significantly improves startup.

This is kind of a known problem with starting kcp the first time, so we should probably tackle this in kcp itself, not the integration test framework.

embik avatar Apr 15 '25 06:04 embik

Yeah, I just measured the startup times - etcd takes less than a second, kcp phase1 takes roughly 22s. And using a shared kcd for the integration tests would make less sense imho - if the kcp instance could be shared without tests affecting each other they could likely also be e2e tests.

ntnn avatar Apr 15 '25 07:04 ntnn

Running with race detection works. Root cause was that this method could be called from multiple goroutines if no DialContext is set: https://github.com/kcp-dev/kcp/blob/a2014d6bc9c7585fa55dc1d1b4f8476a805e8b86/pkg/network/dialer.go#L39-L48

This originates from the server config here: https://github.com/kcp-dev/kcp/blob/a2014d6bc9c7585fa55dc1d1b4f8476a805e8b86/pkg/server/config.go#L208-L210

So either the wrapper function is wrapped with a mutex to prevent concurrent writes or the wrapper function acts atomically. I don't think the GenericAPIServer can be persuaded to call this initially.

I don't particulalry like either option. ~~Another option could be to just remove that config - TCP timeout should be 10m anyhow iirc.~~ Scratch that, I read 25 * time.Minute - but it's 25 * time.Second:

https://github.com/kcp-dev/kcp/blob/a2014d6bc9c7585fa55dc1d1b4f8476a805e8b86/pkg/network/dialer_linux.go#L35-L41

ntnn avatar Apr 15 '25 11:04 ntnn

Technically it would maybe possible to set .Transport on the GenericAPIConfig with a transport that already has the TCP timeout set.

Maybe there was a good reason not to use .Transport and instead to define a wrapper: https://github.com/kcp-dev/kcp/commit/9574eefc68c56e0f297643b000e77135eb095c6a#diff-d84eac55294d8f540a660bfcb4043f7c4e1f76c8bccbb91c129e8e4992dbafe7R216

However I can't discern a reason from reading the documentation - it even reads more like setting the transport would be the right approach to setting the TCP timeout:

https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/rest/config.go#L104-L107

	// Transport may be used for custom HTTP behavior. This attribute may not
	// be specified with the TLS client certificate options. Use WrapTransport
	// to provide additional per-server middleware behavior.
	Transport http.RoundTripper

From attempting to set this attribute instead of wrap - I'd guess wrapping was used because setting a transport requires setting up a transport how the library expects it rather than having the library provide a working transport and just adjusting the timeouts.

ntnn avatar Apr 15 '25 12:04 ntnn

/test pull-kcp-test-e2e-multiple-runs

ntnn avatar Apr 15 '25 12:04 ntnn

Found out why .Transport isn't used. When this is set this if takes hold:

https://github.com/kubernetes/kubernetes/blob/30469e180361d7da07b0fee6d47c776fa2cf3e86/staging/src/k8s.io/client-go/transport/transport.go#L34-L40

// New returns an http.RoundTripper that will provide the authentication
// or transport level security defined by the provided Config.
func New(config *Config) (http.RoundTripper, error) {
	// Set transport level security
	if config.Transport != nil && (config.HasCA() || config.HasCertAuth() || config.HasCertCallback() || config.TLS.Insecure) {
		return nil, fmt.Errorf("using a custom transport with TLS certificate options or the insecure flag is not allowed")
	}

So the TLS configuration would have to be removed, which then causes connections to fail due to the missing TLS configuration:

E0415 14:28:52.537341   14738 storage_rbac.go:191] "Unhandled Error" err="unable to initialize clusterroles: Get \"https://127.0.0.1:57059/clusters/system:admin/apis/rbac.authorization.k8s.io/v1/clusterroles\": tls: failed to verify certificate: x509: certificate signed by unknown authority" logger="UnhandledError"
F0415 14:28:52.538522   14738 hooks.go:210] PostStartHook "kcp-bootstrap-policy" failed: unable to initialize roles: timed out waiting for the condition
FAIL    github.com/kcp-dev/kcp/test/integration/framework       39.978s

https://github.com/kcp-dev/kcp/commit/a07f6105c1016aac468dbb6147ce47021eba8a62#diff-d84eac55294d8f540a660bfcb4043f7c4e1f76c8bccbb91c129e8e4992dbafe7R210

Technically a bug, but that was introduced >7y ago - for a fix on the KCP side for now we'll have to go with either of the workarounds.

ntnn avatar Apr 15 '25 12:04 ntnn

Very nice!

About startup, I think we need somebody debugging our startup process. I think there must be some polling that is slow. At some point, we have split the kcp core informers into one part that does not need identities (system CRDs) and those resource based on APIExports (which need identities). And there is logic to bootstrap in a multi-shard env. Something in this area must be far too slow. Maybe some polling with too long interval.

sttts avatar Apr 23 '25 20:04 sttts

Marking as WIP as I want to change how the server is started

ntnn avatar Apr 26 '25 07:04 ntnn

The goleak ignore list is pretty long but I'm fairly certain that quite a few are just downstream of other issues.

ntnn avatar May 13 '25 09:05 ntnn

LGTM label has been added.

Git tree hash: 3f155fed1096611d0ced7799a303725e25426e9b

kcp-ci-bot avatar May 14 '25 09:05 kcp-ci-bot

/lgtm /approve

mjudeikis avatar May 19 '25 13:05 mjudeikis

LGTM label has been added.

Git tree hash: 40a308411262a2b308e7b33218fabbd9051a7f92

kcp-ci-bot avatar May 19 '25 13:05 kcp-ci-bot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mjudeikis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

kcp-ci-bot avatar May 19 '25 13:05 kcp-ci-bot