linkerd2 icon indicating copy to clipboard operation
linkerd2 copied to clipboard

Tracing test is flakey

Open olix0r opened this issue 3 years ago • 6 comments

The tracing test frequently fails with spurious errors like:

--- FAIL: TestTracing (179.94s)
    --- FAIL: TestTracing/expect_full_trace (120.03s)
        tracing_test.go:224: No trace found with processes: [nginx web emoji voting linkerd-proxy]
FAIL
FAIL	github.com/linkerd/linkerd2/test/integration/tracing	180.065s
FAIL

That is, after running other tests, we wait 3 minutes for the test to fail, invalidating the whole CI run.

I'm going to disable this test until we can get at the root of this flakiness.

See also #7403

olix0r avatar Dec 28 '21 17:12 olix0r

So, I tried to dig more on this test, and have the following findings:

  • Currently, It seems to be one of the slowest test suite taking more than 3m. Deploying and waiting for linkerd-emojivoto seems to be the main reason here. I see that we seem to deploy emojivoto across the policyintegration test. It would be better instead to deploy once during the install_test.go and use it across. (Tests that need to have additional inject configuration will need to do the update, and perform a rollout)
  • For flakiness, The reason seems to be that we always expect the full trace across [nginx web emoji voting linkerd-proxy] to be present. We could do the following to improve this
    • Increase the time out. Currently we wait for 120s but due to multiple components being involved. I think improving this might help.
    • Test only for the linkerd-proxy span. So, I don't think we should be waiting for the full trace as some of the application spans maybe missing for various reasons.

@mateiidavid WDYT?

Pothulapati avatar Jan 06 '22 12:01 Pothulapati

@Pothulapati do we know how long it usually takes for emojivoto to deploy? I wonder why it would take so long for it to start-up, we make use of it in other tests and I don't think they take as long as tracing? 🤔 We also seem to install linkerd-jaeger. As part of https://github.com/linkerd/linkerd2/issues/7403, I want to create a viz suite. I wonder if it would be better to install jaeger when we install the control plane and viz for the test suite, that might shave some seconds off.

@Pothulapati I looked through the tracing test and there are a couple of things I think we can do:

  1. With all resources in the test, we seem to be reading them from a file and then injecting. We could perhaps optimize here a bit and inject them before replacing the configuration parameters.
  2. linkerd-jaeger is deployed as part of the test. We should maybe pull that out; we could install it in install_test.go. As part of https://github.com/linkerd/linkerd2/issues/7403 we'll create a viz suite so I think it would make sense to install jaeger at the same time as viz.
  3. If you're comfortable with only testing for linkerd-proxy span, then we can probably do that. It seems sufficient to me, I think you may know more on the topic though.

We could move emojivoto to install_test but one advantage of having it structured this way is that all resources are properly cleaned up after the test runs. We'd have to clean up the resources on our own and I think we might want to avoid an explicit clean-up test in install.go. I'm surprised it takes so long to install it though, perhaps we can offset by trying to make use of a cache for external dependencies? It was suggested by @alpeb but I never got around to trying it out properly. If we can cache the images for our dependencies (e.g emojivoto, rabbitmq, etc.) then we don't have to deploy it ahead of time maybe.

One solution from my side would be to get rid of emojivoto and just rely on slow-cooker and nginx (the latter is already used in the test).

Edit: raised https://github.com/linkerd/linkerd2/issues/7587 to investigate the possibility of caching the images for emojivoto, maybe that might help?

mateiidavid avatar Jan 11 '22 16:01 mateiidavid

I wonder why it would take so long for it to start-up, we make use of it in other tests and I don't think they take as long as tracing

I had a similar question too and I think the reason here is that the tracing test seems to spin them up for the first time, and hence the image pulls end up taking some time? . Pulling the images AOT might definitely help I think.

We could move emojivoto to install_test but one advantage of having it structured this way is that all resources are properly cleaned up after the test runs. We'd have to clean up the resources on our own and I think we might want to avoid an explicit clean-up test in install.go

Yeah, As each test seems to have its own inject configuration, this does not seem like a great way to fix this. We should probably start with AOT image loading, and then see if its still slow?

Pothulapati avatar Jan 11 '22 17:01 Pothulapati

I did some testing and realized the bottleneck is the loading of the nginx-ingress image. Unpacked it weights 555MB. Upgrading to 0.33.0 results in 327MB, but that'd require updating other things in the test. OTOH, preloading the image didn't make much of a difference to me in terms of total time, but maybe that can help avoiding the flakiness. Replacing that ingress with another one might also be another avenue to improve times...

alpeb avatar Jan 12 '22 14:01 alpeb

@alpeb That's a good idea! I will start working on this i.e performing the same test without using nginx-ingress

Pothulapati avatar Jan 12 '22 16:01 Pothulapati

Notably, even the merge of #7601 (f2ad5a9 log) failed on this test:

--- FAIL: TestPolicy (494.98s)
    --- FAIL: TestPolicy/linkerd_viz_stat_srv_-n_linkerd-stat-authz-test (180.35s)
        policy_test.go:146: Unexpected success rate for [emoji-grpc], got [-]
FAIL
FAIL	github.com/linkerd/linkerd2/test/integration/policy	495.078s
FAIL

I've seen other branches fail on main as well, like #7623 (0704f1bef1c09984b1c137d73efef6345f047bb7 log):

--- FAIL: TestTracing (207.01s)
    --- FAIL: TestTracing/expect_full_trace (180.04s)
        tracing_test.go:169: No trace found with processes: linkerd-proxy
FAIL
FAIL	github.com/linkerd/linkerd2/test/integration/tracing	207.116s
FAIL

olix0r avatar Jan 18 '22 00:01 olix0r