cluster-api-provider-azure icon indicating copy to clipboard operation
cluster-api-provider-azure copied to clipboard

Enable race detector for unit tests

Open mboersma opened this issue 3 years ago • 5 comments

What type of PR is this?

/kind failing-test

What this PR does / why we need it:

Enables the Go race detector in unit tests, because that seems like a best practice. However, it does more than double the time:

% git co ill-race-you && time TEST_ARGS="-race -count=1" make go-test
...
TEST_ARGS="-race -count=1" make go-test  518.22s user 31.92s system 447% cpu 2:02.92 total
% git co main && time TEST_ARGS="-count=1" make go-test
...
TEST_ARGS="-count=1" make go-test  144.72s user 20.40s system 343% cpu 48.067 total

Maybe we could leave -race enabled for CI but users can disable it locally via TEST_ARGS? I don't want to make tests take longer, but this flag already found two data races.

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

If your tests have incomplete coverage, you may find more races by running a binary built with -race under a realistic workload.

This sentence in the race detector docs is very intriguing to me. I wonder if we could build a debug binary with -race that we used in CI; I have a strong intuition it would find something.

Or maybe add RACE_DETECTOR as a toggle for Tilt binaries? Edit: see #2727.

TODOs:

  • [x] squashed commits
  • [ ] includes documentation
  • [ ] adds unit tests

Release note:

NONE

mboersma avatar Sep 12 '22 19:09 mboersma

/retitle [WIP] Enable race detector for unit tests

mboersma avatar Sep 12 '22 19:09 mboersma

This is cool! Which part is still [WIP]?

CecileRobertMichon avatar Oct 14 '22 00:10 CecileRobertMichon

Which part is still [WIP]?

Just that I'm wary of more than doubling the time for make go-test. Increasing that part of the developer loop or a CI job from 3 to 8 minutes (for example) is painful.

Users can always disable race checking with TEST_ARGS="". If we're ok with that option and with the default for unit tests taking longer, I think we should merge this.

mboersma avatar Oct 14 '22 19:10 mboersma

/retitle Enable race detector for unit tests

mboersma avatar Oct 14 '22 19:10 mboersma

Just that I'm wary of more than doubling the time for make go-test. Increasing that part of the developer loop or a CI job from 3 to 8 minutes (for example) is painful.

That's not great. What if we add it to the make test target but not make go-test? That way devs can still use make go-test locally to run tests quickly but the PR job will run this with generate and lint?

CecileRobertMichon avatar Nov 07 '22 21:11 CecileRobertMichon

My vote is a new periodic job

jackfrancis avatar Nov 07 '22 21:11 jackfrancis

What if we add it to the make test target but not make go-test?

I like this idea: we should run it in CI but not make developers wait longer. I made those changes by adding a go-test-race target. @jackfrancis wdyt?

mboersma avatar Nov 08 '22 21:11 mboersma

@mboersma are we sure this is working? the most test run for this PR was within the expected time duration range of previous runs (withough the -race option)

jackfrancis avatar Nov 08 '22 22:11 jackfrancis

indeed

make test TEST_ARGS=" -coverprofile=coverage.out"

Looks like test-cover is setting its own args

/lgtm cancel

CecileRobertMichon avatar Nov 08 '22 22:11 CecileRobertMichon

Oh, I see we actually use make test-cover in CI. We can set the args for that similarly.

mboersma avatar Nov 08 '22 23:11 mboersma

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please ask for approval from cecilerobertmichon by writing /assign @cecilerobertmichon in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Nov 08 '22 23:11 k8s-ci-robot

@mboersma: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-cluster-api-provider-azure-e2e-exp 17b8fdba9993035733750d76da4dc449d3a7cea1 link false /test pull-cluster-api-provider-azure-e2e-exp

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot avatar Nov 09 '22 00:11 k8s-ci-robot

IDK why the e2e-exp job ran, but it failed to create an AKS cluster. I'm ignoring that flake since it's not a required check, and this only affects the unit tests.

mboersma avatar Nov 09 '22 16:11 mboersma

IDK why the e2e-exp job ran

it runs whenever we change the Makefile and other common CI files

that last run was suspiciously fast and I don't see anything in the output of the test run that indicates it ran the race detector, how do we know it's working?

CecileRobertMichon avatar Nov 09 '22 23:11 CecileRobertMichon

I added temp commits to log the go test command line and to reintroduce the race conditions. I see the -race flag being passed through and indeed the unit tests failed:

--- FAIL: TestReconcileInboundNATRule (0.00s)
    --- FAIL: TestReconcileInboundNATRule/fail_to_create_NAT_rule (0.00s)
        testing.go:1319: race detected during execution of test
    --- FAIL: TestReconcileInboundNATRule/NAT_rule_successfully_created_with_with_no_existing_rules (0.00s)
        testing.go:1319: race detected during execution of test
    --- FAIL: TestReconcileInboundNATRule/No_LB,_Nat_rule_reconciliation_is_skipped (0.00s)
        testing.go:1319: race detected during execution of test

It does run relatively quickly in CI: 6 or 7 minutes. But I'll take it since it's working as intended.

mboersma avatar Dec 07 '22 18:12 mboersma

@CecileRobertMichon @jackfrancis this is ready for review.

mboersma avatar Dec 07 '22 20:12 mboersma

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Dec 08 '22 23:12 k8s-ci-robot