cockroach icon indicating copy to clipboard operation
cockroach copied to clipboard

Roachtest redirect SSH flakes to test-eng

Open smg260 opened this issue 2 years ago • 1 comments

This PR inspects the failure output of a roachtest, and if it sees an SSH_PROBLEM, overrides the owning team to test-eng when reporting the github issue.

Currently errors are classified as an SSH error by roachprod if the exit code is 255 with an accompanying message prefixed with SSH_PROBLEM [1]. The errors are stringified and saved into t.mu.output|failureMsg. Thus in the test_runner at the call site of issue posting, we can check t.mu.output for SSH_PROBLEM and override the team and issue name accordingly.

Resolves: https://github.com/cockroachdb/cockroach/issues/82398

Release justification: test-only change Release note: none

smg260 avatar Sep 22 '22 18:09 smg260

This change is Reviewable

cockroach-teamcity avatar Sep 22 '22 18:09 cockroach-teamcity

We can also build on top of https://github.com/cockroachdb/cockroach/pull/88556 for this, I have a DM going with Miral.

tbg avatar Sep 23 '22 11:09 tbg

@smg260 (correct me if I'm wrong please) is getting https://github.com/cockroachdb/cockroach/pull/88556 in first and then re-working this to build on top of that PR.

I'm unassigning for review so that it doesn't show up in my queue in the meantime.

tbg avatar Sep 26 '22 07:09 tbg

@tbg Yep. Marking as draft for now.

smg260 avatar Sep 27 '22 13:09 smg260

This is rebased on top of https://github.com/cockroachdb/cockroach/pull/88556 which introduces structured errors.

Detecting the SSH flake via substring still works and keeps this PR simple.

This might be a good place to introduce some retry logic for running the SSH commands

smg260 avatar Oct 04 '22 14:10 smg260

Detecting the SSH flake via substring still works and keeps this PR simple.

A chance of a false positive (via substring) is probably very low, but structured errors should make it cleaner and pave the way for more cases. I am thinking of a switch statement on something like errCategory with cluster creation and ssh errors injecting the corresponding category at the site?

This might be a good place to introduce some retry logic for running the SSH commands

There is a separate issue to deal with SSH retries: https://github.com/cockroachdb/cockroach/issues/73542

srosenberg avatar Oct 06 '22 04:10 srosenberg

I recently spotted another issue... seeing this error message quite often during cluster teardown,

teardown: 16:53:25 cluster.go:1149: failed to fetch logs: cluster.Get: get logs failed

The above error message originates in [1]. The problem is that the actual error message is swallowed [2] because l.File is usually non-nil, when invoked from roachtest. (The logger is essentially the root logger and l.File is test.log.) l believe the same issue applies to Put.

[1] https://github.com/cockroachdb/cockroach/blob/master/pkg/roachprod/install/cluster_synced.go#L2034 [2] https://github.com/cockroachdb/cockroach/blob/master/pkg/roachprod/install/cluster_synced.go#L2028

srosenberg avatar Oct 06 '22 04:10 srosenberg

@smg260 I pointed a couple of misplaced comments on https://github.com/cockroachdb/cockroach/pull/88556. Maybe we could fix those here?

renatolabs avatar Oct 06 '22 15:10 renatolabs

A chance of a false positive (via substring) is probably very low, but structured errors should make it cleaner and pave the way for more cases. I am thinking of a switch statement on something like errCategory with cluster creation and ssh errors injecting the corresponding category at the site?

That's definitely the goal, and my thoughts were to introduce that when we switch to cockroachdb/errors. We'd then be able to annotate errors using something like errors.Mark(err, SSHerr) or `errors.Mark(err, ClusterCreationErr) at the point we detect it. (tbg's idea).

So we can keep this PR tight in scope and address just the SSH flakes, or expand it to more generically handle different error types. The PR does at least currently constrain the logic to to determine the owning team in githib.go (as it should).

There is a separate issue to deal with SSH retries: #73542

👍 This was getting a bit gnarly to implement in my test branch

smg260 avatar Oct 10 '22 19:10 smg260

As general food for thought for everyone: IMO, errors like cluster creation and SSH problems are better reported in a more concise way other than a GitHub issue. We just want to be notified about them, but there's little (nothing?) we can do about them other than acknowledge and close the issue when it's opened.

One idea is to post something on Slack (in the existing test-eng-ops) channel with a summary of a roachtest nightly run, including how many tests were skipped due to cluster creation errors, SSH flakes, etc.

I agree that some errors are better reported via other mediums such as Slack, but having the same error reported a shared github issue (or similar) provides a better system of record. Ultimately, where an issue is posting should be agnostic to the caller and the posting code could drive which channels are notified via some coniguration

smg260 avatar Oct 10 '22 19:10 smg260

The second commit switches to using errors/marker api to mark an SSH flake where it is detected in roachprod. This allows us to look for a marker exception and act on it accordingly.

Summary of changes:

  1. switch from []error to []failure in t.mu where failure is a struct containing all errors which were passed as args. The impetus for this was to preserve secondary+ exceptions which are passed into the t.Error/Fatal methods. Currently when t.Error(someErr) is invoked, a new error is created, moving someErr to a secondary error, out of the primary error chain. This makes is very hard to search for the marker which would have been inserted at someErr but is now not visible to errors.Is(..)
  2. Add ErrSSH255 and errClusterProvisioningFailed reference errors (the marker errors). We control what we post in the issue according to if we find these
  3. remove unused members of t.mu
  4. remove t.argsToErr in favour of errors.NewWithDepthf

smg260 avatar Oct 12 '22 00:10 smg260

Test name will be prepended to issue body in the case of cluster creation AND ssh flakes.

Other conversations resolved. Will merge after TC builds are green.

smg260 avatar Oct 24 '22 19:10 smg260

bors r=tbg

smg260 avatar Oct 25 '22 17:10 smg260

Build failed (retrying...):

craig[bot] avatar Oct 25 '22 18:10 craig[bot]

This PR was included in a batch that was canceled, it will be automatically retried

craig[bot] avatar Oct 25 '22 19:10 craig[bot]

bors retry

smg260 avatar Oct 25 '22 19:10 smg260

Already running a review

craig[bot] avatar Oct 25 '22 19:10 craig[bot]

Build succeeded:

craig[bot] avatar Oct 25 '22 21:10 craig[bot]