Roachtest redirect SSH flakes to test-eng
This PR inspects the failure output of a roachtest and, if it sees `SSH_PROBLEM`, overrides the owning team to test-eng when reporting the GitHub issue.
Currently, errors are classified as SSH errors by roachprod if the exit code is 255 with an accompanying message prefixed with `SSH_PROBLEM` [1]. The errors are stringified and saved into `t.mu.output`/`failureMsg`. Thus, in the test_runner at the call site of issue posting, we can check `t.mu.output` for `SSH_PROBLEM` and override the team and issue name accordingly.
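For illustration only, a minimal sketch of that check, assuming a plain substring match against the stringified test output; `isSSHFlake` and the literals are hypothetical, not the actual test_runner code:

```go
package main

import (
	"fmt"
	"strings"
)

// isSSHFlake sketches the substring check described above: roachprod
// prefixes SSH failures (exit code 255) with SSH_PROBLEM, so the test
// runner can detect them in the stringified output.
func isSSHFlake(testOutput string) bool {
	return strings.Contains(testOutput, "SSH_PROBLEM")
}

func main() {
	output := "SSH_PROBLEM: exit status 255 while dialing n1"
	owner, issueName := "kv", "kv/some-test"
	if isSSHFlake(output) {
		// Reroute the GitHub issue to test-eng instead of the test's owning team.
		owner, issueName = "test-eng", "ssh_problem"
	}
	fmt.Println(owner, issueName)
}
```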
Resolves: https://github.com/cockroachdb/cockroach/issues/82398
Release justification: test-only change

Release note: none
We can also build on top of https://github.com/cockroachdb/cockroach/pull/88556 for this; I have a DM going with Miral.
@smg260 (correct me if I'm wrong please) is getting https://github.com/cockroachdb/cockroach/pull/88556 in first and then re-working this to build on top of that PR.
I'm unassigning for review so that it doesn't show up in my queue in the meantime.
@tbg Yep. Marking as draft for now.
This is rebased on top of https://github.com/cockroachdb/cockroach/pull/88556 which introduces structured errors.
Detecting the SSH flake via substring still works and keeps this PR simple.
This might be a good place to introduce some retry logic for running the SSH commands
> Detecting the SSH flake via substring still works and keeps this PR simple.
A chance of a false positive (via substring) is probably very low, but structured errors should make it cleaner and pave the way for more cases. I am thinking of a switch statement on something like `errCategory`, with cluster creation and SSH errors injecting the corresponding category at the site?
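To make that idea concrete, a minimal sketch of what such a switch could look like; `errCategory`, its constants, and `ownerFor` are all hypothetical names, not anything that exists in the codebase:

```go
package main

import "fmt"

// errCategory is a hypothetical tag that detection sites (cluster creation,
// SSH command execution) would attach to the failures they produce.
type errCategory int

const (
	otherErr errCategory = iota
	sshErr
	clusterCreationErr
)

// ownerFor is the switch the comment has in mind: infrastructure categories
// get routed to test-eng, everything else keeps the test's registered owner.
func ownerFor(cat errCategory, testOwner string) string {
	switch cat {
	case sshErr, clusterCreationErr:
		return "test-eng"
	default:
		return testOwner
	}
}

func main() {
	fmt.Println(ownerFor(sshErr, "kv"))   // test-eng
	fmt.Println(ownerFor(otherErr, "kv")) // kv
}
```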
> This might be a good place to introduce some retry logic for running the SSH commands
There is a separate issue to deal with SSH retries: https://github.com/cockroachdb/cockroach/issues/73542
I recently spotted another issue... I'm seeing this error message quite often during cluster teardown:

```
teardown: 16:53:25 cluster.go:1149: failed to fetch logs: cluster.Get: get logs failed
```
The above error message originates in [1]. The problem is that the actual error message is swallowed [2] because `l.File` is usually non-nil when invoked from roachtest. (The logger is essentially the root logger and `l.File` is `test.log`.) I believe the same issue applies to `Put`.
[1] https://github.com/cockroachdb/cockroach/blob/master/pkg/roachprod/install/cluster_synced.go#L2034

[2] https://github.com/cockroachdb/cockroach/blob/master/pkg/roachprod/install/cluster_synced.go#L2028
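A rough, self-contained paraphrase of the pattern being pointed out, not the actual cluster_synced.go code; the `logger` and `getLogs` here are stand-ins to show why the detail is lost when `l.File` is non-nil:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// logger is a stand-in for roachprod's logger, with just enough surface to
// show the pattern being described; it is not the real type.
type logger struct {
	File *os.File // test.log when invoked from roachtest
}

func (l *logger) Printf(format string, args ...interface{}) {
	fmt.Printf(format+"\n", args...)
}

// getLogs paraphrases the swallowing pattern: the underlying error is only
// printed when the logger has no file attached, while the caller always gets
// the generic "get logs failed". Under roachtest, l.File is non-nil, so the
// detail never reaches the reported failure.
func getLogs(l *logger, get func() error) error {
	if err := get(); err != nil {
		if l.File == nil {
			l.Printf("get logs failed: %+v", err) // detail surfaces only here
		}
		return errors.New("get logs failed")
	}
	return nil
}

func main() {
	l := &logger{File: os.Stdout} // pretend the root logger has test.log attached
	err := getLogs(l, func() error { return errors.New("scp: connection reset") })
	fmt.Println(err) // prints only: get logs failed
}
```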
@smg260 I pointed out a couple of misplaced comments on https://github.com/cockroachdb/cockroach/pull/88556. Maybe we could fix those here?
> A chance of a false positive (via substring) is probably very low, but structured errors should make it cleaner and pave the way for more cases. I am thinking of a switch statement on something like `errCategory`, with cluster creation and SSH errors injecting the corresponding category at the site?
That's definitely the goal, and my thought was to introduce that when we switch to cockroachdb/errors. We'd then be able to annotate errors using something like `errors.Mark(err, SSHErr)` or `errors.Mark(err, ClusterCreationErr)` at the point we detect them (tbg's idea). So we can keep this PR tight in scope and address just the SSH flakes, or expand it to handle different error types more generically. The PR does at least currently constrain the logic that determines the owning team to `github.go` (as it should).
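As a sketch of that direction, assuming the cockroachdb/errors markers API; `ErrSSHFlake` and `classify` are placeholder names, not the PR's final identifiers:

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

// ErrSSHFlake is a sentinel used purely as a marker. The name is
// illustrative; the PR later introduces ErrSSH255 in roachprod.
var ErrSSHFlake = errors.New("ssh flake")

// classify would run at the detection site (e.g. where roachprod sees an
// exit code of 255): the concrete error keeps its message and chain, and
// simply gains the marker.
func classify(err error) error {
	return errors.Mark(err, ErrSSHFlake)
}

func main() {
	err := classify(errors.New("exit status 255: connection timed out"))
	// At issue-posting time the marker survives wrapping, so ownership can
	// be rerouted to test-eng with a simple errors.Is check.
	wrapped := errors.Wrap(err, "monitor failure")
	fmt.Println(errors.Is(wrapped, ErrSSHFlake)) // true
}
```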
> There is a separate issue to deal with SSH retries: #73542
👍 This was getting a bit gnarly to implement in my test branch
As general food for thought for everyone: IMO, errors like cluster creation and SSH problems are better reported in a more concise way than via a GitHub issue. We just want to be notified about them, but there's little (nothing?) we can do about them other than acknowledge and close the issue when it's opened.
One idea is to post something on Slack (in the existing test-eng-ops channel) with a summary of a roachtest nightly run, including how many tests were skipped due to cluster creation errors, SSH flakes, etc.
I agree that some errors are better reported via other mediums such as Slack, but having the same error reported in a shared GitHub issue (or similar) provides a better system of record. Ultimately, where an issue is posted should be agnostic to the caller, and the posting code could drive which channels are notified via some configuration.
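A tiny sketch of that configuration-driven fan-out idea; `reporterConfig` and `report` are hypothetical names, purely to illustrate the shape of it:

```go
package main

import "fmt"

// reporterConfig sketches the idea above: the posting code, not the caller,
// decides which channels are notified. All names here are hypothetical.
type reporterConfig struct {
	PostGitHubIssue bool   // shared issue as the system of record
	SlackChannel    string // e.g. "#test-eng-ops" for a concise nightly summary
}

// report fans a failure summary out to whatever the configuration enables.
func report(cfg reporterConfig, summary string) {
	if cfg.PostGitHubIssue {
		fmt.Println("would file/append GitHub issue:", summary)
	}
	if cfg.SlackChannel != "" {
		fmt.Printf("would notify %s: %s\n", cfg.SlackChannel, summary)
	}
}

func main() {
	cfg := reporterConfig{PostGitHubIssue: true, SlackChannel: "#test-eng-ops"}
	report(cfg, "3 tests skipped: 2 cluster creation failures, 1 SSH flake")
}
```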
The second commit switches to using the errors marker API to mark an SSH flake where it is detected in roachprod. This allows us to look for a marker error and act on it accordingly.
Summary of changes:
- Switch from `[]error` to `[]failure` in `t.mu`, where `failure` is a struct containing all errors which were passed as args (see the sketch after this list). The impetus for this was to preserve secondary+ errors which are passed into the `t.Error`/`t.Fatal` methods. Currently, when `t.Error(someErr)` is invoked, a new error is created, moving `someErr` to a secondary error, out of the primary error chain. This makes it very hard to search for the marker which would have been inserted at `someErr` but is now not visible to `errors.Is(..)`.
- Add `ErrSSH255` and `errClusterProvisioningFailed` reference errors (the marker errors). We control what we post in the issue according to whether we find these.
- Remove unused members of `t.mu`.
- Remove `t.argsToErr` in favour of `errors.NewWithDepthf`.
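For reference, a minimal sketch of how the first bullet's `failure` struct and the marker errors could interact, assuming the cockroachdb/errors markers API; field and helper names are illustrative, not the PR's exact ones:

```go
package main

import (
	"fmt"

	"github.com/cockroachdb/errors"
)

// failure sketches the struct described in the first bullet: it keeps every
// error passed to t.Error/t.Fatal rather than folding them into a single
// error with "secondary" causes, so a marker placed on any of them stays
// visible to errors.Is.
type failure struct {
	squashedErr error   // the combined error used for display
	errs        []error // the original args, preserved for marker checks
}

// ErrSSH255 stands in for the reference (marker) error mentioned above.
var ErrSSH255 = errors.New("ssh error: exit status 255")

// isSSHFailure reports whether any preserved error carries the SSH marker.
func isSSHFailure(f failure) bool {
	for _, err := range f.errs {
		if errors.Is(err, ErrSSH255) {
			return true
		}
	}
	return false
}

func main() {
	sshErr := errors.Mark(errors.New("dial tcp: i/o timeout"), ErrSSH255)
	f := failure{
		squashedErr: errors.CombineErrors(errors.New("test failed"), sshErr),
		errs:        []error{errors.New("test failed"), sshErr},
	}
	fmt.Println(isSSHFailure(f)) // true
}
```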
The test name will be prepended to the issue body in the case of cluster creation AND SSH flakes.
Other conversations resolved. Will merge after TC builds are green.
bors r=tbg
This PR was included in a batch that was canceled, it will be automatically retried
bors retry
Already running a review