roachtest: fail tests if monitor encounters an error

Open renatolabs opened this issue 1 year ago • 2 comments

This commit updates the roachprod and roachtest monitors to 1) send an event when the monitor is abruptly terminated (i.e., the reader stream sees an EOF when the associated context is not canceled); and 2) return any errors encountered by the roachprod monitor in roachtest, causing the currently running test to fail. The error has TestEng ownership so that teams are not pinged on these kinds of flakes.

The main purpose of this change is for the monitor to fail in situations where the monitored node is preempted by the cloud provider. Previously, these events were ignored, so the test would keep running until it timed out, wasting resources and causing confusing test failures to be reported on GitHub.

Fixes: #118563.

Release note: None

renatolabs avatar Feb 22 '24 20:02 renatolabs

This change is Reviewable

cockroach-teamcity avatar Feb 22 '24 20:02 cockroach-teamcity

Verified that the logic here works by simulating a VM preemption (i.e., running a roachtest and manually deleting one of the VMs on GCE). 0.1 build is also currently in progress.

Do others think this approach is reasonable? I also played with a different approach where roachtest continuously monitors for preempted VMs, but I think it's more general to have the monitor cause the test to fail on errors, which should fix the timeouts we observed with VM preemption.

Let me know!

renatolabs avatar Feb 22 '24 20:02 renatolabs

> Do others think this approach is reasonable? I also played with a different approach where roachtest continuously monitors for preempted VMs, but I think it's more general to have the monitor cause the test to fail on errors, which should fix the timeouts we observed with VM preemption.
>
> Let me know!

I like the current approach for its parsimony! It's also more general, i.e., not specific to preemption. Thus, we can now monitor (pun intended) for these types of "infra flakes" while at the same time reducing the noise due to unattributed test failures.

srosenberg avatar Feb 27 '24 16:02 srosenberg

TFTR!

bors r=srosenberg

renatolabs avatar Feb 28 '24 19:02 renatolabs

Build succeeded:

craig[bot] avatar Feb 28 '24 20:02 craig[bot]