
roachtest: weekly/tpcc/headroom failed

Open cockroach-teamcity opened this issue 10 months ago • 4 comments

roachtest.weekly/tpcc/headroom failed with artifacts on master @ f117eea22dd7be380c7141cdf6cd7aba92dd9c70:

(monitor.go:154).Wait: monitor failure: full command output in run_073655.363388731_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/weekly/tpcc/headroom/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-37891

cockroach-teamcity, Apr 16 '24 13:04

The workload saw the following error:

Error: error in delivery: ERROR: result is ambiguous: error=r265/3:(n3,s3) is unavailable (circuit breaker tripped): context canceled [propagate] (last error: intent missing: "sql txn" meta={id=58a41f40 key=/Table/111/1/945/1 iso=Serializable pri=0.00913027 epo=0 ts=1713274007.574041942,0 min=1713274007.574041942,0 seq=0} lock=true stat=PENDING rts=1713274007.574041942,0 wto=false gul=1713274008.074041942,0) (SQLSTATE 40003)

This is the same error observed in last week's failure (#121997). This run already includes #122255, so I'm guessing that change didn't completely fix the issue. Reassigning to KV.

renatolabs, Apr 16 '24 16:04

@nvanbenschoten, coming back to this again, and I'm wondering if the issue here has to do with the ambiguous nature of the error being returned. The rationale behind why we're marking the error ambiguous is here:

https://github.com/cockroachdb/cockroach/blob/f7178a433f53ddf2002e6aef8b36fa8aa59cda9f/pkg/kv/kvclient/kvcoord/dist_sender.go#L2972-L2976

That doesn't apply to pre-commit QueryIntentRequests though, right? If we're getting an IntentMissingError, we'd expect it to be handled by divideAndSendParallelCommit correctly. This also comes back to what you said on Slack:

"I’m also a little confused about why we set withCommit for the “pre-commit query intent” sub-batch and how this is supposed to work. The batch is read-only, so we should be free to retry it. We should then be handling ambiguity up above in divideAndSendParallelCommit."

I still don't follow the last bit about handling ambiguity in divideAndSendParallelCommit, but I was wondering if a diff like the one below would work:

--- a/pkg/kv/kvclient/kvcoord/dist_sender.go
+++ b/pkg/kv/kvclient/kvcoord/dist_sender.go
@@ -1393,7 +1393,8 @@ func (ds *DistSender) divideAndSendParallelCommit(

                // Send the batch with withCommit=true since it will be inflight
                // concurrently with the EndTxn batch below.
-               reply, pErr := ds.divideAndSendBatchToRanges(ctx, qiBa, qiRS, qiIsReverse, true /* withCommit */, qiBatchIdx)
+               // TODO(XXX): update comment ^.
+               reply, pErr := ds.divideAndSendBatchToRanges(ctx, qiBa, qiRS, qiIsReverse, false /* withCommit */, qiBatchIdx)
                qiResponseCh <- response{reply: reply, positions: positions, pErr: pErr}
        }); err != nil {
                return nil, kvpb.NewError(err)

arulajmani, Apr 16 '24 18:04

Removing the GA-blocker because this issue is caused by DistSender circuit breakers, which are being disabled by default. See https://github.com/cockroachdb/cockroach/issues/122983.

arulajmani, Apr 24 '24 16:04

roachtest.weekly/tpcc/headroom failed with artifacts on master @ c4ab095c4f65b9140661ed57adddc690b1e3ce3f:

(monitor.go:154).Wait: monitor failure: full command output in run_081416.890588441_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/weekly/tpcc/headroom/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

cockroach-teamcity, Apr 27 '24 07:04

We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.

github-actions[bot], Jun 17 '24 10:06