roachtest: weekly/tpcc/headroom failed
roachtest.weekly/tpcc/headroom failed with artifacts on master @ f117eea22dd7be380c7141cdf6cd7aba92dd9c70:
(monitor.go:154).Wait: monitor failure: full command output in run_073655.363388731_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/weekly/tpcc/headroom/run_1
Parameters:
- ROACHTEST_arch=amd64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=16
- ROACHTEST_encrypted=true
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=true
- ROACHTEST_metamorphicBuild=false
- ROACHTEST_ssd=0
Jira issue: CRDB-37891
The workload saw the following error:
Error: error in delivery: ERROR: result is ambiguous: error=r265/3:(n3,s3) is unavailable (circuit breaker tripped): context canceled [propagate] (last error: intent missing: "sql txn" meta={id=58a41f40 key=/Table/111/1/945/1 iso=Serializable pri=0.00913027 epo=0 ts=1713274007.574041942,0 min=1713274007.574041942,0 seq=0} lock=true stat=PENDING rts=1713274007.574041942,0 wto=false gul=1713274008.074041942,0) (SQLSTATE 40003)
This is the same error observed in last week's failure (#121997). The build under test includes #122255, so I'm guessing that change didn't completely fix the issue. Reassigning to KV.
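For readers triaging similar workload failures, here is a minimal sketch of separating the ambiguous case from an ordinary retryable one by SQLSTATE. It assumes only the `(SQLSTATE XXXXX)` suffix visible in the error above and the standard codes 40001 (serialization failure) and 40003 (statement completion unknown). The `SQLState` and `Classify` helpers are hypothetical and are not part of the workload tool.

```go
package main

import (
	"fmt"
	"regexp"
)

// sqlStateRE matches the "(SQLSTATE XXXXX)" suffix seen in the error text.
var sqlStateRE = regexp.MustCompile(`\(SQLSTATE ([0-9A-Z]{5})\)`)

// SQLState extracts the five-character SQLSTATE code from an error message.
// It returns "" when the message carries no code.
func SQLState(msg string) string {
	m := sqlStateRE.FindStringSubmatch(msg)
	if m == nil {
		return ""
	}
	return m[1]
}

// Classify is a hypothetical helper that buckets an error message by code.
// 40001 can be retried safely; 40003 means the outcome is unknown, so a
// blind retry could apply the transaction twice.
func Classify(msg string) string {
	switch SQLState(msg) {
	case "40001":
		return "retryable"
	case "40003":
		return "ambiguous"
	default:
		return "other"
	}
}

func main() {
	msg := "error in delivery: ERROR: result is ambiguous (SQLSTATE 40003)"
	fmt.Println(SQLState(msg), Classify(msg))
}
```

The distinction matters here because the workload treats the ambiguous case as fatal, which is what surfaces as `COMMAND_PROBLEM: exit status 1`.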
@nvanbenschoten, coming back to this again, and I'm wondering if the issue here has to do with the ambiguous nature of the error being returned. The rationale behind why we're marking the error ambiguous is here:
https://github.com/cockroachdb/cockroach/blob/f7178a433f53ddf2002e6aef8b36fa8aa59cda9f/pkg/kv/kvclient/kvcoord/dist_sender.go#L2972-L2976
That doesn't apply to pre-commit QueryIntentRequests though, right? If we're getting an IntentMissingError, we'd expect it to be handled by divideAndSendParallelCommit correctly. This also comes back to what you said on Slack:
"I’m also a little confused about why we set withCommit for the “pre-commit query intent” sub-batch and how this is supposed to work. The batch is read-only, so we should be free to retry it. We should then be handling ambiguity up above in divideAndSendParallelCommit."
I still don't follow the last bit about handling ambiguity in divideAndSendParallelCommit, but I was wondering if a diff like below would work:
--- a/pkg/kv/kvclient/kvcoord/dist_sender.go
+++ b/pkg/kv/kvclient/kvcoord/dist_sender.go
@@ -1393,7 +1393,8 @@ func (ds *DistSender) divideAndSendParallelCommit(
 		// Send the batch with withCommit=true since it will be inflight
 		// concurrently with the EndTxn batch below.
-		reply, pErr := ds.divideAndSendBatchToRanges(ctx, qiBa, qiRS, qiIsReverse, true /* withCommit */, qiBatchIdx)
+		// TODO(XXX): update comment ^.
+		reply, pErr := ds.divideAndSendBatchToRanges(ctx, qiBa, qiRS, qiIsReverse, false /* withCommit */, qiBatchIdx)
 		qiResponseCh <- response{reply: reply, positions: positions, pErr: pErr}
 	}); err != nil {
 		return nil, kvpb.NewError(err)
Removing the GA-blocker because this issue is caused by DistSender circuit breakers, which are being disabled by default. See https://github.com/cockroachdb/cockroach/issues/122983.
roachtest.weekly/tpcc/headroom failed with artifacts on master @ c4ab095c4f65b9140661ed57adddc690b1e3ce3f:
(monitor.go:154).Wait: monitor failure: full command output in run_081416.890588441_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/weekly/tpcc/headroom/run_1
Parameters:
- ROACHTEST_arch=amd64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=16
- ROACHTEST_encrypted=true
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=true
- ROACHTEST_metamorphicBuild=false
- ROACHTEST_ssd=0
We have marked this test failure issue as stale because it has been inactive for 1 month. If this failure is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 5 days to keep the test failure queue tidy.