roachtest: tpcc-nowait/isolation-level=mixed/nodes=3/w=1 failed
roachtest.tpcc-nowait/isolation-level=mixed/nodes=3/w=1 failed with artifacts on master @ 04f0416d526a43741d22fd03966758dcccdeb79f:
(monitor.go:153).Wait: monitor failure: full command output in run_115730.097711628_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc-nowait/isolation-level=mixed/nodes=3/w=1/run_1
Parameters:
- ROACHTEST_arch=amd64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=16
- ROACHTEST_encrypted=true
- ROACHTEST_metamorphicBuild=false
- ROACHTEST_ssd=0
TransactionAbortedError when running workload with --isolation-level=read_uncommitted --txn-retries=false:
Error: error in delivery: ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED): "sql txn" meta={id=d757695b key=/Table/111/1/2/1 iso=ReadCommitted pri=0.00665952 epo=0 ts=1708603167.770875286,0 min=1708603167.585376790,0 seq=0} lock=true stat=ABORTED rts=1708603167.770875286,0 wto=false gul=1708603168.085376790,0 (SQLSTATE 40001)
Pinging @cockroachdb/kv.
@arulajmani can you please take a look?
I'm removing the release blocker label here. We're running this test without retry loops. The test failed because a pusher noticed someone aborted its transaction -- I'll see if I can figure out why this happened and report back, but this isn't the kind of thing we'd block releases over.
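For reference, here is a minimal sketch (not the roachtest workload code) of the kind of client-side retry loop that --txn-retries=false disables. With retries off, a SQLSTATE 40001 such as the TransactionAbortedError above is reported directly instead of being retried transparently. The connection string, driver choice, and transaction body below are placeholders.

```go
// Minimal sketch of a client-side retry loop for CockroachDB restart errors
// (SQLSTATE 40001). Not the workload's actual code; DSN and statements are
// placeholders.
package main

import (
	"context"
	"database/sql"
	"errors"
	"log"

	"github.com/jackc/pgx/v5/pgconn"
	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

// runWithRetry retries fn whenever the server asks the client to restart the
// transaction (SQLSTATE 40001, e.g. TransactionRetryWithProtoRefreshError).
func runWithRetry(ctx context.Context, db *sql.DB, fn func(*sql.Tx) error) error {
	for {
		tx, err := db.BeginTx(ctx, nil)
		if err != nil {
			return err
		}
		if err := fn(tx); err != nil {
			_ = tx.Rollback()
			var pgErr *pgconn.PgError
			if errors.As(err, &pgErr) && pgErr.Code == "40001" {
				continue // retryable: start a fresh attempt
			}
			return err // with retries disabled, this is what the workload reports
		}
		// NB: a production loop would also inspect the Commit error, which can
		// itself be a 40001.
		return tx.Commit()
	}
}

func main() {
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = runWithRetry(context.Background(), db, func(tx *sql.Tx) error {
		// Placeholder for the delivery transaction's statements.
		_, err := tx.ExecContext(context.Background(), "SELECT 1")
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

The cockroach-go library's crdb.ExecuteTx helper implements essentially this loop for applications that do want automatic retries.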
roachtest.tpcc-nowait/isolation-level=mixed/nodes=3/w=1 failed with artifacts on master @ 455b16592df7d8efd121b3ba1256fb477e227564:
(monitor.go:154).Wait: monitor failure: full command output in run_110849.001451613_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc-nowait/isolation-level=mixed/nodes=3/w=1/run_1
Parameters:
- ROACHTEST_arch=amd64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=16
- ROACHTEST_encrypted=false
- ROACHTEST_metamorphicBuild=false
- ROACHTEST_ssd=0
This last failure was different:
Error: error in newOrder: ERROR: internal error: unexpected error from the vectorized engine: null_value_error: non-nullable column "customer:c_last" with no value! Index scanned was "customer_pkey" with the index key columns (c_w_id,c_d_id,c_id) and the values (?,?,?)
cc @cockroachdb/sql-queries
Thanks @renatolabs, that's concerning. I'm going to mark as a GA-blocker until we know more.
Actually, I suspect there's a good chance this might be part of yesterday's bad run. @stevendanna?
roachtest.tpcc-nowait/isolation-level=mixed/nodes=3/w=1 failed with artifacts on master @ ec9444f8287663777c5ebceff61bcd280491ba08:
(monitor.go:154).Wait: monitor failure: full command output in run_110920.314162169_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc-nowait/isolation-level=mixed/nodes=3/w=1/run_1
Parameters:
- ROACHTEST_arch=amd64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=16
- ROACHTEST_encrypted=false
- ROACHTEST_metamorphicBuild=false
- ROACHTEST_ssd=0
roachtest.tpcc-nowait/isolation-level=mixed/nodes=3/w=1 failed with artifacts on master @ 7fc4c7bcbbf0c75a62d056da0bf79a5a32714650:
(monitor.go:154).Wait: monitor failure: full command output in run_110656.692822069_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc-nowait/isolation-level=mixed/nodes=3/w=1/run_1
Parameters:
- ROACHTEST_arch=amd64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=16
- ROACHTEST_encrypted=false
- ROACHTEST_metamorphicBuild=false
- ROACHTEST_ssd=0
roachtest.tpcc-nowait/isolation-level=mixed/nodes=3/w=1 failed with artifacts on release-24.1 @ d141d2c700eb1e323f97e15831c3b9915aba9248:
(monitor.go:154).Wait: monitor failure: full command output in run_114112.092215343_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc-nowait/isolation-level=mixed/nodes=3/w=1/run_1
Parameters:
- ROACHTEST_arch=amd64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=16
- ROACHTEST_encrypted=false
- ROACHTEST_metamorphicBuild=false
- ROACHTEST_ssd=0
In each of these cases, we are seeing a Read Committed transaction get aborted with a TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED) error during the delivery transaction. I'm wondering whether this is a consequence of a lock ordering deadlock and deadlock detection kicking in. https://github.com/cockroachdb/cockroach/pull/122051 will help us determine this.
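To make the suspected failure mode concrete, here is a hypothetical, generic illustration (not the tpcc delivery code) of a lock ordering deadlock: two transactions take SELECT FOR UPDATE locks on the same two rows in opposite order, and deadlock detection force-pushes one of them, which the client sees as SQLSTATE 40001. The table t, its rows, and the connection string are assumptions for the sketch.

```go
// Generic illustration of a lock-ordering deadlock between two transactions
// using SELECT FOR UPDATE. Assumes:
//   CREATE TABLE t (id INT PRIMARY KEY); INSERT INTO t VALUES (1), (2);
// and a placeholder connection string.
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"
	"sync"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

func lockInOrder(db *sql.DB, first, second int) error {
	ctx := context.Background()
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	if _, err := tx.ExecContext(ctx, "SELECT id FROM t WHERE id = $1 FOR UPDATE", first); err != nil {
		return err
	}
	time.Sleep(time.Second) // give the other transaction time to grab its first lock
	if _, err := tx.ExecContext(ctx, "SELECT id FROM t WHERE id = $1 FOR UPDATE", second); err != nil {
		return err // the force-pushed side fails here with SQLSTATE 40001
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var wg sync.WaitGroup
	for _, order := range [][2]int{{1, 2}, {2, 1}} { // opposite lock orders => wait-for cycle
		wg.Add(1)
		go func(a, b int) {
			defer wg.Done()
			if err := lockInOrder(db, a, b); err != nil {
				fmt.Println("transaction aborted:", err)
			}
		}(order[0], order[1])
	}
	wg.Wait()
}
```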
I was able to reproduce with the new workload and cockroach logging.
We again see a transaction get aborted during the delivery transaction.
Error: error in delivery: select new_order failed: ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED): "sql txn" meta={id=2f49b32d key=/Table/111/1/4/1 iso=ReadCommitted pri=0.00224044 epo=0 ts=1712776776.055651390,0 min=1712776776.006892555,0 seq=9} lock=true stat=ABORTED rts=1712776776.055651390,0 wto=false gul=1712776776.506892555,0 (SQLSTATE 40001)
This time, we know it's while executing this statement: https://github.com/cockroachdb/cockroach/blob/f371c58a07cda6f6b0e4bb7d5c42a8b034050677/pkg/workload/tpcc/delivery.go#L97
We see from the vmodule logs that the transaction was, in fact, involved in a deadlock:
I240410 19:19:36.163168 4735693 kv/kvserver/txnwait/queue.go:757 ⋮ [T1,Vsystem,n2,s2,r109/2:‹/Table/11{1/1/3/7/…-2/1}›] 293522 ‹18ca5f06› breaking deadlock by force push of ‹2f49b32d›; dependencies=[‹2f49b32d› ‹18ca5f06›]
I240410 19:19:36.163189 4735693 kv/kvserver/txnwait/queue.go:1014 ⋮ [T1,Vsystem,n2,s2,r109/2:‹/Table/11{1/1/3/7/…-2/1}›] 293523 force pushing ‹2f49b32d-29fa-41a6-a260-877d1c1e3875› to break deadlock
I240410 19:19:36.163426 4735693 kv/kvserver/batcheval/cmd_push_txn.go:253 ⋮ [T1,Vsystem,n2,s2,r109/2:‹/Table/11{1/1/3/7/…-2/1}›] 293527 18ca5f06 pushed (push type=PUSH_ABORT) 2f49b32d: forced push (pushee last active: 1712776776.006892555,0)
We don't know much about the other transaction in the deadlock cycle, but we do see:
I240410 19:19:36.156243 3650212 kv/kvserver/batcheval/cmd_push_txn.go:304 ⋮ [T1,Vsystem,n2,s2,r109/2:‹/Table/11{1/1/3/7/…-2/1}›] 293504 failed to push meta={id=18ca5f06 key=/Table/111/1/‹4›/‹1› iso=Serializable pri=0.00930210 epo=0 ts=1712776775.999011367,0 min=1712776775.999011367,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0
I240410 19:19:36.163063 4735693 kv/kvserver/txnwait/queue.go:723 ⋮ [T1,Vsystem,n2,s2,r109/2:‹/Table/11{1/1/3/7/…-2/1}›] 293519 pusher was updated: "sql txn" meta={id=18ca5f06 key=/Table/111/1/‹4›/‹1› iso=Serializable pri=0.00930210 epo=0 ts=1712776775.999011367,0 min=1712776775.999011367,0 seq=30} lock=true stat=PENDING rts=1712776775.999011367,0 wto=false gul=1712776776.499011367,0
So the other transaction is Serializable. Based on the transaction's key, it also appears to be a delivery transaction.
One thought is that we may be seeing deadlocks due to the best-effort nature of SELECT FOR UPDATE locks acquired under Serializable isolation. This could allow an unreplicated lock to be lost (e.g. one acquired by selectNewOrder), leading to a lock ordering inversion.
I'll switch enable_durable_locking_for_serializable to true and see if I can reproduce.
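For concreteness, a minimal sketch (connection string is a placeholder) of flipping that session setting on a pinned connection before running the statements under test:

```go
// Minimal sketch: enable durable (replicated) SELECT FOR UPDATE locks for
// serializable transactions on one session before running the statements
// under test. The connection string is a placeholder.
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

func main() {
	ctx := context.Background()
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// With a pooled *sql.DB, a SET only applies to the connection it runs on,
	// so pin a single connection for the experiment.
	conn, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx, "SET enable_durable_locking_for_serializable = true"); err != nil {
		log.Fatal(err)
	}
	// ... run the delivery transaction on this connection ...
}
```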