
roachtest: tpcc-nowait/isolation-level=mixed/nodes=3/w=1 failed

cockroach-teamcity opened this issue 1 year ago

roachtest.tpcc-nowait/isolation-level=mixed/nodes=3/w=1 failed with artifacts on master @ 04f0416d526a43741d22fd03966758dcccdeb79f:

(monitor.go:153).Wait: monitor failure: full command output in run_115730.097711628_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc-nowait/isolation-level=mixed/nodes=3/w=1/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

cockroach-teamcity commented on Feb 22 '24

TransactionAbortedError when running workload with --isolation-level=read_uncommitted --txn-retries=false:

Error: error in delivery: ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED): "sql txn" meta={id=d757695b key=/Table/111/1/2/1 iso=ReadCommitted pri=0.00665952 epo=0 ts=1708603167.770875286,0 min=1708603167.585376790,0 seq=0} lock=true stat=ABORTED rts=1708603167.770875286,0 wto=false gul=1708603168.085376790,0 (SQLSTATE 40001)

Pinging @cockroachdb/kv.

renatolabs commented on Feb 22 '24
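For context: with --txn-retries=false the workload does not wrap each transaction in a client-side retry loop, so the first statement to fail with SQLSTATE 40001 (such as the TransactionAbortedError above) is reported directly and fails the run. Below is a rough sketch of the kind of retry loop being disabled, written against generic database/sql and lib/pq rather than the workload's actual code (the helper name is illustrative):

```go
package main

import (
	"context"
	"database/sql"
	"errors"

	"github.com/lib/pq"
)

// runWithRetry re-runs fn in a fresh transaction whenever it fails with
// SQLSTATE 40001 (a retryable serialization/abort error). This is the kind
// of client-side loop that --txn-retries=false turns off.
func runWithRetry(ctx context.Context, db *sql.DB, fn func(*sql.Tx) error) error {
	for {
		tx, err := db.BeginTx(ctx, nil)
		if err != nil {
			return err
		}
		err = fn(tx)
		if err == nil {
			err = tx.Commit()
		}
		if err == nil {
			return nil
		}
		_ = tx.Rollback()
		var pqErr *pq.Error
		if errors.As(err, &pqErr) && pqErr.Code == "40001" {
			continue // retryable: start the whole transaction over
		}
		return err
	}
}
```

Without a loop like this, a single 40001 is enough to fail the run with exit status 1.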

@arulajmani can you please take a look?

shralex commented on Feb 26 '24

I'm removing the release blocker label here. We're running this test without retry loops. The test failed because a pusher noticed someone aborted its transaction -- I'll see if I can figure out why this happened and report back, but this isn't the kind of thing we'd block releases over.

arulajmani commented on Feb 27 '24

roachtest.tpcc-nowait/isolation-level=mixed/nodes=3/w=1 failed with artifacts on master @ 455b16592df7d8efd121b3ba1256fb477e227564:

(monitor.go:154).Wait: monitor failure: full command output in run_110849.001451613_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc-nowait/isolation-level=mixed/nodes=3/w=1/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

cockroach-teamcity commented on Mar 14 '24

This last failure was different:

Error: error in newOrder: ERROR: internal error: unexpected error from the vectorized engine: null_value_error: non-nullable column "customer:c_last" with no value! Index scanned was "customer_pkey" with the index key columns (c_w_id,c_d_id,c_id) and the values (?,?,?)

cc @cockroachdb/sql-queries

renatolabs commented on Mar 14 '24

Thanks @renatolabs, that's concerning. I'm going to mark as a GA-blocker until we know more.

mgartner commented on Mar 15 '24

> Thanks @renatolabs, that's concerning. I'm going to mark as a GA-blocker until we know more.

Actually, I suspect this might be part of yesterday's bad run. @stevendanna?

renatolabs commented on Mar 15 '24

roachtest.tpcc-nowait/isolation-level=mixed/nodes=3/w=1 failed with artifacts on master @ ec9444f8287663777c5ebceff61bcd280491ba08:

(monitor.go:154).Wait: monitor failure: full command output in run_110920.314162169_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc-nowait/isolation-level=mixed/nodes=3/w=1/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

cockroach-teamcity commented on Mar 27 '24

roachtest.tpcc-nowait/isolation-level=mixed/nodes=3/w=1 failed with artifacts on master @ 7fc4c7bcbbf0c75a62d056da0bf79a5a32714650:

(monitor.go:154).Wait: monitor failure: full command output in run_110656.692822069_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc-nowait/isolation-level=mixed/nodes=3/w=1/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

cockroach-teamcity commented on Apr 01 '24

roachtest.tpcc-nowait/isolation-level=mixed/nodes=3/w=1 failed with artifacts on release-24.1 @ d141d2c700eb1e323f97e15831c3b9915aba9248:

(monitor.go:154).Wait: monitor failure: full command output in run_114112.092215343_n4_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc-nowait/isolation-level=mixed/nodes=3/w=1/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

cockroach-teamcity commented on Apr 06 '24

In each of these cases, we are seeing a Read Committed transaction get aborted with a TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED) error during the delivery transaction. I'm wondering whether this is a consequence of a lock ordering deadlock and deadlock detection kicking in. https://github.com/cockroachdb/cockroach/pull/122051 will help us to determine this.

nvanbenschoten commented on Apr 09 '24
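To make the lock-ordering hypothesis concrete, here is a contrived sketch (hypothetical table t and helper, not the TPC-C schema or the workload's code) of how two transactions can deadlock by acquiring SELECT FOR UPDATE locks on the same rows in opposite orders; deadlock detection then breaks the cycle by force-pushing (aborting) one side, which surfaces as a 40001 like the one seen above:

```go
package main

import (
	"context"
	"database/sql"
)

// lockInOrder locks two rows of a hypothetical table t, one after the other.
// If transaction A runs lockInOrder(ctx, db, 1, 2) while transaction B runs
// lockInOrder(ctx, db, 2, 1), each acquires its first row lock and then waits
// on the other's, forming a cycle that deadlock detection must break by
// aborting one side.
func lockInOrder(ctx context.Context, db *sql.DB, first, second int) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer func() { _ = tx.Rollback() }()

	for _, id := range []int{first, second} {
		if _, err := tx.ExecContext(ctx,
			`SELECT v FROM t WHERE id = $1 FOR UPDATE`, id); err != nil {
			// The victim side observes a 40001 here once the cycle is broken.
			return err
		}
	}
	return tx.Commit()
}
```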

I was able to reproduce with the new workload and cockroach logging.

We again see the delivery transaction get aborted.

Error: error in delivery: select new_order failed: ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_PUSHER_ABORTED): "sql txn" meta={id=2f49b32d key=/Table/111/1/4/1 iso=ReadCommitted pri=0.00224044 epo=0 ts=1712776776.055651390,0 min=1712776776.006892555,0 seq=9} lock=true stat=ABORTED rts=1712776776.055651390,0 wto=false gul=1712776776.506892555,0 (SQLSTATE 40001)

This time, we know it's while executing this statement: https://github.com/cockroachdb/cockroach/blob/f371c58a07cda6f6b0e4bb7d5c42a8b034050677/pkg/workload/tpcc/delivery.go#L97

We see from the vmodule logs that the transaction was, in fact, involved in a deadlock:

I240410 19:19:36.163168 4735693 kv/kvserver/txnwait/queue.go:757 ⋮ [T1,Vsystem,n2,s2,r109/2:‹/Table/11{1/1/3/7/…-2/1}›] 293522  ‹18ca5f06› breaking deadlock by force push of ‹2f49b32d›; dependencies=[‹2f49b32d› ‹18ca5f06›]
I240410 19:19:36.163189 4735693 kv/kvserver/txnwait/queue.go:1014 ⋮ [T1,Vsystem,n2,s2,r109/2:‹/Table/11{1/1/3/7/…-2/1}›] 293523  force pushing ‹2f49b32d-29fa-41a6-a260-877d1c1e3875› to break deadlock
I240410 19:19:36.163426 4735693 kv/kvserver/batcheval/cmd_push_txn.go:253 ⋮ [T1,Vsystem,n2,s2,r109/2:‹/Table/11{1/1/3/7/…-2/1}›] 293527  18ca5f06 pushed (push type=PUSH_ABORT) 2f49b32d: forced push (pushee last active: 1712776776.006892555,0)

We don't know much about the other transaction in the deadlock cycle, but we do see:

I240410 19:19:36.156243 3650212 kv/kvserver/batcheval/cmd_push_txn.go:304 ⋮ [T1,Vsystem,n2,s2,r109/2:‹/Table/11{1/1/3/7/…-2/1}›] 293504  failed to push meta={id=18ca5f06 key=/Table/111/1/‹4›/‹1› iso=Serializable pri=0.00930210 epo=0 ts=1712776775.999011367,0 min=1712776775.999011367,0 seq=0} lock=true stat=PENDING rts=0,0 wto=false gul=0,0
I240410 19:19:36.163063 4735693 kv/kvserver/txnwait/queue.go:723 ⋮ [T1,Vsystem,n2,s2,r109/2:‹/Table/11{1/1/3/7/…-2/1}›] 293519  pusher was updated: "sql txn" meta={id=18ca5f06 key=/Table/111/1/‹4›/‹1› iso=Serializable pri=0.00930210 epo=0 ts=1712776775.999011367,0 min=1712776775.999011367,0 seq=30} lock=true stat=PENDING rts=1712776775.999011367,0 wto=false gul=1712776776.499011367,0

So the other transaction is Serializable. Based on the transaction's key, it also appears to be a delivery transaction.

nvanbenschoten commented on Apr 10 '24

One thought is that we may be seeing deadlocks due to the best-effort nature of SELECT FOR UPDATE locks acquired under Serializable isolation. This could allow an unreplicated lock to be lost (e.g. one acquired by selectNewOrder), leading to a lock ordering inversion.

I'll switch enable_durable_locking_for_serializable to true and see if I can reproduce.

nvanbenschoten commented on Apr 10 '24
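For reference, a minimal sketch of flipping that setting, assuming the enable_durable_locking_for_serializable session variable named above; the helper and the use of ALTER ROLE ALL SET (to change the default for new sessions, rather than a per-session SET) are illustrative. With it enabled, SELECT FOR UPDATE locks acquired by Serializable transactions should be replicated rather than best-effort, so a repro with it on would weaken the lost-lock explanation:

```go
package main

import (
	"context"
	"database/sql"
)

// enableDurableLocking changes the role-level default so that new sessions
// run with durable (replicated) SELECT FOR UPDATE locking for Serializable
// transactions, instead of best-effort unreplicated locks. A plain
// `SET enable_durable_locking_for_serializable = true` would affect only the
// current session.
func enableDurableLocking(ctx context.Context, db *sql.DB) error {
	_, err := db.ExecContext(ctx,
		`ALTER ROLE ALL SET enable_durable_locking_for_serializable = 'true'`)
	return err
}
```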