kv: limit retry count in db.Txn retry loop

Open nvanbenschoten opened this issue 1 year ago • 3 comments

Fixes #113941.

This commit adds a new kv.transaction.internal.max_auto_retries cluster setting, which controls the maximum number of auto-retries a call to kv.DB.Txn will perform before aborting the transaction and returning an error. This is used to prevent infinite retry loops where a transaction has little chance of succeeding in the future but continues to hold locks from earlier epochs.

The setting defaults to a value of 100 retries.
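As a rough illustration of the behavior described above, a bounded retry loop might look like the sketch below. This is not the actual kv.DB.Txn implementation: runWithRetryLimit, errRetryable, and errRetryLimitExceeded are hypothetical names, and the sketch assumes the limit counts retries after the initial attempt.

```go
package main

import (
	"errors"
	"fmt"
)

// errRetryable is a hypothetical stand-in for a retryable transaction error.
var errRetryable = errors.New("retryable transaction error")

// errRetryLimitExceeded is a hypothetical sentinel returned once the retry
// limit has been exhausted.
var errRetryLimitExceeded = errors.New("transaction retry limit exceeded")

// runWithRetryLimit calls fn until it succeeds, fails with a non-retryable
// error, or has been retried maxAutoRetries times after the first attempt.
// Assumption: the limit counts retries, not total attempts.
func runWithRetryLimit(maxAutoRetries int, fn func(attempt int) error) error {
	for attempt := 0; ; attempt++ {
		err := fn(attempt)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errRetryable) {
			// Non-retryable errors are returned to the caller unchanged.
			return err
		}
		if attempt >= maxAutoRetries {
			// Give up instead of looping forever, and surface the last
			// retryable error so the caller can log who was stuck and why.
			return fmt.Errorf("%w after %d retries: %v",
				errRetryLimitExceeded, attempt, err)
		}
	}
}
```

With a limit of 3, a closure that always returns errRetryable is invoked four times (one initial attempt plus three retries) before the wrapped errRetryLimitExceeded comes back. In the real change, this is the point where the transaction would also be aborted so that its locks are released.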

There are two primary benefits to this change:

  1. Internal transactions (those run by jobs, migrations, etc.) that use the kv.DB.Txn API will no longer risk holding locks indefinitely across an unbounded number of epochs if they get stuck in a retry loop. Eventually, they will be aborted and their locks released. This unwedges other transactions that are waiting on those locks, which limits the blast radius of a stuck transaction and keeps an outage it causes from persisting indefinitely.
  2. Internal transactions will no longer get stuck in unobservable infinite retry loops where logging exists above the call to kv.DB.Txn but not below it. In these cases, we previously struggled to identify the retry loop (who was stuck retrying) or to diagnose its cause (why they were stuck retrying). Hitting a retry limit and returning an error up the stack helps debuggers answer both questions.

Release note: None

nvanbenschoten avatar Feb 21 '24 21:02 nvanbenschoten

This change is Reviewable

cockroach-teamcity avatar Feb 21 '24 21:02 cockroach-teamcity

TFTRs!

bors r+

nvanbenschoten avatar Feb 24 '24 00:02 nvanbenschoten

Build failed (retrying...):

craig[bot] avatar Feb 24 '24 00:02 craig[bot]

Looks like CI is red on this one.

bors r-

yuzefovich avatar Feb 24 '24 01:02 yuzefovich

Canceled.

craig[bot] avatar Feb 24 '24 01:02 craig[bot]

Looks like CI is good now, I'll re-engage bors. Enjoy skiing @nvanbenschoten!

bors r+

arulajmani avatar Feb 24 '24 16:02 arulajmani

Build succeeded:

craig[bot] avatar Feb 24 '24 17:02 craig[bot]

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 79219f705fc76eae5753ea5e28b7077dfacfd213 to blathers/backport-release-23.2-119482: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2.x failed. See errors above.


:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] avatar Feb 24 '24 17:02 blathers-crl[bot]