cockroach icon indicating copy to clipboard operation
cockroach copied to clipboard

sql: retry validation of unique constraint

Open yuzefovich opened this issue 3 years ago • 3 comments
trafficstars

Previously, when running the validation of unique constraint we would execute the query only once, and if that fails for any reason, we'd return an error. This was suboptimal because we might be performing the validation after having performed a lot of work (e.g. running an import), and the error could be a transient one (e.g. if a node that is used in the distributed validation query goes down). In order to improve the situation this commit adds a retry loop with 5 attempts with an allowlist of error that are swallowed.

Fixes: #85711.

Release note: None

yuzefovich avatar Sep 20 '22 22:09 yuzefovich

This change is Reviewable

cockroach-teamcity avatar Sep 20 '22 22:09 cockroach-teamcity

I suppose this is fine. We already have logic in the legacy schema changer to retry transient errors:

https://github.com/cockroachdb/cockroach/blob/d106bd6525e547372cb96c9082a22b19ee966bde/pkg/sql/schema_changer.go#L2627-L2631

It seems that here we're in the importer, so I guess it makes sense. Does import not have checkpoints? What would happen if we just marked the failure as retryable using the below function and let the job registry deal with retrying? https://github.com/cockroachdb/cockroach/blob/1e322dc6b79633dd90a59e9880b9b224da52705d/pkg/jobs/errors.go#L31-L33

ajwerner avatar Sep 22 '22 02:09 ajwerner

I prefer that, if it works, because the user will get visibility, there will be backoff, and we'll eventually succeed.

ajwerner avatar Sep 22 '22 02:09 ajwerner

@dt can you chime in here? I know sql teams own import or whatever, but let's be real, you know the answers and we don't. I do see progress checkpointing here: https://github.com/cockroachdb/cockroach/blob/b8eee78e455ad97f85f6a74c22732d3c5f3ee349/pkg/sql/importer/import_processor_planning.go#L158-L179

ajwerner avatar Sep 23 '22 03:09 ajwerner

I agree, it does look like the import is doing the check-pointing, but there are other users of validateUniqueConstraint which might not - e.g. it is used at the end of the backfill and during restore. Is it worth relying on the caller to retry? To me the answer seems no, but I'm not familiar with these callers to have a strong opinion about it. #85711 indicates that either the import doesn't do a good job of retrying itself or the test harness doesn't.

yuzefovich avatar Sep 30 '22 19:09 yuzefovich

TFTRs!

bors r+

yuzefovich avatar Sep 30 '22 20:09 yuzefovich

Build succeeded:

craig[bot] avatar Sep 30 '22 22:09 craig[bot]

blathers backport 22.2

yuzefovich avatar Sep 30 '22 22:09 yuzefovich