
Mitigate contention with dense, high-tempo operations

Open BenjaminPelletier opened this issue 3 years ago • 0 comments

When prober's scd/test_operation_simple_heavy_traffic_concurrent.py test runs with 100 concurrent operations against a real-world CRDB cluster distributed across data centers, one operation mutation often fails (as many as 3 failures have been observed) with one of the contention-type errors, usually ABORT_REASON_PUSHER_ABORTED.

Even when the number of concurrent operations is reduced to 40, a failure still appears to occur occasionally, even with 50 retries. This is not yet verified: the failure was observed before #740 was merged, and #740 is the change that confirms the 50 retries.
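For illustration, the bounded-retry behavior mentioned above can be sketched roughly as follows. This is only a minimal sketch and not the DSS implementation: `RetryableContentionError` and `run_with_retries` are hypothetical names, the limit of 50 is taken from the description above, and the jittered exponential backoff (disabled by default) is an assumption about one way concurrent retries could be spread out rather than something the DSS is known to do.

```python
import random
import time


class RetryableContentionError(Exception):
    """Hypothetical stand-in for a contention-type error such as
    ABORT_REASON_PUSHER_ABORTED surfaced by the datastore."""


def run_with_retries(txn, max_retries=50, base_delay=0.0, max_delay=0.0,
                     sleep=time.sleep):
    """Call txn() until it succeeds or max_retries retries are exhausted.

    With the default delays of 0 this retries immediately. Non-zero values
    enable exponential backoff with full jitter (an assumption, see above).
    """
    retries = 0
    while True:
        try:
            return txn()
        except RetryableContentionError:
            retries += 1
            if retries > max_retries:
                # Out of retries: surface the contention error to the caller.
                raise
            # Cap the exponential delay, then pick a random point below it
            # so concurrent callers do not all retry at the same instant.
            cap = min(max_delay, base_delay * (2 ** (retries - 1)))
            sleep(random.uniform(0, cap))
```

A caller would wrap each operation mutation in a function and pass it as `txn`. If failures persist even with 50 retries, as suspected above, that would suggest the retries themselves keep colliding, which is the situation jitter is meant to address.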

While support for even 10-20 concurrent operations should serve all foreseeable medium-term deployments, we should understand this failure better and identify a mitigation to enable future scaling. In the short term, we should reduce the number of concurrent operations in the test so that its acceptance criteria better align with current product needs. In the long term, we should mitigate the issue and restore the higher concurrency limit.

BenjaminPelletier, Mar 16 '22 22:03