atlasdb icon indicating copy to clipboard operation
atlasdb copied to clipboard

[TimeLock Serialisation] Phase 3b2: Lethargy of the Relentless Dusk (Client Side)

Open jeremyk-91 opened this issue 3 years ago • 1 comments

General

Before this PR: It isn't actually feasible to wire up atlasdb proxy to autobatch its unlock requests

After this PR:

==COMMIT_MSG== Unlock requests across multiple namespaces in atlasdb proxy can be batched by wiring up user config ==COMMIT_MSG==

Priority: P2 ($)

Concerns / possible downsides (what feedback would you like?): User latencies might increase by a bit (but this is order of few millis, not particularly high)

Is documentation needed?: Not really

Compatibility

Does this PR create any API breaks (e.g. at the Java or HTTP layers) - if so, do we have compatibility?: No

Does this PR change the persisted format of any data - if so, do we have forward and backward compatibility?: No

The code in this PR may be part of a blue-green deploy. Can upgrades from previous versions safely coexist? (Consider restarts of blue or green nodes.): Yes - they just won't use the new endpoint

Does this PR rely on statements being true about other products at a deployment - if so, do we have correct product dependencies on these products (or other ways of verifying that these statements are true)?: Timelock to be at least a given version so that the batch endpoint is supported. But this is only true on atlasdb proxy and not on services in general, so this is not the right place to add a product dependency.

Does this PR need a schema migration? No.

Testing and Correctness

What, if any, assumptions are made about the current state of the world? If they change over time, how will we find out?: Not much.

What was existing testing like? What have you done to improve it?: I'm relying on the existing tests for the refactored path for Legacy. The more complex implementation already has tests for the most part and will be ETE'd in atlasdb-proxy.

If this PR contains complex concurrent or asynchronous code, is it correct? The onus is on the PR writer to demonstrate this.: No such code

If this PR involves acquiring locks or other shared resources, how do we ensure that these are always released?: No

Execution

How would I tell this PR works in production? (Metrics, logs, etc.): Metrics showing usage of new server-side endpoint

Has the safety of all log arguments been decided correctly?: No new logging args

Will this change significantly affect our spending on metrics or logs?: No

How would I tell that this PR does not work in production? (monitors, etc.): Standard failure monitors.

If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?: Rollback.

If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):

Scale

Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.: No.

Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?: No.

Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?: No, I don't think so.

Development Process

Where should we start reviewing?: NamespacedLockTokenUnlocker. In general, develop or regain familiarity with the existing infrastructure for cross-namespace batching, and then check that this checks the right boxes.

If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?: No.

Please tag any other people who should be aware of this PR: @jeremyk-91 @sverma30 @raiju

jeremyk-91 avatar Aug 10 '22 18:08 jeremyk-91

Yeah, smoke testing in progress. Can add one or two for the wiring

jeremyk-91 avatar Aug 11 '22 13:08 jeremyk-91

Released 0.685.0

svc-autorelease avatar Aug 16 '22 16:08 svc-autorelease