Coordinate multiple Apps deployments

Open mzabani opened this issue 3 years ago • 1 comments

When deploying several different apps that use the same DB instance, it would be very useful to run codd in all of them instead of orchestrating a separate, earlier codd up step. Ideally, codd up would work like this:

  • Only one instance of codd would run the migrations while the others wait, so no duplicated work occurs
  • If there are pending no-txn migrations, the instances of codd that had been waiting may end up running on an empty set of pending migrations and then perform a hard check, while the migrations runner instance performs a soft one, leading to inconsistent behavior when checksums mismatch.
    • Skipping checksums verification in such cases might be a possibility, albeit a possibly dangerous one?
  • More importantly, when the migrations runner instance is ready to commit, it is possible other apps aren't up yet, so it might want to coordinate with others before beginning.
    • This could require users to register "instance names", otherwise how could we know there are others to wait for?
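The "instance names" idea above can be sketched as a small registry the runner consults before starting. This is only an illustration: `InstanceRegistry`, `check_in` and `wait_for_all` are hypothetical names, not part of codd, and the in-memory state stands in for whatever shared store (most likely a table in the database itself) a real implementation would use.

```python
import threading


class InstanceRegistry:
    """Hypothetical in-memory stand-in for shared state about registered instances."""

    def __init__(self, expected_names):
        self._expected = set(expected_names)  # names the user registered up front
        self._seen = set()                    # names that have reported being up/warm
        self._cond = threading.Condition()

    def check_in(self, name):
        """Called by an instance once it is up and warm."""
        if name not in self._expected:
            raise ValueError(f"unknown instance name: {name}")
        with self._cond:
            self._seen.add(name)
            self._cond.notify_all()

    def wait_for_all(self, timeout_secs):
        """Block until every expected instance checked in; False on timeout."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._seen == self._expected, timeout=timeout_secs
            )

    def missing(self):
        """Names that have not checked in yet, for error messages."""
        with self._cond:
            return sorted(self._expected - self._seen)


if __name__ == "__main__":
    registry = InstanceRegistry(["api", "worker"])
    for name in ["api", "worker"]:
        threading.Thread(target=registry.check_in, args=(name,)).start()
    if registry.wait_for_all(timeout_secs=5):
        print("everyone is up; the runner may start migrating")
    else:
        print("timed out waiting for:", registry.missing())
```

Without a registered list of names the runner has no way to tell "nobody else exists" apart from "nobody else has started yet", which is why the sketch takes the expected names as an explicit input.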

Let's try to cover an "advanced scenario": N apps with the same DB being deployed, and we want it blue-green. Migrations should only start running when all apps are warm, but before they are healthy (for health checks).

  1. All N instances are deployed, and the one selected to run migrations waits for every other instance to be up and warm.
    • Maybe something like codd up-multi --wait-for-others Nsecs for the migrations runner and codd up-multi --wait-for-runner Nsecs for the others?
    • Detect when all instances ran with --wait-for-runner and fail with an exception in that case.
    • If more than one instance ran with --wait-for-others, fail or race them to select a winner? Racing them would fail on fewer occasions, but why would a user want this if they're likely going to set up logging infrastructure to watch the logs of a specific instance, for example? So it's most likely a mistake.
  2. When the runner knows everyone's up/warm, it runs the migrations, commits, verifies etc. just like single-instance behavior.
  3. Instances with --wait-for-runner watch/poll the running process to make sure the runner is alive.
    3.1. If the runner successfully commits and passes its verifications, the waiters should succeed too. This means they do not check checksums, because the runner has either already checked them or has not (the latter is possible with no-txn migrations). In any case, if the runner succeeds, so should everyone else.
    3.2. If the runner fails, everyone should also fail.
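The role checks and outcome rules in these steps can be written down as two small pure functions. This is a rough sketch under assumptions: the flag names come from the proposal above and do not exist in codd, the "fail" option is chosen over racing when several runners show up, and `select_runner`, `waiter_outcome` and `CoordinationError` are hypothetical names.

```python
from enum import Enum


class Role(Enum):
    RUNNER = "--wait-for-others"  # proposed flag for the migrations runner
    WAITER = "--wait-for-runner"  # proposed flag for every other instance


class CoordinationError(Exception):
    """Raised when the set of roles cannot lead to a successful deployment."""


def select_runner(roles_by_instance):
    """Return the name of the single runner, or raise if roles are inconsistent."""
    runners = sorted(
        name for name, role in roles_by_instance.items() if role is Role.RUNNER
    )
    if not runners:
        # Everyone is waiting for a runner, so nobody would ever run migrations.
        raise CoordinationError("all instances ran with --wait-for-runner")
    if len(runners) > 1:
        # Assumption: treat this as a mistake instead of racing for a winner.
        raise CoordinationError("more than one runner: " + ", ".join(runners))
    return runners[0]


def waiter_outcome(runner_alive, runner_exit_code):
    """Exit code a waiter should use, or None if it should keep polling.

    Waiters never verify checksums themselves; they only mirror the runner.
    """
    if runner_exit_code is not None:
        return 0 if runner_exit_code == 0 else 1  # steps 3.1 and 3.2
    if not runner_alive:
        return 1  # runner vanished without reporting a result: treat as failure
    return None   # runner still working: keep waiting
```

For example, `select_runner({"api": Role.RUNNER, "worker": Role.WAITER})` picks `"api"`, and a waiter that observes `waiter_outcome(True, None)` simply polls again.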

mzabani, Apr 02 '21 17:04

This problem has further complications. Suppose an app is deployed very frequently. Say the app is deployed and its multiple instances of codd are waiting on each other, or even running migrations, when another cluster is deployed with new changes+migrations in the meantime (and yes, as soon as the newest cluster is deployed, the first one will get shut down because it's outdated).

After much thought about how to work around this, I've decided to just document that codd can't handle simultaneous cluster deployments.

mzabani, Sep 17 '21 20:09