
Migrations won't run on not-yet-replicated docs

Open garethbowen opened this issue 9 years ago • 20 comments

Currently migrations run on the server when api starts up, then they get marked as completed and are never run again. This worked fine with always-online clients, but now that we're offline-first we can't guarantee all the docs have been synced when the migration runs.

Work out how to migrate docs after the upgrade.

garethbowen avatar Mar 03 '16 19:03 garethbowen

A couple of ideas: we could either run migrations on the client side when the ddoc is updated, or on the server side during replication.

garethbowen avatar Mar 03 '16 19:03 garethbowen

Running server-side on replication: we'll never know when a client is up to date, which means we'd have to run each migration on each replicated document for all eternity. This would also require a rewrite of how we do migrations to deal with one doc at a time.

Running client-side on update: this might slow down the handset on which the migration is running. We'd have to execute migrations before setting up db sync, which means using the changes feed since the last sync to determine which docs have been changed locally, then migrating them in the bootstrap before launching the app. This also requires a rewrite of how we do migrations to deal with one doc at a time, and it means working out how to distribute migrations to the client.

I think of the two options, running on the client side is the lesser of two evils, because the migrations will run once on every device and then never again.

garethbowen avatar Mar 17 '16 02:03 garethbowen

Dave pointed out the pouchdb-migrate plugin which could be useful. One of the options they allow is `since`, which means we may be able to...

  1. replicate from server
  2. run any migrations since last synced sequence
  3. mark migration as complete
  4. replicate to server
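Steps 2 and 3 could be sketched roughly like this (hypothetical shape; pouchdb-migrate's actual API may differ), with migrations reduced to per-doc functions:

```javascript
// Hypothetical per-doc migration runner. Each migration takes a doc and
// returns a modified copy, or null if the doc is not relevant to it.
function migrateChangedDocs(changedDocs, migrations) {
  const toSave = [];
  for (const original of changedDocs) {
    let doc = original;
    let modified = false;
    for (const migrate of migrations) {
      const result = migrate(doc);
      if (result) {
        doc = result;
        modified = true;
      }
    }
    if (modified) {
      toSave.push(doc);
    }
  }
  return toSave; // docs to bulk-save before replicating up to the server
}
```

`changedDocs` would come from the local changes feed queried with `since` set to the last synced sequence; step 3 would then write a local "migration complete" marker before replication is unblocked.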

garethbowen avatar Mar 22 '16 03:03 garethbowen

We can use this pouchdb plugin to get all the unsynced docs to know which we need to migrate before push.

garethbowen avatar Mar 22 '16 22:03 garethbowen

Ok... I think we can put this off until we need it, but here's the plan:

  1. Create a new style of migration which can run on one doc at a time, ignoring docs that aren't relevant.
  2. On the server, get all docs and run the migration on each one, saving those that have been modified.
  3. Store the migrations on the ddoc somewhere.
  4. On the client fetch the ddoc in the background and check for any new migrations. Block replication to the remote db until migrations are finished.
  5. Use the unsynced docs plugin (or similar) to find the local modifications since last sync.
  6. Use the pouchdb-migrate plugin (or similar) to migrate all the docs found above.
  7. Mark the migration as complete and unblock replication to the remote db.
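Step 1 might look something like this (names and doc shapes are illustrative, not an agreed API):

```javascript
// Illustrative shape for a per-doc migration (step 1). The same object could
// be run server-side over all docs and client-side over unsynced docs.
const extractPersonContact = {
  name: 'extract-person-contact',
  created: '2017-03-23',
  // Return false for docs this migration doesn't care about.
  filter: doc => doc.type === 'clinic' && !!doc.contact && !doc.contact._id,
  // Return the migrated doc; performs no IO itself.
  run: doc => Object.assign({}, doc, {
    contact: Object.assign({}, doc.contact, { _id: doc._id + ':contact' })
  })
};

// The runner (server- or client-side) owns finding and saving docs.
function applyMigration(migration, docs) {
  return docs.filter(migration.filter).map(migration.run);
}
```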

garethbowen avatar Mar 23 '16 01:03 garethbowen

There are cons to client-side migration as well unfortunately.

  • If migrations only run on the server side, it's possible to have new app code replicated to you without also having all of the migrated documents replicated to you. It's also very possible that the migration will run while there are documents that only exist on the client side, and are thus missed by that migration run.
  • If migrations only run on the client side, you have to wait until all clients have run the migration and synced before the server-side representation is accurate, which could cause problems for any server-side code that relies on the migration (sentinel, something analytics expects, etc.).
  • BUT if they run on both we'll get conflicts up the wazoo (technical term) and will need a strategy to resolve those conflicts (which, arguably, we should have anyway).

This leads me to think about two changes:

  • Make sure that all our code is, at least for a time, backwards compatible with non-migrated data. This could be anything from supporting both forms (eg if you move data, check in both places) to just recognising that something is not migrated and telling the user they need to sync.
  • Introduce a style of migration that works like sentinel transitions: on introduction to a server they start at change 0, and are run forever over all changes. If the migration you're writing can be more efficient in bulk (ie checking a view, bulk-changing documents), we could potentially support both: a firstRun(db) fn that does all that, plus filter(db, change) and migrate(db, change) that are run over every change.

Pros:

  • I believe it fixes all the concerns I'm aware of
  • Doesn't require smart phones to run potentially heavy migrations
  • Potentially makes migration code simpler (ie there is a framework where you're given a doc and you change it or you don't; finding that doc, saving it etc is managed for you)

Cons:

  • Potentially makes migration code more complex (ie maybe you write two, maybe said framework constrains you)
  • Definitely makes application code more complex
  • The new "continual migration handler" will become slower and slower, because for every change you now have to run a (hopefully IO-free) filter for every migration to see if it needs to run.
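For concreteness, here's a minimal sketch of the transition-style interface described above, including the per-change filter cost mentioned in the last con (all names hypothetical):

```javascript
// Hypothetical transition-style migration: filter() gates cheaply on each
// change, migrate() does the work, and an optional firstRun() handles the
// initial bulk pass more efficiently (e.g. via a view).
const shortcodeMigration = {
  // Cheap, IO-free check run against every change forever.
  filter: doc => doc.type === 'person' && !doc.patient_id,
  // Per-change migration; returns the updated doc.
  migrate: doc => Object.assign({}, doc, { patient_id: 'p-' + doc._id }),
  // Optional bulk version used once when the migration is introduced.
  firstRun: docs => docs
    .filter(d => d.type === 'person' && !d.patient_id)
    .map(d => Object.assign({}, d, { patient_id: 'p-' + d._id }))
};

// The "continual migration handler": every registered migration's filter
// runs over each change, which is the per-change cost noted above.
function handleChange(migrations, doc) {
  let out = doc;
  for (const m of migrations) {
    if (m.filter(out)) {
      out = m.migrate(out);
    }
  }
  return out;
}
```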

Thoughts @garethbowen ? Sorry for brain-dumping on your ticket…

SCdF avatar Mar 22 '17 10:03 SCdF

@SCdF No worries - two brains dumping are better than one.

I don't much like your proposed solution because as you say, it needs to run on every doc forever, even if there's no way it could possibly make a difference (eg: 0.4 migrations running for projects that started on 2.11).

Let's voice chat about this some time - I feel like we're close to a solution.

garethbowen avatar Mar 23 '17 02:03 garethbowen

OK so @garethbowen and I had a chatty chat.

Here is the current apogee of our collective thought, intermingled with my post-collective-thought confusion:

  • Client-side migrations probably do not help
  • We should have a separate metadata document for each document. This document is server-side only, and stores the schema version of the document (as it is server side only it could store other things in the future that we have no interest in replicating client-side, for example initial replication date)
  • There are a collection of migrations. They are versioned in some numeric incremental way (date, number, whatever), and they are split by document type (Gareth, does that matter anymore if they are server side? Would the optimisation be gating by type vs. gating by a filter fn per migration, and do we care for simplicity, performance?)
  • (Gareth, are there situations where we'd want to support blocking and non-blocking migrations? For example, if we're going through each data_record and fixing incorrect dates we wouldn't want to block people replicating / using the app while we do this right? Because all that is happening is the data is going from wrong (like now) to eventually right)
  • When api boots with new migrations (ie app upgrade), api blocks on these migrations running up to the current seq (Gareth, somehow api blocks on sentinel?… and I guess we'd potentially want to write two versions of the migration: one that uses bulk docs / views for speed, and one that deals with individual docs for changes. For most migrations that should be pretty natural anyway: you expose a fn that takes a doc and modifies it in place but doesn't write anything, which is used directly for both changes and the initial bulk wrapper)
  • (Gareth, we might want api to actually show some maintenance page for web users, and return a certain HTTP code for replicators, so we know we are in this state?)
  • Once everything is migrated api starts completely, and so clients can now replicate again
  • Once a client replicates a new DDOC down, we have a couple of options we are flicking back and forth between for server->client replication:
    • We block any further replication down, or we change API so you can only replicate documents down that have the same or a lower schema version than what your current DDOC supports
    • We just let it replicate down, whatever
  • Regardless, client-server replication continues
  • When sentinel gets a new doc, it checks its document type and its schema version, and it can use that to determine what migrations need to be run on it, if any.
    • If the document gets changed by any of these migrations, the document is saved (and so replicated to the client)
    • Regardless, the metadata document is updated to the latest version
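A minimal sketch of the metadata-document idea and the sentinel-side version check (doc shapes and names are hypothetical):

```javascript
// Hypothetical server-side metadata doc, kept in a separate, non-replicated
// db and keyed by the main doc's id:
//   { _id: 'doc-metadata:abc', doc_id: 'abc', schema_version: 2 }

// Sentinel change handling: given the doc's metadata and the ordered list of
// migrations, run only those newer than the doc's recorded schema version,
// then bump the metadata doc to the latest version.
function migrationsToRun(metadataDoc, migrations) {
  const version = metadataDoc ? metadataDoc.schema_version : 0;
  return migrations.filter(m => m.version > version);
}
```

A brand-new doc from a client has no metadata doc, so every migration runs on it, which matches the newDoc case in the walkthrough further down.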

Things that are still gross / unknown:

  • If there are no client-side migrations, it is impossible (without implementing transactions in some gross, convoluted way) to remove the edge cases where the code being run is incompatible with the data, short of maintaining a large amount of both forwards and backwards compatibility in our structures, which is gross and complex
  • If there are client-side migrations there is no nice way to avoid lots of gross conflicts, and we currently have nothing in place to deal with conflicts

Gareth: did I get that right?

SCdF avatar Mar 27 '17 11:03 SCdF

@SCdF

does [documents are split by document type] matter anymore if they are server side? Would the optimisation be gating by type vs. gating by a filter fn per migration, and we care for simplicity, performance?

It doesn't matter much in practice but I think it matters conceptually. If we want to move to a world of data schemas then I think it makes sense to have versioned schemas per type. If we want to stay with schema-less data then we should stick with a filter function. It probably falls outside the scope of this work - we can introduce schemas later.

are there situations where we'd want to support blocking and non-blocking migrations?

Yeah sure. I think @estellecomment was looking at this at one point. I think the majority of our migrations could be non-blocking. However, given they're ordered, if the last migration is blocking then we have to block on all migrations until that last one has executed.
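In other words, with ordered migrations the blocking set is everything up to and including the last blocking migration; a sketch:

```javascript
// Split an ordered list of migrations into the prefix api must block on
// (everything up to and including the last blocking migration) and the
// suffix that can run in the background.
function splitByBlocking(migrations) {
  let lastBlocking = -1;
  migrations.forEach((m, i) => {
    if (m.blocking) {
      lastBlocking = i;
    }
  });
  return {
    blocking: migrations.slice(0, lastBlocking + 1),
    background: migrations.slice(lastBlocking + 1)
  };
}
```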

somehow api blocks on sentinel?

I'm not worried about where the code lives yet - the migrations might stay in api? This will be more clear once we've decided what we're doing...

I guess we'd potentially want to write two versions of the migration, one that uses bulk docs / views for speed, and one that deals with individual docs for changes.

I'd really rather not (complexity). However, if you have the meta doc store the schema version (or whatever) and the doc type, so you can deterministically work out whether a given migration needs to run on a doc, then you can have a view which returns all docs which should be run through that migration. Then you can query the view for the first 100, run the batch through the migration map, bulk save, and query the view again. The query-the-view code could be written once in the migration runner, so all the migration writer has to do is write the mapping function.
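A rough sketch of that runner (view and function names are made up, and real code would use async CouchDB calls rather than the synchronous stubs here):

```javascript
// Hypothetical CouchDB view emitting docs whose recorded schema version is
// behind the migration's version (assuming the version lives on a meta doc):
//   function (doc) {
//     if (doc.type === 'doc-metadata' && doc.schema_version < 2) {
//       emit(doc.doc_id);
//     }
//   }

// Generic batch runner, written once in the migration framework; the
// migration writer only supplies map(). Saving a migrated doc (and bumping
// its meta doc) removes it from the view, so the loop drains the backlog.
function runInBatches(queryView, bulkSave, map, batchSize = 100) {
  let batch;
  do {
    batch = queryView(batchSize);   // first N docs still needing migration
    const migrated = batch.map(map);
    if (migrated.length) {
      bulkSave(migrated);
    }
  } while (batch.length === batchSize);
}
```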

we might want api to actually show some maintenance page for web users, and return a certain HTTP code for replicators, so we know we are in this state?

100% https://github.com/medic/medic-webapp/issues/2967

garethbowen avatar Mar 27 '17 20:03 garethbowen

Re: non-blocking migrations, no I don't remember doing anything about that...

I get the point of schema versions. Cool. Not quite sure it all works out when versions are server-only though, but maybe I'm missing some bits.

Assume there’s code v1, and schema v1, and we push an upgrade to code v2 and schema v2.

  • Server gets new code v2. Api blocks until migrations are run. Schema is now v2 for all docs. Api starts. Meanwhile, offline, a client on v1 edits an existing doc (editedDoc) and creates a new doc (newDoc).

  • Client gets online, and gets code v2. Gets changes for migrated docs. Conflict on editedDoc (what happens??). Client pushes its changes up to server.

  • Server gets changes for editedDoc and newDoc.

    • newDoc has no schema version. Sentinel runs all migrations on it, then gives it a v2 schema version. All good.
    • editedDoc
      • has a conflict, not sure what the change is?
      • meta doc says v2. Sentinel doesn’t know which version of code the client had when it made the change. Was the client already on v2? Was it still on v1? Was it on v0? Which migrations should sentinel run?
  • Changes are synced back down to client. Client gets editedDoc v2 and (if migration was necessary) newDoc v2.

estellecomment avatar Mar 28 '17 04:03 estellecomment

re: conflicts in general

Our handling of conflicts remains the same before and after this change, and your scenario above plays out identically (in terms of conflicts) with the current migration scheme. We currently have no conflict resolution, so CouchDB will just pick one revision based on which has the highest hash (or lowest, I forget, it's not relevant).

Conflicts are not auto-detected at any point in Couch / our app. So Sentinel will not be aware that there is a conflict, and neither will the client. You have to explicitly detect them and manage them. Using your example above, you will either randomly keep or randomly lose the client's changes, and never notice.

I think dealing with this is really important. However, it's just as important now as it would be after this change. If anything, you could attempt to argue it's slightly less important after this change, because having a metadata document that is only on the server side (where we could put schema version, transition history etc) may mean we can reduce the frequency that sentinel writes to client-facing documents.

SCdF avatar Mar 28 '17 12:03 SCdF

@alxndrsn, please triage before the end of this sprint.

nice-snek avatar Jul 04 '17 07:07 nice-snek

Still needs doing.

alxndrsn avatar Jul 04 '17 18:07 alxndrsn

Deprioritised out of 3.0.0.

garethbowen avatar May 14 '18 21:05 garethbowen

NB: in general I'd say we'd simply avoid doing this by not having massive migrations. However, it is likely that the flexible hierarchy work (#3639) will force us to migrate contacts, which in turn would be much easier to do if we solve this ticket.

SCdF avatar Oct 01 '18 22:10 SCdF

(Up for discussion, but IMO unless @garethbowen is super duper sure we could leave this until we're sure that flexible hierarchy requires it)

SCdF avatar Oct 01 '18 22:10 SCdF

As you say, this isn't required until we need to do a migration on data that can be changed on the phone (reports, places, or people). Flexible hierarchy is one feature that would really benefit from being able to efficiently and reliably migrate contact data so that all places have the same type. We could make it work in a backwards-compatible way, so it's not technically required, but it does make the code much simpler and less error prone.

We have other examples of where our data structure is causing code complexity which need to be resolved eventually (messages vs reports, inlined contacts, etc) and these would also require efficient and reliable migrations. I think the best approach would be to bundle all these migrations together and solve this issue as part of a 4.0 release, which will be some time away yet, meaning we don't need to solve this right now.

@alxndrsn @SCdF What do you think?

garethbowen avatar Oct 08 '18 20:10 garethbowen

@garethbowen I agree with this. FWIW I think this is a complex enough problem that we don't have time to solve it before 2019 anyhow, given other priorities.

SCdF avatar Oct 09 '18 07:10 SCdF

FWIW I think this is a complex enough problem that we don't have time to solve it before 2019 anyhow, given other priorities.

:+1:

alxndrsn avatar Oct 09 '18 10:10 alxndrsn

Removing from 3.2.0 as discussed.

garethbowen avatar Oct 10 '18 03:10 garethbowen