Data migration: switch over procedure plan

Open kpsherva opened this issue 1 year ago • 0 comments

To consider

Stop existing system (stop web nodes, stop cron tasks)
Drain queues: tasks, indexing, statistics.
Execute final migration tasks
Run checklist to confirm migration went smoothly
Switch over DNS to new system

Needs rollback procedure as well.
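
One way to pair the switch-over steps above with a rollback is to give each step an explicit undo action and unwind the completed steps in reverse order when one fails. This is only a minimal sketch: the `Step` structure and the `switch_over` function are hypothetical names, and the real actions (stopping nodes, draining queues, switching DNS) would be supplied by the deployment tooling.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One switch-over step plus the action that undoes it (hypothetical structure)."""
    name: str
    run: Callable[[], None]
    undo: Callable[[], None]


def switch_over(steps: List[Step]) -> List[str]:
    """Run steps in order; if one fails, undo the completed ones in reverse order."""
    done: List[Step] = []
    log: List[str] = []
    for step in steps:
        try:
            step.run()
        except Exception as exc:
            log.append(f"failed: {step.name} ({exc})")
            # Roll back only what actually completed, most recent first.
            for finished in reversed(done):
                finished.undo()
                log.append(f"rolled back: {finished.name}")
            return log
        done.append(step)
        log.append(f"done: {step.name}")
    return log
```

With this shape, a failing checklist step would stop the procedure before the DNS switch and restart the old system by running the undo actions of the earlier steps.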

Questions

needs more info: do we mix methods on which migration "streams" are loaded from legacy? - Answer: we care about completeness first
decisions about invenio-sipstore
- currently used in Zenodo for restoring deleted records of a user (since it contains a FK reference to the user)
- the module is not added to RDM and we are not sure whether it is compatible. Do we want to put in the effort, or is it going to be replaced by something else? - Answer: this will be addressed in the next milestone; it is out of scope for now (now = MS5)
what should we do about "default" community while migrating multiple communities attached to a record?
- In Zenodo we had no notion of a default.
- At the moment the first one on the list is set to default
- Should we also investigate the possibility of not having a default community? Answer: we probably don't need a default (to be tested). If there is only one community, we set it as default; otherwise no branding for now
transactions: no added value right now? It is more about auditing; shall we just not migrate? Answer: (about db transactions) we don't migrate; maybe we can remove them from the current system.
how to deal with vocabularies?
- Some record fields depend on vocabularies that must be loaded before the record data
- Vocabularies are not a 1-to-1 mapping between Zenodo and InvenioRDM; sometimes we need aliases (and defaults).
funding field: how to deal with it? Funders are loaded in the database and are referenced by grants.
licenses are included in this question. Answer: the decision has an impact on the legacy API (option 1: aliases; option 2: new ids, with references updated on the fly during migration). It needs to be a valid identifier and we need to keep backwards compatibility in the legacy API; how the mapping is done is still to be discussed. Side comment: it is probably faster to rely on the new vocabularies from the beginning.
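
The "aliases (and defaults)" idea from the vocabulary questions above could look roughly like the lookup below. It is a sketch under assumptions, not the agreed mapping: the function name, the alias table, and all identifiers in it are made up for illustration, and how the mapping is actually done is still open.

```python
from typing import Dict, Optional, Set

# Hypothetical alias table: legacy identifier -> identifier in the new vocabulary.
EXAMPLE_ALIASES: Dict[str, str] = {
    "legacy-license-a": "new-license-a",
}


def resolve_vocabulary_id(
    legacy_id: str,
    known_ids: Set[str],
    aliases: Dict[str, str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Map a legacy vocabulary id to a valid id in the new vocabulary.

    Order: exact match, then alias, then the optional default.
    Returns None when nothing valid is found, so the caller can report it.
    """
    if legacy_id in known_ids:
        return legacy_id
    alias = aliases.get(legacy_id)
    if alias is not None and alias in known_ids:
        return alias
    return default
```

Returning `None` instead of raising keeps the decision about unmapped values (fail the record, or only report it) with the caller.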

revisions:

do we want to migrate all or just e.g. the 2 latest and clean up (not sure about recovery, SLAs, etc.)? Answer: all revisions; we drop draft revisions and keep only record revisions.
do we want to migrate all revisions, or just copy them and, if we need to restore, run the migration on that one revision? The second approach would make the migration faster but might hide failures that are not easy to recover from later on (e.g. if we have missing data).

files:

Record files table metadata
- created, updated, version_id are set from the record's information, as in legacy we don't have this information for files (i.e. record._files). **Answer: to be verified**
- timestamps are missing on Zenodo for files; for published records we can use the record's timestamps, though it might be safest to take them from objectversion
conclusion: keep record uuid
conclusion: we might need to keep the additional intermediate db as a cache because we cannot rely on _files
conclusion: optimistic concurrency counter to be hardcoded at 1
some files were found that are not part of buckets (260k files); so far we just transfer them and figure out the clean-up afterwards
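
The file metadata points above can be summarised in a small sketch. It assumes the record's timestamps are used (the notes flag this as still to be verified, with objectversion as a possibly safer source); the function name and the `record_id` and `key` fields are hypothetical, while `created`, `updated` and `version_id` come from the notes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict


@dataclass
class RecordTimestamps:
    """Timestamps taken from the parent record (assumed source, to be verified)."""
    created: datetime
    updated: datetime


def build_file_row(record_id: str, key: str, record: RecordTimestamps) -> Dict[str, Any]:
    """Build the metadata row for one record file."""
    return {
        "record_id": record_id,     # record UUID is kept as-is
        "key": key,                 # file name within the record (hypothetical field)
        "created": record.created,  # legacy has no per-file timestamps
        "updated": record.updated,
        "version_id": 1,            # optimistic concurrency counter hardcoded at 1
    }
```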

Migration reports

### Tasks
- [ ] https://github.com/zenodo/zenodo-rdm/issues/194
- [ ] https://github.com/zenodo/zenodo-rdm/issues/251
- [ ] https://github.com/zenodo/zenodo-rdm/issues/262
- [ ] https://github.com/zenodo/zenodo-rdm/issues/258
- [ ] https://github.com/zenodo/zenodo-rdm/pull/257

Mar 30 '23 16:03 kpsherva
