datajoint-python
Provide a backup and restore utility for DataJoint pipelines
Feature Request
Problem
Currently, neither users nor administrators have a simple, intuitive means of performing restricted backup and restore operations. Workarounds typically place a large burden on the user to parse the pipeline, or require server-side support. A possible solution could be to define methods such as:
dj.backup(backup_root_path, table)
This would define a working directory for the backup and a table as the 'anchor' for the backup. The table may carry a restriction condition that restricts the records in the table and its descendants. Starting from these records, the method would determine all child and parent dependencies (along with any forks resulting from Master-Part relationships). Once all records in the lineage associated with the table have been identified, they would be read and compressed into an appropriate file format, e.g. HDF5, NPZ, or Parquet. Additionally, a restore.py script could be written that specifies the DataJoint table classes, with a last step to decompress and ingest the resulting backup.
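A minimal usage sketch, assuming a pipeline with a `Session` table (the `dj.backup` signature and all pipeline names here are hypothetical, not an existing API):

```python
import datajoint as dj

from my_pipeline import Session  # hypothetical pipeline module

# The restricted table serves as the 'anchor': all ancestor and descendant
# records tied to the selected sessions would be traversed and exported.
dj.backup(
    '/backups/my_pipeline',                    # backup_root_path
    Session & 'session_date >= "2023-01-01"',  # restricted anchor table
)
```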
dj.restore(backup_root_path, database_prefix, connection=None)
This would define a working directory and a namespace (i.e. database_prefix) under which to 'load' all of the backup data. Specifying connection would set the target server location, defaulting to dj.conn() if set to None.
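A matching restore sketch (again, a hypothetical API, not an existing DataJoint call):

```python
import datajoint as dj

# Load the backup under the `copy_` database prefix on the default
# connection; passing connection=None falls back to dj.conn().
dj.restore(
    '/backups/my_pipeline',    # backup_root_path
    database_prefix='copy_',
    connection=None,
)
```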
Together, these two routines would also provide a mechanism for exporting and publishing data from any given DataJoint pipeline.
Requirements
- Create a compressed representation of a DataJoint pipeline that can be restricted to a particular subset at the origin
- The saved data must be self-describing and accessible with standard tools (see the sketch after this list)
- Load data into a target database server under a specific schema prefix
- Loading must work if the data is already partially loaded, allowing simple synchronization of new data
- Maintain performance comparable to (or better than) 70% of mysqldump's runtime
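For instance, the self-describing requirement could be met by writing one Parquet file per table, readable with standard tools; the directory layout below is purely an assumed convention:

```python
import pandas as pd

# Assumed layout: <backup_root>/<schema>/<table>.parquet
df = pd.read_parquet('/backups/my_pipeline/my_schema/session.parquet')
print(df.head())  # inspect the backed-up records without DataJoint
```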
Justification
- Exposes functionality to the typical user looking to 'copy' a pipeline as a local, workable version
- Gives DataJoint admin-level functionality: a means to automate backups, define disaster recovery processes, etc.
- Provides an additional method for sharing data outside the data pipeline
Alternative Considerations
Current workarounds involve manual routines by the user or server-side support (mysqldump, volume-based backups). Both present significant challenges for the typical user.
Additional Research and Context
Related to (and potentially supersedes) #560.
Added HDF5 as one of the formats to consider.
For dj.backup, a restriction can already be applied to the table object, so the separate restriction argument is removed.
We would need to provide finer control over what is to be included in the backup. We could supply a dj.Diagram object to dj.backup. These objects already support addition, subtraction, difference, and overlap to precisely control the table set.
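Diagram algebra already makes such selections expressible; for example (the pipeline and table names are hypothetical):

```python
import datajoint as dj

from my_pipeline import Session, ScratchResults  # hypothetical tables

# Start two levels downstream of Session, then drop a table that
# should not be exported.
to_backup = (dj.Diagram(Session) + 2) - dj.Diagram(ScratchResults)
```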
@dimitri-yatsenko Excellent! Thanks for the input.
Regarding supplying the table as a query expression (i.e. already restricted): would it be clear which table the records/result specifically belong to? The idea was essentially to trace the relations of a subset of records from a specific table.
Regarding the diagram object, would we be able to indicate with this object that not all records from a table are to be exported/backed up? E.g. only records related to today's experimental session (including ancestors/descendants).
@guzman-raphael A restriction in its most general form must be applied to a specific table, but a restricted table can serve as a restriction on its descendants and ancestors. So I would specify (a) a dj.Diagram and (b) zero or more restricted tables from the diagram to use for restricting the records in their descendants.
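Under this agreement, a call might look like the following sketch (the signature, parameter names, and pipeline objects are all assumptions):

```python
import datajoint as dj

from my_pipeline import schema, Session  # hypothetical pipeline

dj.backup(
    '/backups/my_pipeline',
    diagram=dj.Diagram(schema),                             # (a) the table set
    restrictions=[Session & 'session_date = "2023-09-01"'],  # (b) restricted tables
)
```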
@dimitri-yatsenko perfect! This will work :)
Merging with #560