datajoint-python
Provide a backup and restore utility for DataJoint pipelines
Feature Request
Problem
Currently, neither users nor administrators have a simple, intuitive means of performing restricted backup and restore operations. Workarounds typically place a large burden on the user to parse the pipeline, or require server-side support. A possible solution could be to define methods such as:
dj.backup(backup_root_path, table)
This would define a working directory for the backup and a table as the 'anchor' for the backup. The table may carry a restriction condition that restricts the records in the table and its descendants. Starting from these records, the method would determine all child and parent dependencies (along with any forks resulting from Master-Part relationships). Once all records in the lineage associated with the table have been identified, they would be read and compressed into an appropriate file format, e.g. HDF5, NPZ, or Parquet. Additionally, a restore.py script could be written that specifies the DataJoint table classes, with a last step to decompress and ingest the resulting backup.
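A minimal usage sketch, assuming a pipeline with a `Session` table (the `dj.backup` signature and all pipeline names here are hypothetical, not an existing API):

```python
import datajoint as dj

from my_pipeline import Session  # hypothetical pipeline module

# The restricted table serves as the 'anchor': all ancestor and descendant
# records tied to the selected sessions would be traversed and exported.
dj.backup(
    '/backups/my_pipeline',                    # backup_root_path
    Session & 'session_date >= "2023-01-01"',  # restricted anchor table
)
```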
dj.restore(backup_root_path, database_prefix, connection=None)
This would define a working directory and a namespace (i.e. database_prefix) under which to 'load' all of the backup data. Specifying connection would set the target server location, defaulting to dj.conn() if set to None.
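A matching restore sketch (again, a hypothetical API, not an existing DataJoint call):

```python
import datajoint as dj

# Load the backup under the `copy_` database prefix on the default
# connection; passing connection=None falls back to dj.conn().
dj.restore(
    '/backups/my_pipeline',    # backup_root_path
    database_prefix='copy_',
    connection=None,
)
```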
Together, these two routines would also provide a mechanism for exporting and publishing data from any given DataJoint pipeline.
Requirements
- Create a compressed representation of a DataJoint pipeline that can be restricted to a particular subset at the origin
- The saved data must be self-describing and accessible with standard tools (see the sketch after this list)
- Load data into a target database server under a specific schema prefix
- Loading must work if the data is already partially loaded, allowing simple synchronization of new data
- Maintain performance comparable to (or better than) 70% of mysqldump's runtime
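For instance, the self-describing requirement could be met by writing one Parquet file per table, readable with standard tools; the directory layout below is purely an assumed convention:

```python
import pandas as pd

# Assumed layout: <backup_root>/<schema>/<table>.parquet
df = pd.read_parquet('/backups/my_pipeline/my_schema/session.parquet')
print(df.head())  # inspect the backed-up records without DataJoint
```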
Justification
- Exposes functionality to the typical user looking to 'copy' a pipeline as a local, workable version
- Gives DataJoint admin-level functionality: a means to automate backups, define disaster recovery processes, etc.
- Provides an additional method for sharing data outside the data pipeline
Alternative Considerations
Current workarounds involve manual routines by the user or server-side support (mysqldump, volume-based backups). Both present significant challenges for the typical user.
Additional Research and Context
Related to (and potentially supersedes) #560.
Added HDF5 as one of the formats to consider.
For dj.backup, a restriction can already be applied to the table object, so the separate restriction argument is removed.
We would need to provide finer control over what is to be included in the backup. We could supply a dj.Diagram object to dj.backup. These objects already support addition, subtraction, difference, and overlap to precisely control the table set.
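Diagram algebra already makes such selections expressible; for example (the pipeline and table names are hypothetical):

```python
import datajoint as dj

from my_pipeline import Session, ScratchResults  # hypothetical tables

# Start two levels downstream of Session, then drop a table that
# should not be exported.
to_backup = (dj.Diagram(Session) + 2) - dj.Diagram(ScratchResults)
```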
@dimitri-yatsenko Excellent! Thanks for the input.
Regarding supplying the table as a query expression (i.e. already restricted): would it be clear which table the records/result specifically belong to? The idea was essentially to trace the relations of a subset of records from a specific table.
Regarding the diagram object, would we be able to indicate with this object that not all records from a table are to be exported/backed up? E.g. only records related to today's experimental session (including ancestors/descendants).
@guzman-raphael A restriction in its most general form must be applied to a specific table, but a restricted table can serve as a restriction on its descendants and ancestors. So I would specify (a) a dj.Diagram and (b) zero or more restricted tables from the diagram to use for restricting the records in their descendants.
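Under this agreement, a call might look like the following sketch (the signature, parameter names, and pipeline objects are all assumptions):

```python
import datajoint as dj

from my_pipeline import schema, Session  # hypothetical pipeline

dj.backup(
    '/backups/my_pipeline',
    diagram=dj.Diagram(schema),                             # (a) the table set
    restrictions=[Session & 'session_date = "2023-09-01"'],  # (b) restricted tables
)
```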
@dimitri-yatsenko perfect! This will work :)
Merging with #560