
Switch to HDF5 based storage of intermediate data types.

Open cerebis opened this issue 4 years ago • 0 comments

Currently, data is stored simply by compressing pickled Python class instances.

This approach was chosen over other serialisation methods as a good-enough and quick solution. However, as time passes and the codebase evolves, the dependency of existing serialised instances on a particular class version becomes increasingly problematic. It can prevent users from going back to old data and reanalysing it with a newer version of the software, since the class can no longer be deserialised.
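To illustrate one way this failure shows up, here is a minimal sketch using only the standard library. The module name `mapio_v1` and the class `ContactMap` are hypothetical stand-ins, not bin3C's actual code, and removing the class from the module simulates a later release in which it was renamed or moved.

```python
import pickle
import sys
import types

# Hypothetical module standing in for an old release of the software.
old_release = types.ModuleType("mapio_v1")
sys.modules["mapio_v1"] = old_release


class ContactMap:
    """Hypothetical stand-in for a pickled analysis class."""

    def __init__(self, seq_names, counts):
        self.seq_names = seq_names
        self.counts = counts


# Make the class look as if it were defined at the top level of mapio_v1.
ContactMap.__module__ = "mapio_v1"
ContactMap.__qualname__ = "ContactMap"
old_release.ContactMap = ContactMap

# The pickle stores only a reference to mapio_v1.ContactMap, not its code.
blob = pickle.dumps(ContactMap(["contig_a", "contig_b"], {(0, 1): 5}))

# Simulate a newer release in which the class no longer exists under that name.
del old_release.ContactMap

error = None
try:
    pickle.loads(blob)
except AttributeError as exc:
    # The old data cannot be loaded because the class lookup fails.
    error = exc

print(f"Deserialisation failed: {error}")
```

Changes that keep the class name but alter its attributes can fail more quietly: the object loads, but later code finds fields missing or with the wrong meaning.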

Either we must provide conversions between class versions or, better, avoid the problem entirely.

Therefore, bin3C should switch to a class-agnostic and efficient means of storing intermediate analysis results (contact map, clusterings). Though we could pickle plain datatypes, an obvious candidate is HDF5, although it would introduce a chunk of dependencies of its own. Another alternative is to adopt an existing Hi-C HDF5 format, so long as it does not itself include external class implementation details or extraneous fields not relevant to metagenomics.
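As a rough sketch of what class-agnostic storage could look like, the following uses `h5py` and NumPy to write a sparse contact map in coordinate form, together with a clustering, as plain arrays. The group and dataset names, the `format_version` attribute, and the helper functions `save_results` and `load_results` are all assumptions made for illustration. They are not an existing bin3C or Hi-C layout.

```python
import os
import tempfile

import h5py
import numpy as np


def save_results(path, seq_names, rows, cols, counts, cluster_ids):
    """Write plain arrays only, so no Python class is needed to read them back."""
    with h5py.File(path, "w") as f:
        # Hypothetical version tag, so readers can detect layout changes.
        f.attrs["format_version"] = 1
        f.create_dataset("seq_names", data=np.array(seq_names, dtype="S"))

        # Sparse contact map in coordinate (row, col, count) form.
        cm = f.create_group("contact_map")
        cm.attrs["shape"] = (len(seq_names), len(seq_names))
        cm.create_dataset("row", data=rows, compression="gzip")
        cm.create_dataset("col", data=cols, compression="gzip")
        cm.create_dataset("count", data=counts, compression="gzip")

        # One cluster label per sequence.
        f.create_dataset("clustering/cluster_id", data=cluster_ids)


def load_results(path):
    """Read everything back as built-in types and NumPy arrays."""
    with h5py.File(path, "r") as f:
        return {
            "format_version": int(f.attrs["format_version"]),
            "seq_names": [s.decode() for s in f["seq_names"][:]],
            "row": f["contact_map/row"][:],
            "col": f["contact_map/col"][:],
            "count": f["contact_map/count"][:],
            "cluster_id": f["clustering/cluster_id"][:],
        }


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.h5")
    save_results(
        path,
        seq_names=["contig_a", "contig_b", "contig_c"],
        rows=np.array([0, 0, 1]),
        cols=np.array([1, 2, 2]),
        counts=np.array([5, 2, 7]),
        cluster_ids=np.array([0, 0, 1]),
    )
    loaded = load_results(path)

print(loaded["seq_names"], loaded["count"].tolist())
```

Because the file holds only named arrays and attributes, a later version of the software can restructure its classes freely and still read old results, provided the on-disk layout is kept stable or versioned.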

cerebis, Dec 15 '20 23:12