poretools
Fast5 zip
This patch allows a .zip file of fast5 files to be processed efficiently, in the same way that a tarball of fast5 files is processed.
I have found the .zip to have several key advantages over a tarball of fast5 files:
- the .zip saves about 30% of the space in my data sets
- the .zip file is indexed, so a single file can be extracted efficiently, with no need to extract the entire run or parse the whole archive
- the .zip file can be read in parallel (using Python's multiprocessing.Pool)
- the .zip stores a CRC-32 checksum for each file, so data corruption can be detected
- extracting zip files on a network drive (GPFS) is about 5-10x faster than working with each file individually, so long as the extraction is saved locally (/tmp) or to memory (Python's ZipFile.read() or /dev/shm)
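
The indexed, in-memory, parallel access described in the list above can be sketched roughly as follows. This is a minimal illustration using only Python's standard `zipfile` module, not the code in this patch: the function names (`list_fast5`, `read_member`, `member_sizes`) are hypothetical, and the example uses `multiprocessing.pool.ThreadPool` so that it runs without a `__main__` guard. `multiprocessing.Pool` has the same `map` interface and could be substituted.

```python
import zipfile
from multiprocessing.pool import ThreadPool


def list_fast5(zip_path):
    """Return the names of all .fast5 members, read from the zip's index."""
    with zipfile.ZipFile(zip_path) as zf:
        return [name for name in zf.namelist() if name.endswith(".fast5")]


def read_member(job):
    """Read one member into memory and return (name, size in bytes).

    ZipFile.read() checks the member's CRC-32 and raises
    zipfile.BadZipFile if the data is corrupted.
    """
    zip_path, name = job
    # Each worker opens its own handle, so no file object is shared.
    with zipfile.ZipFile(zip_path) as zf:
        data = zf.read(name)
    return name, len(data)


def member_sizes(zip_path, workers=4):
    """Read every .fast5 member in parallel; return {name: size}."""
    jobs = [(zip_path, name) for name in list_fast5(zip_path)]
    with ThreadPool(workers) as pool:
        return dict(pool.map(read_member, jobs))
```

In a real pipeline, `read_member` would hand the bytes to an HDF5 reader or write them to /tmp or /dev/shm instead of returning only the size. That step is left out here because it depends on how poretools opens fast5 files.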
Please consider incorporating this into poretools, as I believe that zips are a workable way forward for managing the millions of fast5 files that are presently being generated per run.