
Size difference in per read statistics file when running `tombo detect_modifications model_sample_compare` with different `multiprocess-region-size` values

Open drivenbyentropy opened this issue 3 years ago • 2 comments

Hi,

While benchmarking to find the best parameter set for the hardware environment I am running tombo in, I noticed that the per-read statistics files differ noticeably in size depending on the multiprocessing window size (`multiprocess-region-size`). The number after the RERUN prefix denotes the window size; no other parameter was changed.

6.6G Jun  4 01:24 barcode11.model_sample_compare_per_read.RERUN500.tombo.per_read_stats
6.6G Jun  4 01:22 barcode11.model_sample_compare_per_read.RERUN250.tombo.per_read_stats
6.8G Jun  4 01:21 barcode11.model_sample_compare_per_read.RERUN100.tombo.per_read_stats
7.1G Jun  4 03:03 barcode11.model_sample_compare_per_read.RERUN50.tombo.per_read_stats

Could you provide me with a brief explanation as to why this is happening? Does this imply slightly different results depending on the choice of this parameter? Any insight into this would be greatly appreciated.

Thank you in advance.

drivenbyentropy avatar Jun 04 '21 15:06 drivenbyentropy

The per-base (and per-read) results produced are identical regardless of the multiprocessing options selected.

The per-read statistics file is really an HDF5 file. If you are not familiar with HDF5 files, you can think of them as directories of files zipped together into a single file. An HDF5 file contains groups and datasets, which are like subdirectories and files in a zipped file.

Tombo implements multiprocessing by partitioning the genome into chunks, then allotting these chunks to different processes. Each chunk is stored in its own HDF5 group, so the multiprocessing options affect the file's layout.
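
As a rough illustration of the partitioning described above, here is a minimal sketch. It is not Tombo's actual code: `partition_regions` and the chromosome length are hypothetical, chosen only to show that a smaller region size yields more chunks, and therefore more groups in the output file.

```python
def partition_regions(chrom_length, region_size):
    """Split [0, chrom_length) into consecutive half-open (start, end) regions."""
    return [
        (start, min(start + region_size, chrom_length))
        for start in range(0, chrom_length, region_size)
    ]


chrom_length = 10_000  # hypothetical chromosome length in bases
for region_size in (500, 250, 100, 50):
    regions = partition_regions(chrom_length, region_size)
    # Each region would map to one group in the output file under this sketch.
    print(region_size, len(regions))
```

The same bases are covered in every case; only the number of pieces changes.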

Maybe the chunking causes wasted space or compression inefficiencies that affect the file size (just a guess).
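
One way to see how such an effect could arise, as a sketch only: compress the same synthetic payload in independent pieces of different sizes using Python's `zlib`. Whether Tombo's per-read statistics are stored with this kind of compression is an assumption here, and the payload below is random made-up data, not real statistics.

```python
import random
import zlib


def compressed_size(data, chunk_size):
    """Total bytes after compressing each chunk_size slice independently."""
    return sum(
        len(zlib.compress(data[i:i + chunk_size]))
        for i in range(0, len(data), chunk_size)
    )


random.seed(0)
# Synthetic payload drawn from a four-letter alphabet (assumption, for illustration).
data = bytes(random.choice(b"ACGT") for _ in range(200_000))

for chunk_size in (50_000, 5_000, 500, 50):
    # Every compressed piece carries its own header and checksum,
    # so many small pieces cost more in total than a few large ones.
    print(chunk_size, compressed_size(data, chunk_size))
```

The decompressed content is identical in every case; only the fixed per-piece overhead differs, which would be consistent with smaller region sizes producing larger files.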

SycamoreLeaf avatar Jun 04 '21 17:06 SycamoreLeaf

That's what I suspected, thank you for confirming.

drivenbyentropy avatar Jun 04 '21 17:06 drivenbyentropy