
Enhancement of multiBamSummary performance

Open · LeilyR opened this issue 2 years ago · 1 comment

It is slow when the number of BAM files increases. (It could potentially be affected by the high depth of sequencing as well.)

LeilyR avatar May 06 '22 12:05 LeilyR

I've just finished reading about Dask, h5py and other alternatives to NumPy.

At first, I thought that one of them could speed up the step of writing data to disk. According to this benchmark, I was wrong.

The need to replace NumPy is then probably related to other functions (matrix operations, linear algebra, etc.) that should be parallelized.
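As an illustration of the kind of chunked parallelism that Dask would bring, here is a hypothetical sketch using only NumPy and the standard library; `chunked_colsum` and its chunk count are made up for this example and are not deepTools code:

```python
# Hypothetical sketch: split a row-wise reduction (e.g. summarizing
# per-bin counts across samples) into chunks and reduce them in
# parallel, the way a Dask array would under the hood.
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def chunked_colsum(matrix: np.ndarray, n_chunks: int = 4) -> np.ndarray:
    """Sum the columns of `matrix` by reducing row chunks in parallel."""
    chunks = np.array_split(matrix, n_chunks, axis=0)
    # NumPy reductions release the GIL, so threads give real overlap here.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(lambda c: c.sum(axis=0), chunks))
    return np.sum(partials, axis=0)
```

Dask would additionally spill chunks to disk and build a lazy task graph, which is where the rewriting effort would go.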

Of the options evaluated, Dask seems promising, but it would entail rewriting more modules, because some NumPy functionality currently used in deepTools is out of its scope.

It seems to me that h5py is a drop-in replacement for NumPy: it supports every data type except generic objects. Correct me if I'm wrong, but we're not using dtype "O".
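To show what "drop-in" means here, a minimal sketch of reading and writing an h5py dataset with NumPy-style indexing; the dataset name "coverage" and the file path are made up for illustration:

```python
# Minimal h5py sketch: numeric data maps straight onto an HDF5 dataset
# and supports NumPy-style slicing; only dtype "O" (generic Python
# objects) has no native HDF5 equivalent.
import os
import tempfile

import h5py
import numpy as np

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "coverage.h5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("coverage", data=np.arange(10, dtype=np.float64))
    dset[:5] = dset[:5] * 2       # slice assignment, as with an ndarray
    total = float(dset[:].sum())  # slicing reads back a plain ndarray
```

The main practical difference is that the data lives on disk, so memory use stays bounded even for large matrices.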

So, h5py could be the way to go. I will try that. Now, even if it speeds things up, it would be safest to have a more thorough test suite first, especially over the modules that rely most on NumPy functions: even if we don't change them directly, they could be affected. Here's a list of the Python modules and the count of NumPy calls (actually, the number of lines matching the np\\. regex...):

❯ rg -c np\\. *.py  | awk -F ':' '{print $2 "\t" $1}' | sort -rn
86      heatmapper.py
60      plotProfile.py
47      correlation.py
46      plotFingerprint.py
45      plotHeatmap.py
38      getFragmentAndReadSize.py
22      computeMatrixOperations.py
22      computeGCBias.py
21      countReadsPerBin.py
15      SES_scaleFactor.py
15      plotEnrichment.py
14      correctGCBias.py
12      plotCoverage.py
12      heatmapper_utilities.py
11      getScorePerBigWigBin.py
[truncated]
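For environments without ripgrep, roughly the same counting can be done from Python's standard library; `count_np_lines` is a name invented for this sketch:

```python
# Roughly equivalent to the `rg -c np\\. *.py | awk ... | sort -rn`
# pipeline above: count lines matching the regex np\. in each .py file
# of a directory, most NumPy-heavy file first.
import re
from pathlib import Path


def count_np_lines(directory: str = ".") -> list[tuple[int, str]]:
    """Return (matching_line_count, filename) pairs, sorted descending."""
    pattern = re.compile(r"np\.")
    counts = []
    for path in Path(directory).glob("*.py"):
        n = sum(
            1
            for line in path.read_text(errors="replace").splitlines()
            if pattern.search(line)
        )
        if n:  # rg -c only reports files with at least one match
            counts.append((n, path.name))
    return sorted(counts, reverse=True)
```

Like `rg -c`, this counts matching lines rather than total matches, so a line with two `np.` calls still counts once.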

If larger datasets are needed for the tests, that can be sorted out with git-lfs.

adRn-s avatar May 09 '22 15:05 adRn-s