deepTools
Enhancement of multiBamSummary performance
multiBamSummary is slow when the number of BAM files increases. (It could potentially be affected by high sequencing depth as well.)
I've just finished reading about Dask, h5py, and other alternatives to NumPy.
At first, I thought that one of them could speed up the step of writing data to disk. According to this benchmark, I was wrong.
The need to replace NumPy is then probably related to other operations (matrix operations, linear algebra, etc.) that should be parallelized.
Of the options evaluated, Dask seems promising, but it would entail rewriting more modules, because some NumPy functionality currently used in deepTools is out of its scope.
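To make the parallelization point concrete, here is a minimal sketch (not deepTools code) of what a NumPy-style reduction looks like when expressed with dask.array. The matrix shape, chunk size, and variable names are made up; the point is only that chunked reductions like this are where Dask could help, while anything outside its NumPy subset would need rewriting.

```python
# Hypothetical sketch: a per-bin read-count matrix (bins x BAM files),
# similar in shape to what multiBamSummary produces, reduced with dask.array
# so the work is split across chunks/cores instead of a single NumPy call.
import numpy as np
import dask.array as da

n_bins, n_samples = 1_000_000, 20                       # made-up sizes
counts = np.random.poisson(10, (n_bins, n_samples)).astype(np.float64)

# Chunk along the bins axis; each chunk can be reduced in parallel.
dcounts = da.from_array(counts, chunks=(250_000, n_samples))

centered = dcounts - dcounts.mean(axis=0)
cov = (centered.T @ centered) / (n_bins - 1)            # sample covariance
std = da.sqrt((centered ** 2).sum(axis=0) / (n_bins - 1))
corr = (cov / (std[:, None] * std[None, :])).compute()  # built lazily, run here
```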
It seems to me that h5py is a drop-in replacement for NumPy: it has every data type except generic objects. Correct me if I'm wrong, but we're not using dtype "O".
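For illustration, a hedged sketch of what that swap could look like; the file name, dataset name, and sizes below are hypothetical, not anything currently in deepTools.

```python
# Hypothetical sketch: an h5py dataset as the on-disk backing store for a
# per-bin count matrix, filled and sliced with the usual NumPy-style indexing.
import numpy as np
import h5py

n_bins, n_samples = 1_000_000, 20

with h5py.File("counts.h5", "w") as fh:
    # Numeric NumPy dtypes map directly to HDF5; dtype "O" (generic Python
    # objects) is the notable exception and is not supported.
    dset = fh.create_dataset(
        "per_bin_counts",
        shape=(n_bins, n_samples),
        dtype="float64",
        chunks=(100_000, n_samples),   # write/read in row chunks
        compression="gzip",
    )
    # Fill the dataset chunk by chunk, e.g. as bins are counted per region.
    for start in range(0, n_bins, 100_000):
        stop = min(start + 100_000, n_bins)
        dset[start:stop, :] = np.random.poisson(10, (stop - start, n_samples))

# Datasets slice like NumPy arrays, so downstream code mostly keeps working.
with h5py.File("counts.h5", "r") as fh:
    column_means = fh["per_bin_counts"][:].mean(axis=0)
```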
So, h5py could be the way to go; I will try that. Even if it speeds things up, it would be safer to have a more thorough test suite, especially over the modules that rely the most on NumPy functions: even if we don't introduce changes in them, they could be affected (a sketch of such an equivalence test follows the list below). Here's a list of the Python modules and the count of NumPy calls (actually, the number of lines that match the np\\. regex...):
❯ rg -c np\\. *.py | awk -F ':' '{print $2 "\t" $1}' | sort -rn
86 heatmapper.py
60 plotProfile.py
47 correlation.py
46 plotFingerprint.py
45 plotHeatmap.py
38 getFragmentAndReadSize.py
22 computeMatrixOperations.py
22 computeGCBias.py
21 countReadsPerBin.py
15 SES_scaleFactor.py
15 plotEnrichment.py
14 correctGCBias.py
12 plotCoverage.py
12 heatmapper_utilities.py
11 getScorePerBigWigBin.py
[truncated]
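As mentioned above, here is a hedged sketch of the kind of equivalence test that could guard these modules. It is not an existing deepTools test; it just compares the current NumPy result against a candidate backend (dask.array here, purely as an example) on the same random input.

```python
# Hypothetical regression test: the NumPy reference result and the candidate
# backend must agree numerically on the same per-bin count matrix.
import numpy as np
import dask.array as da


def test_mean_per_sample_matches_numpy():
    rng = np.random.default_rng(0)
    counts = rng.poisson(10, (10_000, 4)).astype(np.float64)

    expected = counts.mean(axis=0)                                 # current NumPy path
    actual = da.from_array(counts, chunks=(2_500, 4)).mean(axis=0).compute()

    np.testing.assert_allclose(actual, expected, rtol=1e-12)
```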
If larger data files are needed for the tests, that can be sorted out with git-lfs.