computeMatrix runs substantially slower than expected
Hi there, thank you for your work on deeptools!
I have multiple colleagues who use the computeMatrix (reference-point) and plotHeatmap tools extensively in their work.
Yet I've noticed that computeMatrix takes substantially longer than I would expect given what the program is doing. For a typical bigWig file (~1 GB) and a regions BED file containing ~100,000 RefSeq transcripts, computeMatrix takes well over three hours with a binSize of 1. Even with the default bin size, it can take nearly an hour. Even single-threaded, I would expect a program to query ~100,000 regions from a bigWig in minutes, not hours.
As a test, I created a toy script that implements basic computeMatrix reference-point functionality. In my benchmarks below, it runs 100-fold faster than the equivalent computeMatrix command while matching the output. I've tested this on a variety of real and simulated bigWigs. For instance, here's how the runtime of my script compares to computeMatrix with a randomly generated bigWig:
Testing deeptools computeMatrix reference-point against Will's ~200 line script
+ date
Mon Mar 3 08:05:56 PM EST 2025
+ : 'Testing Will'\''s script'
+ ./compute_matrix_faster.py input/random.bigWig input/random_genes.bed
computing matrix
real 0m1.841s
user 0m2.906s
sys 0m0.138s
+ : 'Testing deeptools computeMatrix'
+ deeptools --version
deeptools 3.5.6
++ basename output/deeptools.mat.gz .mat.gz
+ OUT_MAT_BS1=output/deeptools.bs1.mat.gz
+ computeMatrix reference-point -bs 1 -S input/random.bigWig -o output/deeptools.bs1.mat.gz -R input/random_genes.bed
using local install of computeMatrix
real 2m4.223s
user 2m3.890s
sys 0m0.240s
+ set +x
Confirming that matrices are the same...
SUCCESS! :)
The test files and my script can be found in this repo: https://github.com/wsowens/computeMatrixBench
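For context, the core idea is just one bulk pyBigWig query per region. Here's a minimal sketch of that approach, not the actual script: `compute_matrix`, `UPSTREAM`, and `DOWNSTREAM` are illustrative names, and strand handling, chromosome-bounds checks, and the real output format are all omitted.

```python
#!/usr/bin/env python3
"""Simplified sketch of a fast reference-point matrix computation."""
import sys

import numpy as np
import pyBigWig

UPSTREAM = DOWNSTREAM = 500  # window around the reference point, as in -b/-a


def compute_matrix(bigwig_path, bed_path):
    bw = pyBigWig.open(bigwig_path)
    rows = []
    with open(bed_path) as bed:
        for line in bed:
            chrom, start = line.split("\t")[:2]
            ref = int(start)  # reference point: region start (e.g. the TSS)
            # One bulk query per region (binSize 1 means one value per base),
            # rather than one call per bin. Assumes the window stays within
            # chromosome bounds.
            vals = bw.values(chrom, ref - UPSTREAM, ref + DOWNSTREAM)
            rows.append(np.asarray(vals, dtype=np.float64))
    bw.close()
    return np.vstack(rows)


if __name__ == "__main__":
    matrix = compute_matrix(sys.argv[1], sys.argv[2])
    np.savetxt(sys.stdout, matrix, delimiter="\t")
```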
I realize this doesn't implement all of the functionality of computeMatrix, but the discrepancy in performance is pretty surprising to me. Given the popularity of deeptools, it feels worthwhile to try to optimize this particular command a bit more, and I am happy to help in any way possible.
Do you have any idea why computeMatrix runs so much more slowly than the script I've shared, and do you have any thoughts on how to improve it?
To further the discussion, I profiled computeMatrix with cProfile on the aforementioned random bigWig. Here are some of the top functions by cumulative time:
ncalls tottime percall cumtime percall filename:lineno(function)
253/1 0.001 0.000 260.085 260.085 {built-in method builtins.exec}
1 0.003 0.003 260.085 260.085 computeMatrix:1(<module>)
1 0.001 0.001 259.771 259.771 computeMatrix.py:472(main)
1 0.095 0.095 256.368 256.368 heatmapper.py:212(computeMatrix)
1 0.046 0.046 256.233 256.233 mapReduce.py:8(mapReduce)
1146 0.052 0.000 256.045 0.223 heatmapper.py:173(compute_sub_matrix_wrapper)
1146 0.080 0.000 255.993 0.223 heatmapper.py:406(compute_sub_matrix_worker)
2000 0.196 0.000 255.503 0.128 heatmapper.py:821(coverage_from_big_wig)
2000 10.243 0.005 254.518 0.127 heatmapper.py:753(coverage_from_array)
4000000 7.619 0.000 241.966 0.000 heatmapper.py:903(my_average)
4000002 11.346 0.000 119.360 0.000 core.py:2397(masked_invalid)
4000000 5.391 0.000 114.101 0.000 core.py:7117(__call__)
4000001 12.519 0.000 106.193 0.000 core.py:5416(mean)
4000002 8.045 0.000 99.743 0.000 core.py:1882(masked_where)
4000001 7.283 0.000 66.135 0.000 core.py:5238(sum)
12007631/12007629 5.584 0.000 46.267 0.000 {method 'view' of 'numpy.ndarray' objects}
4002011 7.321 0.000 40.683 0.000 core.py:3050(__array_finalize__)
20010398 39.376 0.000 39.376 0.000 {method 'reduce' of 'numpy.ufunc' objects}
4000002 8.580 0.000 34.311 0.000 core.py:3866(filled)
4004016 15.130 0.000 25.286 0.000 core.py:3024(_update_from)
4000001 8.527 0.000 23.984 0.000 core.py:4633(count)
8000002 3.856 0.000 22.953 0.000 {method 'sum' of 'numpy.ndarray' objects}
8000395 3.975 0.000 21.275 0.000 {method 'any' of 'numpy.ndarray' objects}
8000002 2.269 0.000 19.097 0.000 _methods.py:50(_sum)
8000395 2.399 0.000 17.301 0.000 _methods.py:58(_any)
4000003 5.194 0.000 16.061 0.000 core.py:1604(make_mask)
4000002 1.624 0.000 15.671 0.000 core.py:3603(mask)
4000003 6.012 0.000 14.047 0.000 core.py:3521(__setmask__)
4000003 1.880 0.000 13.363 0.000 core.py:1594(_shrink_mask)
4000001 2.164 0.000 12.884 0.000 core.py:1870(_check_mask_axis)
8000006 2.523 0.000 12.444 0.000 core.py:1374(make_mask_descr)
4000003 2.630 0.000 10.974 0.000 core.py:1695(make_mask_none)
4000001 1.897 0.000 10.720 0.000 {method 'all' of 'numpy.ndarray' objects}
4000002 4.745 0.000 10.303 0.000 core.py:458(_check_fill_value)
8000006 6.968 0.000 9.920 0.000 core.py:1360(_replace_dtype_fields)
4000001 1.217 0.000 8.823 0.000 _methods.py:67(_all)
16004616 7.097 0.000 7.097 0.000 {built-in method numpy.array}
22180022 6.916 0.000 6.916 0.000 core.py:3493(dtype)
44058466 6.138 0.000 6.139 0.000 {built-in method builtins.getattr}
8005617 2.926 0.000 5.750 0.000 core.py:3771(_get_data)
37912373 5.663 0.000 5.666 0.000 {built-in method builtins.isinstance}
One major hotspot seems to be this method in heatmapper.py:
https://github.com/deeptools/deepTools/blob/ea0f68bb4a1587d713dacb3791861308751ef7d0/deeptools/heatmapper.py#L733-L743
This (admittedly old) StackOverflow discussion suggests that numpy masked arrays are implemented in pure Python rather than C: https://stackoverflow.com/questions/5760668/python-numpy-masked-arrays-are-very-slow
If true, this would explain why ~90% of the runtime is spent in my_average. Is the use of masked arrays strictly necessary here, or is it possible to avoid them?
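If it would help, here's a tiny illustration of the kind of drop-in replacement I have in mind, assuming the mask is only ever used to skip NaN/inf values (which I haven't fully verified against the codebase); `x`, `finite`, and the variable names are just for illustration:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0, np.inf])

# What my_average effectively does today per the profile:
# build a masked array on every call, then reduce it.
masked_mean = np.ma.masked_invalid(x).mean()  # -> 2.0

# Plain-ndarray equivalent with no masked-array machinery.
finite = x[np.isfinite(x)]
plain_mean = finite.mean() if finite.size else np.nan  # -> 2.0

assert masked_mean == plain_mean
```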
Thanks for the detailed report. We're aware that computeMatrix (and other tools) scales rather poorly with the number of regions (or the number of samples, for that matter). We're working on a replacement that alleviates this and other issues, though it's still at an early stage. If you feel brave enough, feel free to give the 4.0.0 branch a test, but be aware that it is still prone to bugs and changes.
Hi @WardDeb, thanks for getting back to me!
I played around a bit with the 4.0.0 branch, and was very impressed. Here's the same benchmark again:
Testing deeptools computeMatrix reference-point against Will's ~200 line script
+ date
Tue Mar 4 11:04:19 PM EST 2025
+ : 'Testing Will'\''s script'
+ ./compute_matrix_faster.py input/random.bigWig input/random12.bed
computing matrix
real 0m1.774s
user 0m2.789s
sys 0m0.135s
+ : 'Testing deeptools computeMatrix'
+ deeptools --version
deeptools 4.0.0
++ basename output/deeptools.mat.gz .mat.gz
+ OUT_MAT_BS1=output/deeptools.bs1.mat.gz
+ computeMatrix reference-point -bs 1 -S input/random.bigWig -o output/deeptools.bs1.mat.gz -R input/random12.bed
real 0m0.917s
user 0m0.815s
sys 0m0.104s
+ set +x
Confirming that matrices are the same...
Files - and output/test_matrix.tsv differ
Fail :( Matrix values differ.
I did find some odd discrepancies when I ran the same test case on the 4.0.0 branch, but the results largely seem similar. The performance absolutely rocks; huge props for adding Rust to this project. I'm excited to see 4.0.0 get a stable release!
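In case it helps narrow down those discrepancies, here's a sketch of how the two matrices could be diffed numerically. The `load_matrix` helper and the second output path are illustrative, and I'm assuming the .mat.gz layout as I understand it: an '@'-prefixed JSON header line, then six BED-style columns per row before the per-bin values.

```python
import gzip

import numpy as np


def load_matrix(path, meta_cols=6):
    """Load per-bin values from a computeMatrix .mat.gz, skipping the
    '@'-prefixed JSON header line and the leading BED-style columns."""
    rows = []
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("@"):
                continue
            fields = line.rstrip("\n").split("\t")
            rows.append([float(v) for v in fields[meta_cols:]])
    return np.array(rows)


a = load_matrix("output/deeptools.bs1.mat.gz")
b = load_matrix("output/deeptools4.bs1.mat.gz")  # hypothetical 4.0.0 output path
# Assumes both runs emit regions in the same order and shape.
print(np.nanmax(np.abs(a - b)))  # largest absolute per-bin discrepancy
```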