computeMatrix runs substantially slower than expected
Hi there, thank you for your work on deeptools!
I have multiple colleagues who use the computeMatrix (reference-point) and plotHeatmap tools extensively in their work.
Yet I've noticed that computeMatrix takes substantially longer than I would expect given what the program is doing. For a typical bigWig file (~1 GB) and a regions BED file containing ~100,000 RefSeq transcripts, computeMatrix takes well over three hours with a binSize of 1. Even with the default bin size, it can take nearly an hour. Even single-threaded, I would expect a program to query ~100,000 regions from a bigWig in minutes, not hours.
As a test, I created a toy script that implements basic computeMatrix reference-point functionality. In my benchmarks below, it runs 100-fold faster than the equivalent computeMatrix command while matching the output. I've tested this on a variety of real and simulated bigWigs. For instance, here's how the runtime of my script compares to computeMatrix with a randomly generated bigWig:
Testing deeptools computeMatrix reference-point against Will's ~200 line script
+ date
Mon Mar 3 08:05:56 PM EST 2025
+ : 'Testing Will'\''s script'
+ ./compute_matrix_faster.py input/random.bigWig input/random_genes.bed
computing matrix
real 0m1.841s
user 0m2.906s
sys 0m0.138s
+ : 'Testing deeptools computeMatrix'
+ deeptools --version
deeptools 3.5.6
++ basename output/deeptools.mat.gz .mat.gz
+ OUT_MAT_BS1=output/deeptools.bs1.mat.gz
+ computeMatrix reference-point -bs 1 -S input/random.bigWig -o output/deeptools.bs1.mat.gz -R input/random_genes.bed
using local install of computeMatrix
real 2m4.223s
user 2m3.890s
sys 0m0.240s
+ set +x
Confirming that matrices are the same...
SUCCESS! :)
The test files and my script can be found in this repo: https://github.com/wsowens/computeMatrixBench
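For context, the core idea is just one bulk pyBigWig query per region. Here's a minimal sketch of that approach, not the actual script: `compute_matrix`, `UPSTREAM`, and `DOWNSTREAM` are illustrative names, and strand handling, chromosome-bounds checks, and the real output format are all omitted.

```python
#!/usr/bin/env python3
"""Simplified sketch of a fast reference-point matrix computation."""
import sys

import numpy as np
import pyBigWig

UPSTREAM = DOWNSTREAM = 500  # window around the reference point, as in -b/-a


def compute_matrix(bigwig_path, bed_path):
    bw = pyBigWig.open(bigwig_path)
    rows = []
    with open(bed_path) as bed:
        for line in bed:
            chrom, start = line.split("\t")[:2]
            ref = int(start)  # reference point: region start (e.g. the TSS)
            # One bulk query per region (binSize 1 means one value per base),
            # rather than one call per bin. Assumes the window stays within
            # chromosome bounds.
            vals = bw.values(chrom, ref - UPSTREAM, ref + DOWNSTREAM)
            rows.append(np.asarray(vals, dtype=np.float64))
    bw.close()
    return np.vstack(rows)


if __name__ == "__main__":
    matrix = compute_matrix(sys.argv[1], sys.argv[2])
    np.savetxt(sys.stdout, matrix, delimiter="\t")
```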
I realize this doesn't implement all of the functionality of computeMatrix, but the discrepancy in performance is pretty surprising to me. Given the popularity of deeptools, it feels worthwhile to try to optimize this particular command a bit more, and I am happy to help in any way possible.
Do you have any idea why computeMatrix runs so much more slowly than the script I've shared, and do you have any thoughts on how to improve it?
To further the discussion, I profiled computeMatrix with cProfile on the aforementioned random bigWig. Here are some of the top functions by cumulative time:
ncalls tottime percall cumtime percall filename:lineno(function)
253/1 0.001 0.000 260.085 260.085 {built-in method builtins.exec}
1 0.003 0.003 260.085 260.085 computeMatrix:1(<module>)
1 0.001 0.001 259.771 259.771 computeMatrix.py:472(main)
1 0.095 0.095 256.368 256.368 heatmapper.py:212(computeMatrix)
1 0.046 0.046 256.233 256.233 mapReduce.py:8(mapReduce)
1146 0.052 0.000 256.045 0.223 heatmapper.py:173(compute_sub_matrix_wrapper)
1146 0.080 0.000 255.993 0.223 heatmapper.py:406(compute_sub_matrix_worker)
2000 0.196 0.000 255.503 0.128 heatmapper.py:821(coverage_from_big_wig)
2000 10.243 0.005 254.518 0.127 heatmapper.py:753(coverage_from_array)
4000000 7.619 0.000 241.966 0.000 heatmapper.py:903(my_average)
4000002 11.346 0.000 119.360 0.000 core.py:2397(masked_invalid)
4000000 5.391 0.000 114.101 0.000 core.py:7117(__call__)
4000001 12.519 0.000 106.193 0.000 core.py:5416(mean)
4000002 8.045 0.000 99.743 0.000 core.py:1882(masked_where)
4000001 7.283 0.000 66.135 0.000 core.py:5238(sum)
12007631/12007629 5.584 0.000 46.267 0.000 {method 'view' of 'numpy.ndarray' objects}
4002011 7.321 0.000 40.683 0.000 core.py:3050(__array_finalize__)
20010398 39.376 0.000 39.376 0.000 {method 'reduce' of 'numpy.ufunc' objects}
4000002 8.580 0.000 34.311 0.000 core.py:3866(filled)
4004016 15.130 0.000 25.286 0.000 core.py:3024(_update_from)
4000001 8.527 0.000 23.984 0.000 core.py:4633(count)
8000002 3.856 0.000 22.953 0.000 {method 'sum' of 'numpy.ndarray' objects}
8000395 3.975 0.000 21.275 0.000 {method 'any' of 'numpy.ndarray' objects}
8000002 2.269 0.000 19.097 0.000 _methods.py:50(_sum)
8000395 2.399 0.000 17.301 0.000 _methods.py:58(_any)
4000003 5.194 0.000 16.061 0.000 core.py:1604(make_mask)
4000002 1.624 0.000 15.671 0.000 core.py:3603(mask)
4000003 6.012 0.000 14.047 0.000 core.py:3521(__setmask__)
4000003 1.880 0.000 13.363 0.000 core.py:1594(_shrink_mask)
4000001 2.164 0.000 12.884 0.000 core.py:1870(_check_mask_axis)
8000006 2.523 0.000 12.444 0.000 core.py:1374(make_mask_descr)
4000003 2.630 0.000 10.974 0.000 core.py:1695(make_mask_none)
4000001 1.897 0.000 10.720 0.000 {method 'all' of 'numpy.ndarray' objects}
4000002 4.745 0.000 10.303 0.000 core.py:458(_check_fill_value)
8000006 6.968 0.000 9.920 0.000 core.py:1360(_replace_dtype_fields)
4000001 1.217 0.000 8.823 0.000 _methods.py:67(_all)
16004616 7.097 0.000 7.097 0.000 {built-in method numpy.array}
22180022 6.916 0.000 6.916 0.000 core.py:3493(dtype)
44058466 6.138 0.000 6.139 0.000 {built-in method builtins.getattr}
8005617 2.926 0.000 5.750 0.000 core.py:3771(_get_data)
37912373 5.663 0.000 5.666 0.000 {built-in method builtins.isinstance}
One major hotspot seems to be this method in heatmapper.py:
https://github.com/deeptools/deepTools/blob/ea0f68bb4a1587d713dacb3791861308751ef7d0/deeptools/heatmapper.py#L733-L743
This (admittedly old) StackOverflow discussion suggests that numpy masked arrays are implemented in pure Python rather than C: https://stackoverflow.com/questions/5760668/python-numpy-masked-arrays-are-very-slow
If true, this would explain why ~90% of the runtime is spent in my_average. Is the use of masked arrays strictly necessary here, or is it possible to avoid them?
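If it would help, here's a tiny illustration of the kind of drop-in replacement I have in mind, assuming the mask is only ever used to skip NaN/inf values (which I haven't fully verified against the codebase); `x`, `finite`, and the variable names are just for illustration:

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0, np.inf])

# What my_average effectively does today per the profile:
# build a masked array on every call, then reduce it.
masked_mean = np.ma.masked_invalid(x).mean()  # -> 2.0

# Plain-ndarray equivalent with no masked-array machinery.
finite = x[np.isfinite(x)]
plain_mean = finite.mean() if finite.size else np.nan  # -> 2.0

assert masked_mean == plain_mean
```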
Thanks for the detailed report. We're aware that computeMatrix (and other tools) scales rather poorly with the number of regions (or the number of samples, for that matter). We're working on a replacement that alleviates this and other issues, though it's still at an early stage. If you feel brave enough, feel free to give the 4.0.0 branch a test, but be aware that it is still prone to bugs and changes.
Hi @WardDeb, thanks for getting back to me!
I played around a bit with the 4.0.0 branch, and was very impressed. Here's the same benchmark again:
Testing deeptools computeMatrix reference-point against Will's ~200 line script
+ date
Tue Mar 4 11:04:19 PM EST 2025
+ : 'Testing Will'\''s script'
+ ./compute_matrix_faster.py input/random.bigWig input/random12.bed
computing matrix
real 0m1.774s
user 0m2.789s
sys 0m0.135s
+ : 'Testing deeptools computeMatrix'
+ deeptools --version
deeptools 4.0.0
++ basename output/deeptools.mat.gz .mat.gz
+ OUT_MAT_BS1=output/deeptools.bs1.mat.gz
+ computeMatrix reference-point -bs 1 -S input/random.bigWig -o output/deeptools.bs1.mat.gz -R input/random12.bed
real 0m0.917s
user 0m0.815s
sys 0m0.104s
+ set +x
Confirming that matrices are the same...
Files - and output/test_matrix.tsv differ
Fail :( Matrix values differ.
I did find some odd discrepancies when I ran the same test case on the 4.0.0 branch, but the results largely seem similar. The performance absolutely rocks; huge props for adding Rust to this project. I'm excited to see 4.0.0 get a stable release!
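In case it helps narrow down those discrepancies, here's a sketch of how the two matrices could be diffed numerically. The `load_matrix` helper and the second output path are illustrative, and I'm assuming the .mat.gz layout as I understand it: an '@'-prefixed JSON header line, then six BED-style columns per row before the per-bin values.

```python
import gzip

import numpy as np


def load_matrix(path, meta_cols=6):
    """Load per-bin values from a computeMatrix .mat.gz, skipping the
    '@'-prefixed JSON header line and the leading BED-style columns."""
    rows = []
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("@"):
                continue
            fields = line.rstrip("\n").split("\t")
            rows.append([float(v) for v in fields[meta_cols:]])
    return np.array(rows)


a = load_matrix("output/deeptools.bs1.mat.gz")
b = load_matrix("output/deeptools4.bs1.mat.gz")  # hypothetical 4.0.0 output path
# Assumes both runs emit regions in the same order and shape.
print(np.nanmax(np.abs(a - b)))  # largest absolute per-bin discrepancy
```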