Document performance considerations?
I'd like to use pyBigWig to collect values at many intervals from many bigwigs, and I'd love to know what's performant.
- Is there overhead to opening a bigwig with pyBigWig? i.e. what's the runtime difference between:
    with pyBigWig.open(bigwig_file) as bw:
        for chrom, start, stop in intervals:
            bw.values(chrom, start, stop)
  and
    for chrom, start, stop in intervals:
        with pyBigWig.open(bigwig_file) as bw:
            bw.values(chrom, start, stop)
- If the former is optimal, is there any advantage to the intervals being sorted?
- Do you know the relative performance of pyBigWig entries() queries of bigBed files versus tabix queries of gzipped bed files?
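One way to answer the first two questions empirically is to time both patterns on your own files. The following is a minimal sketch, not a benchmark result: the helper names `time_open_once` and `time_open_per_interval` are made up for illustration, and the opener is passed in as `open_fn` (an assumption made so the sketch does not depend on how pyBigWig is installed).

```python
import time


def time_open_once(open_fn, path, intervals):
    """Open the file a single time, then query every interval."""
    start_time = time.perf_counter()
    bw = open_fn(path)
    try:
        for chrom, start, stop in intervals:
            bw.values(chrom, start, stop)
    finally:
        bw.close()
    return time.perf_counter() - start_time


def time_open_per_interval(open_fn, path, intervals):
    """Reopen the file for every interval before querying it."""
    start_time = time.perf_counter()
    for chrom, start, stop in intervals:
        bw = open_fn(path)
        try:
            bw.values(chrom, start, stop)
        finally:
            bw.close()
    return time.perf_counter() - start_time
```

With pyBigWig installed, this could be called as `time_open_once(pyBigWig.open, bigwig_file, intervals)`. To test the effect of sorting, compare a run on `intervals` with a run on `sorted(intervals)`, which orders the tuples by chromosome name and then start.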
I think a vectorized version of bw.values would be much better, e.g.
    bw.values(np.array([chrom]*3), np.array([79250, 86700, 87277]), np.array([80250, 87700, 88277]), numpy=True)
This would return a list of numpy arrays without the caller iterating over the intervals in a loop. But I guess this is not implemented yet.
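Until something like that exists, a thin wrapper can give the same call shape. This is only a sketch: `values_for_intervals` is a hypothetical helper, not part of pyBigWig, and it still loops in Python internally, so it changes the interface rather than the speed. It assumes `bw` is an open pyBigWig file, and that `numpy=True` should only be passed when requested, since builds without numpy support may not accept the keyword.

```python
def values_for_intervals(bw, chroms, starts, ends, as_numpy=False):
    """Return one sequence of values per interval, in the input order."""
    # Only pass numpy=True when asked for it (assumption: builds of
    # pyBigWig without numpy support may reject the keyword).
    extra = {"numpy": True} if as_numpy else {}
    # Visit intervals ordered by (chrom, start); whether this helps
    # performance is exactly the open question, so treat it as optional.
    order = sorted(range(len(chroms)),
                   key=lambda i: (str(chroms[i]), int(starts[i])))
    results = [None] * len(chroms)
    for i in order:
        results[i] = bw.values(str(chroms[i]), int(starts[i]),
                               int(ends[i]), **extra)
    return results
```

The arguments can be plain lists or numpy arrays; each element is converted to `str` or `int` before being handed to `bw.values`.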
@dpryan79 what is the fastest way to get arrays of values from a bigwig file for each of many genomic intervals (i.e. entries in a bed file)?
For others: I found that a better solution for the task described above is the bigWigAverageOverBed tool from UCSC.
$ ./bigWigAverageOverBed
bigWigAverageOverBed v2 - Compute average score of big wig over each bed, which may have introns.
usage:
bigWigAverageOverBed in.bw in.bed out.tab
The output columns are:
name - name field from bed, which should be unique
size - size of bed (sum of exon sizes
covered - # bases within exons covered by bigWig
sum - sum of values over all bases covered
mean0 - average over bases with non-covered bases counting as zeroes
mean - average over just covered bases
Options:
-stats=stats.ra - Output a collection of overall statistics to stat.ra file
-bedOut=out.bed - Make output bed that is echo of input bed but with mean column appended
-sampleAroundCenter=N - Take sample at region N bases wide centered around bed item, rather
than the usual sample in the bed item.
-minMax - include two additional columns containing the min and max observed in the area.
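To use the result from Python, the output table can be read back with the standard library. This is a minimal sketch under stated assumptions: it assumes out.tab is tab-separated with no header row and has the six columns in the order listed in the usage text (no -minMax columns). `build_command` and `parse_average_over_bed` are names made up for illustration.

```python
import csv


def build_command(in_bw, in_bed, out_tab):
    """Argument list matching the usage line: in.bw in.bed out.tab."""
    return ["bigWigAverageOverBed", in_bw, in_bed, out_tab]


def parse_average_over_bed(lines):
    """Map each bed name to its size, covered, sum, mean0 and mean."""
    rows = {}
    for fields in csv.reader(lines, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        # Assumed column order: name, size, covered, sum, mean0, mean
        name, size, covered, total, mean0, mean = fields[:6]
        rows[name] = {
            "size": int(size),
            "covered": int(covered),
            "sum": float(total),
            "mean0": float(mean0),
            "mean": float(mean),
        }
    return rows
```

The command list can be passed to `subprocess.run(..., check=True)`, after which `parse_average_over_bed(open("out.tab"))` reads the table. Because the rows are keyed by the name field, the bed names need to be unique, as the usage text already requires.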