mafTools
mafTools copied to clipboard
mafPairCoverage output bug
I'm seeing some strange output in mafPairCoverage when looking at coverage in regions specified from a .bed file. It looks like In Regions and Out of Regions are reversed (which is causing coverage over 100%):
# BED Regions
# Sequence Region Length In Regions Out of Regions Coverage
A_tha* 59894120 77296918 20069410 1.290559e+00
A_tha.1 15946509 22110347 5122972 1.386532e+00
Thanks for the report, this sounds interesting. The code has unit tests — feel free to add your examples of failures and issue a pull request and I'll evaluate. Or send me links to your failure examples so that I can reproduce them.
On Thursday, June 11, 2015, ens-bwalts [email protected] wrote:
I'm seeing some strange output in mafPairCoverage when looking at coverage in regions specified from a .bed file. It looks like In Regions and Out of Regions are reversed (which is causing coverage over 100%):
BED Regions
Sequence Region Length In Regions Out of Regions Coverage
A_tha* 59894120 77296918 20069410 1.290559e+00 A_tha.1 15946509 22110347 5122972 1.386532e+00
— Reply to this email directly or view it on GitHub https://github.com/dentearl/mafTools/issues/10.
It looks like it's summing all aligned positions within the bed region.
If I provide this .maf:
##maf version=1 scoring=N/A
a
s Anc0.Anc0refChr4307 30 33 + 10240 CTCACCTCCTAAAGAAACCAAGTACAAATTAAA
s Anc2.Anc2refChr1726 125 33 - 10260 CTCACCTCCTAAAGGAGCCAAGTACAAATTAAA
s Anc2.Anc2refChr4731 20097 33 - 64532 ATCACCTTCTAAAGAAACCAAGTACAAATTGAA
s A_lyr.2 17885422 33 - 19320864 CTCACCTCCTAAAGGAGCCAAGTACAAATTAAA
s A_lyr.1 5031915 33 + 33132539 CTCACCTTCTAAAGAAACCAAGTACAAATTGAA
s A_tha.1 4214339 33 + 30427671 ATCACCTTCTAAAGAAACCAAGTACAAATTGAA
s A_tha.1 23284080 33 + 30427671 CTCACCTCCTAAAGGAGCCAAGTACAAATTGAA
And look for the regions in this .bed:
A_tha.1 4214340 4214343
A_tha.1 5000000 5001000
mafPairCoverage -m test.maf --seq1 A_tha* --seq2 A_lyr* --bed test.bed -v 1
gives this output
# Overall
# Sequence Source Length Obs Length Aligned Pos. Coverage Obs Coverage
A_tha* 30427671 66 132 4.338157e-06 2.000000e+00
A_lyr* 52453403 66 132 2.516519e-06 2.000000e+00
# Individual Sequences
# Sequence Source Length Obs Length Aligned Pos. Coverage Obs Coverage
A_tha.1 30427671 66 132 4.338157e-06 2.000000e+00
A_lyr.1 33132539 33 66 1.991999e-06 2.000000e+00
A_lyr.2 19320864 33 66 3.415996e-06 2.000000e+00
# BED Regions
# Sequence Region Length In Regions Out of Regions Coverage
A_tha* 1003 6 126 5.982054e-03
A_tha.1 1003 6 126 5.982054e-03
Counting 3 bases for the alignment to A_lyr.1 and 3 bases for the alignment to A_lyr.2 as being in regions.
Thanks for posting an example! Okay, I'll try to eek out some time to investigate this in the next few weeks (my full time job doesn't allow for much hacking around currently).