pyani
pyani copied to clipboard
Alignment coverage >1.0
Summary:
I am comparing a large number of genomes and occasionally the ANIm_alignment_coverage.tab file indicates a value >1.0. I would have though this value was bounded between [0, 1.0]. How does one interpret an alignment coverage > 1.0?
Description:
I am using PyANI v0.2.11 as follows:
average_nucleotide_identity.py -i genomes/ -o anim -v -m ANIm
I can narrow the issue down to a pair of viral genomes which are in the genomes
directory.
Reproducible Steps:
I can provide the pair of viral genomes which cause this issue.
Current Output:
The ANIm_alignment_coverage.tab
file indicates the alignment coverage between these pairs of genomes is 1.49 and 1.51, respectively.
Expected Output:
I would have though the alignment coverage would have a maximum value of 1.0. That is, any region of the query genome would align to at most one region in the target genome and thus it would be impossible to cover more than 100% of a genome.
pyani Version:
0.2.11
I can add that ANIb produces the expect result for these two genomes with an alignment coverage of 0.98 and 0.99.
Thanks @dparks1134 - that's weird. If you could provide the input genomes that would be very helpful.
Cheers,
L.
FWIW I found (and fixed) a related bug while trying this out on a couple of (unrelated) virus genomes. I don't think it will fix your issue, though.
Thanks - I can confirm your ANIm output. Looking into why, just now.
Well - the issue appears to be that MUMmer is reporting two overlapping alignments. One alignment runs from position 85 to 37713 in the query; the other alignment runs from 17709 to 39253 in 0264574. Inspecting the alignment length output shows that the aligned sequence length is longer than either genome.
We use the --mum
option to ensure that there are unique anchors for the NUCmer alignment, so this was not the behaviour I was expecting. I errored in my understanding of how MUMmer generates alignment output.
I think that what we should have been doing all along is also to use the --noextend
option, by default. With this option we still get alignments at {85, 27720} and {27702, 39253} in the query and the ANI results are otherwise in line with ANIb. I suspect, though don't yet know for certain, that the default --extend
option in NUCmer allows the anchors to extend into each other - and using --mum --noextend
is needed to ensure that alignments are unique to both query and subject sequence (though in this example they still overlap by 18 bases!).
I feel quite bad about missing that. I'm not going to defend it by claiming that we're reimplementing the Rossello-Mora et al. implementation - by default we use --mum
(not indicated in that paper) to try to ensure we don't align repeats. It's just a mistake.
I'm going to add a command-line option --noextend
to pyani v0.2, so that we retain backwards compatibility by default. @baileythegreen - I think we should maybe make --noextend
the default NUCmer behaviour for v0.3, and implement an --extend
option for pyani anim
that allows us to use pyani compare
to see what the damage is likely to have been under the assumption above.
I'll make the changes and push a v0.2.12 release with an apology and explanation.
Many thanks for spotting this and bringing this to our attention @dparks1134 - this is a very important improvement that will impact directly on the next iteration of pyani
. I appreciate the heads-up, and hope it doesn't put you off using the tool.
Cheers,
L.
@all-contributors please add @dparks1134 for bug
Thanks for the detailed explanation. Nice to know what is going on.
@widdowquinn I've created a new issue for this in version 3, and will prioritise that, since it probably qualifies as a bug.
Hi. I implemented the proposed noextend
fix in v0.2.11. This does reduce the number of pairs with an alignment coverage >1, but this still occur in a few instances. The values are just above one in all cases I have observed so perhaps this is fine, but it would be good to ensure things aren't being double counted. I've attached an email pair of genomes.
Thanks @dparks1134 - it does look like even --noextend
cannot guarantee that no bases are double-counted in the MUMmer alignment.
I've been talking with @baileythegreen about potentially using interval graphs to ensure we don't double-count and, if you've found further examples of double-counting with --noextend
already, it looks like we're going to have to do that instead.
I expect this hasn't been noticed already because the overlap size probably gets lost in the noise for bacterial genomes, and the values look plausible.
Hi @widdowquinn, do you know if it is sensible to run dnadiff
and look at the generated 1-to-1 coordinates and/or alignment file? I know at least one person that has taken this approach, but I don't know enough about the MUMmer package to know if this is actually correct.
Thanks for the suggestion - I've taken a look at dnadiff
and it doesn't appear to do anything differently to what pyani
already does - it runs a default nucmer
search and then uses delta-filter
to generate a 1:1 alignment (extension .1coords
with dnadiff
). It does some things in addition like identifying SNPs and reporting M-to-M alignments, but they don't contribute to ANI calculations.
pyani
's current ANIm nucmer
output and the .1coords
output are identical. There are no useful options to dnadiff
that I see; we can't, for instance, pass through either --mum
or --noextend
to nucmer
.
nucmer
default, followed by delta-filter -1
:
/Users/lpritc/Development/GitHub/pyani/issue_340/high_align_cov/MGV-GENOME-0264574.fna /Users/lpritc/Development/GitHub/pyani/issue_340/high_align_cov/MGV-GENOME-0266457.fna
NUCMER
>MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457 39253 39594
85 37713 1 37636 215 215 0
-3013
-24624
-1
-1
-1
-1
-1
0
17709 39253 17626 39176 7 7 0
-9994
-1
-1
-1
-1
-1
0
dnadiff
's .1delta
output:
/Users/lpritc/Development/GitHub/pyani/issue_340/high_align_cov/MGV-GENOME-0264574.fna /Users/lpritc/Development/GitHub/pyani/issue_340/high_align_cov/MGV-GENOME-0266457.fna
NUCMER
>MGV_MGV-GENOME-0264574 MGV_MGV-GENOME-0266457 39253 39594
85 37713 1 37636 215 215 0
-3013
-24624
-1
-1
-1
-1
-1
0
17709 39253 17626 39176 7 7 0
-9994
-1
-1
-1
-1
-1
0
dnadiff's
.report
- notice that it claims a total alignment of 59kbp for the 39kbp genomes:
$ mkdir -p dnadiff_default && dnadiff -p dnadiff_default/high_align_test high_align_cov/MGV-GENOME-0264574.fna high_align_cov/MGV-GENOME-0266457.fna
[...]
/Users/lpritc/Development/GitHub/pyani/issue_340/high_align_cov/MGV-GENOME-0264574.fna /Users/lpritc/Development/GitHub/pyani/issue_340/high_align_cov/MGV-GENOME-0266457.fna
NUCMER
[REF] [QRY]
[Sequences]
TotalSeqs 1 1
AlignedSeqs 1(100.00%) 1(100.00%)
UnalignedSeqs 0(0.00%) 0(0.00%)
[Bases]
TotalBases 39253 39594
AlignedBases 39169(99.79%) 39176(98.94%)
UnalignedBases 84(0.21%) 418(1.06%)
[Alignments]
1-to-1 2 2
TotalLength 59174 59187
AvgLength 29587.00 29593.50
AvgIdentity 99.63 99.63
[...]
I'm surprised and disappointed that even the --noextend
option in nucmer
gives overlapping alignments. We can account for this, but it will take some additional output filtering of the nucmer
output prior to calculation. We're investigating how much of a difference using --noextend
makes to our analyses (the tests seem to be nearly all within tolerance, so far), but it's clearly not going to be the actual solution, if we want to ensure that there are no overlaps between alignments.
Dose it make sense to follow the AlignedBases
approach used by dnadiff? The answer certainly seems sensible in this case, but it is unclear to me exactly how this number is established.
Yes, the AlignedBases
value looks to be a good place to start for coverage. For instance, dnadiff
reports 39169 AlignedBases
for the reference, which corresponds to the 39253 - 85 interval we'd expect when not double-counting the overlap. That helps towards improving the coverage/aligned fraction calculation.
Currently, the ANI identity calculation will be double-counting the overlap, and AlignedBases
doesn't look to help with that. I think we may still need to deduplicate overlapping regions in order to give an accurate ANI value.
Largely a note for me and @baileythegreen after today's discussion...
It looks like we might be able to get everything we need from a combination of the .1coords
and .snps
files that dnadiff
produces, so we can get the "right" answers for identity, coverage, etc.
However, for very large comparisons on a cluster we probably don't want dnadiff
to generate that suite of multiple output files, so it might be best to do something like run show-snps
and show-coords
directly (catching each STDOUT) in a wrapper, and pass the results of the calculation to the database, in v0.3+ - though there may be a better solution.
My initial optimism may have been misplaced. There appear to be some disturbing differences in the Caulobacter test set - especially wrt coverage.
ANIm_alignment_coverage_noextend.pdf ANIm_alignment_coverage.pdf
Further investigation of the dnadiff
and nucmer
parsing is making me optimistic again. It's quite clear that dnadiff
is not using --noextend
, but is using something like a tweaked --maxmatch
, and is then using the output of M-to-M matches (which we avoid), and processing them to remove overlaps to get the value in AlignedBases
.
The difference in output between what we do and what dnadiff
does turns out to be quite small for a bacterial test set.
Where the dnadiff
.report
output gives AlignedBases = 1414384
(reference) or 1424179
(query), the 1:1 alignments are actually 1408533
and 1409770
in total length on reference and query genomes; the M:M alignments are 1442608
and 1443869
respectively.
I'd argue that we should be only using 1:1 alignments, so AlignedBases
is not the right number to use.
However, we were not originally accounting for overlapping alignments. When we do account for them, dnadiff
gives (look at ref_aligned_size
and query_aligned_size
):
infname=PosixPath('2021-11-10/test_dnadiff_output/NC_002696_vs_NC_010338.1coords')
aln_ref=1408533, aln_query=1409770, cov_ref=35.050000000000125, cov_query=25.850000000000044, num_rows=1206, overlaps=11, overlap_len=17664
len(ref_tree)=1195, len(query_tree)=1201
ref_aligned_size=1401112, query_aligned_size=1408639
infname=PosixPath('2021-11-10/test_dnadiff_output/NC_002696_vs_NC_010338.mcoords')
aln_ref=1442608, aln_query=1443869, cov_ref=35.82000000000012, cov_query=26.40000000000006, num_rows=1275, overlaps=53, overlap_len=50359
len(ref_tree)=1222, len(query_tree)=1230
ref_aligned_size=1414420, query_aligned_size=1424194
and our standard pyani
approach gives:
infname=PosixPath('2021-11-10/test_nucmer_output/NC_002696_vs_NC_010338.fcoords')
aln_ref=1388340, aln_query=1389491, cov_ref=34.59000000000013, cov_query=25.51000000000005, num_rows=1173, overlaps=5, overlap_len=7845
len(ref_tree)=1168, len(query_tree)=1172
ref_aligned_size=1384814, query_aligned_size=1389479
With our alignment choices (including 1:1 match filtering) we align about 1.38Mbp where dnadiff
aligns 1.41Mbp.
However, if we use --maxmatch
rather than --mum
, we get:
infname=PosixPath('2021-11-10/test_nucmer_maxmatch_output/NC_002696_vs_NC_010338.fcoords')
aln_ref=1408533, aln_query=1409770, cov_ref=35.050000000000125, cov_query=25.850000000000044, num_rows=1206, overlaps=11, overlap_len=17664
len(ref_tree)=1195, len(query_tree)=1201
ref_aligned_size=1401112, query_aligned_size=1408639
and almost exactly recover the dnadiff
output.
So, while our output differs, it appears to do so mostly because we use --mum
rather than --maxmatch
. As --maxmatch
is an option intended to find inexact repeats and uses all anchors, whether unique or not, --mum
uses only anchors that are unique in both reference and query. We do offer the --maxmatch
option to users already if they don't agree with our default, so I think we might have this covered.
However, that's not the problem with issue #340 ;)
The issue with #340 is that we were not emulating the overlap removal that dnadiff
employs when generating its .report
file. Note that the raw alignment lengths, and the .1coord
/.mcoord
outputs have the same problem we did - this from the original virus comparison:
[Bases]
TotalBases 39253 39594
AlignedBases 39169(99.79%) 39176(98.94%)
UnalignedBases 84(0.21%) 418(1.06%)
[Alignments]
1-to-1 2 2
TotalLength 59174 59187
AvgLength 29587.00 29593.50
AvgIdentity 99.63 99.63
M-to-M 2 2
TotalLength 59174 59187
AvgLength 29587.00 29593.50
AvgIdentity 99.63 99.63
AlignedBases
is great, but the alignment lengths are not.
We can fix this by running an IntervalTree
over the alignment co-ordinates. I did that for the virus comparisons to get:
infname=PosixPath('2021-11-10/virus_dnadiff_output/virus_dnadiff.1coords')
aln_ref=59174, aln_query=59187, cov_ref=150.75, cov_query=149.48, num_rows=2, overlaps=1, overlap_len=39169
len(ref_tree)=1, len(query_tree)=1
ref_aligned_size=39169, query_aligned_size=39176
infname=PosixPath('2021-11-10/virus_dnadiff_output/virus_dnadiff.mcoords')
aln_ref=59174, aln_query=59187, cov_ref=150.75, cov_query=149.48, num_rows=2, overlaps=1, overlap_len=39169
len(ref_tree)=1, len(query_tree)=1
ref_aligned_size=39169, query_aligned_size=39176
infname=PosixPath('2021-11-10/virus_nucmer_output/virus_nucmer.coords')
aln_ref=59174, aln_query=59187, cov_ref=150.75, cov_query=149.48, num_rows=2, overlaps=1, overlap_len=39169
len(ref_tree)=1, len(query_tree)=1
ref_aligned_size=39169, query_aligned_size=39176
and recovers AlignedBases
exactly, whether we use dnadiff
or our existing approach.
So, I think we should be generating a .coords
file output and parsing it through an IntervalTree
to remove/account for overlaps. That's the main change.
I think we already allow users to choose between --maxmatch
/--mum
so no change there.
I don't understand the process by which AlignedBases
is generated exactly, but as it appears to be based on M:M alignments, I think it's possibly the wrong value for us to use, if we're trying to identify uniquely-homologous regions.
Once we correct for the overlaps, I think that our agreement with dnadiff
/other methods should be pretty good and, except in unusual cases like the MAG viruses, I don't currently think it will affect pyani
outputs to a large degree - though it will be an improvement.
Historical note
I think I know why I misunderstood mummer
's operation. I taught sequence alignment as part of a computational biology course for a few years, and would use the Progressive Mauve paper in that (I was rereading it for other reasons, last week). This paper describes mummer
as follows (my emphasis):
MUMmer can identify and align genomes with rearranged homologous sequences. However MUMmer does not align paralogous sequences (repeats within a genome), nor does it align all copies of multi-copy orthologous sequence. Because it aligns any site to at most one site in the other genome, and due to the way it anchors alignment of repetitive sequence using neighboring unique regions, MUMmer often aligns the positionally conserved copy of a repeat element.
I think I might have internalised this explanation and never realised that mummer
can in fact extend those alignments without regard to whether it maps one site in the reference to two or more sites in the query (or vice versa)…
I have been looking at the pybedtools
library after rereading over discussion in #342. The suggestion there was to use it for nucmer
file parsing.
I think pybedtools
concept of an Interval
has a lot of attributes that are not necessarily relevant in our case—strand
for one, chromosome
for another, but the class itself is more amenable to extension/modification than is the case with the IntervalTree
Interval
class.
Example:
pybedtools.Interval
has the following attributes (the Python OOP concept of attributes, not what pybedtools
refers to as attributes):
-
chrom
-
start
-
stop
-
name
-
score
-
strand
-
additional key:value pairs
When any of these are not specified for a given Interval
, they are assigned the value .
. We really only need our intervals to have four attributes for the kind of manipulation we want to do:
-
sequence
-
start
-
stop
-
counterpart
where sequence
is the particular query sequence, and counterpart
is the particular reference sequence.
Because pybedtools
use a more standard class definition for their concept of Interval
, I can define a class which inherits from pybedtools.Interval
and use property decorators and corresponding setter methods to provide alternate names for select attributes, so that they are called things which make more sense within the context of our data:
-
chrom
-->sequence
-
start
-
stop
-
name
-->counterpart
-
score
-
strand
-
additional key:value pairs
This does not remove those attributes from the class (so it wouldn't affect any backend pybedtools
functionality, should we need to use that; it just provides a different name for them that we can use in the pyani
code. It is bypassing the way the pybedtools
documentation indicates Intervals
should be created, but those are all based on file structures that do not apply here, so I think this is fine.
Whether this is preferable to using the custom classes I have written may depend on speed and whatever benefits come from using pybedtools
, if any.
I had a feeling that, in my testing, the route to getting what I believe to be the correct count of AlignedBases
was not very tricky. The (hacky) script I used for that is here:
#!/usr/bin/env python3
import csv
from pathlib import Path
import intervaltree
# Parse .coords files and try to replicate AlignedBases
def parse_coords(infname):
aln_ref = 0
aln_query = 0
cov_ref = 0
cov_query = 0
num_rows = 0
last_row = None
overlaps = 0
overlap_len = 0
ref_intervals = []
query_intervals = []
with infname.open() as ifh:
fieldnames = [
"start_ref",
"end_ref",
"start_query",
"end_query",
"aln_len_ref",
"aln_len_query",
"aln_id",
"len_ref",
"len_query",
"cov_ref",
"cov_query",
"id_ref",
"id_query",
]
reader = csv.DictReader(ifh, delimiter="\t", fieldnames=fieldnames)
for row in reader:
num_rows += 1
aln_ref += int(row["aln_len_ref"])
aln_query += int(row["aln_len_query"])
cov_ref += float(row["cov_ref"])
cov_query += float(row["cov_query"])
ref_intervals.append(sorted((int(row["start_ref"]), int(row["end_ref"]))))
query_intervals.append(
sorted((int(row["start_query"]), int(row["end_query"])))
)
ref_tree = intervaltree.IntervalTree.from_tuples(ref_intervals)
ref_tree.merge_overlaps()
ref_aligned_size = 0
for interval in ref_tree:
ref_aligned_size += interval.end - interval.begin + 1
query_tree = intervaltree.IntervalTree.from_tuples(query_intervals)
query_tree.merge_overlaps()
query_aligned_size = 0
for interval in query_tree:
query_aligned_size += interval.end - interval.begin + 1
print(f"{infname=}")
print(
f"{aln_ref=}, {aln_query=}, {cov_ref=}, {cov_query=}, {num_rows=}" # , {overlaps=}, {overlap_len=}"
)
print(f"{len(ref_tree)=}, {len(query_tree)=}")
print(f"{ref_aligned_size=}, {query_aligned_size=}")
print()
# virus output
infname = Path("2021-11-10/virus_dnadiff_output/virus_dnadiff.1coords")
parse_coords(infname)
infname = Path("2021-11-10/virus_dnadiff_output/virus_dnadiff.mcoords")
parse_coords(infname)
infname = Path("2021-11-10/virus_nucmer_output/virus_nucmer.coords")
parse_coords(infname)
@widdowquinn I think that script is merging intervals it should not; at least, based on what I understood from a discussion a while back about when intervals should be merged. For example:
Assume the input file:
1 5 20 24 300 300 90.0 500 330 1.0 1.0 id1 id2
3 7 97 24 300 300 90.0 500 400 1.0 1.0 id1 id3
1146 1183 1 37 300 300 100.00 2000 700 3.14 100.00 id4 id5
1 164 164 1 300 300 100.00 2000 800 13.88 21.78 id4 id6
163 166 179 182 300 300 100.00 2000 800 0.29 0.46 id4 id6
165 1140 1 974 300 300 100.00 2000 1000 82.34 100.00 id4 id7
This has data on two query contigs:
id1
1 5 20 24 300 300 90.0 500 330 1.0 1.0 id1 id2
3 7 97 24 300 300 90.0 500 400 1.0 1.0 id1 id3
and id4
1146 1183 1 37 300 300 100.00 2000 700 3.14 100.00 id4 id5
1 164 164 1 300 300 100.00 2000 800 13.88 21.78 id4 id6
163 166 179 182 300 300 100.00 2000 800 0.29 0.46 id4 id6
165 1140 1 974 300 300 100.00 2000 1000 82.34 100.00 id4 id7
If overlapping intervals should be merged so long as they belong to the same contig, then the query intervals for id1
should be merged, producing one interval of 1–7
, and for id4
, the last three intervals should also be merged, giving [1–1140
, 1146–1183
].
If, however, the identity of the reference contig must also be considered before deciding to merge, then id1
reduces to [1–5
, 3–7
] and id4
to [1–166
, 165–1140
, 1146–1183
].
With your script, I get the former (inserting a print
statement following the merge_overlaps()
calls), but from our discussion a while back, I was under the impression that it should be the latter.
Which is it?
We could have left this to tomorrow's meeting, but it may be useful to have a written record here, to avoid misunderstanding.
In a pairwise comparison, there are two genomes being compared. Let's call them genomes A and B.
In your example above, contigs id1
and id4
belong to genome A. Contigs id2
, id3
, id5
, id6
and id7
belong to genome B.
We are interested in which parts of genome A align to genome B (and vice versa, but as in your example let's consider only genome A). We are interested, for the purpose of calculating genome coverage, in the number of unique bases in genome A that align to genome B. It doesn't matter to which parts of genome B the regions of genome A align, and specifically it doesn't matter which contigs they align to, so long as we are willing to accept each of the individual alignments as valid.
Here, genome A is divided into two contigs: id1
and id4
.
The regions of id1
that align to genome B are 1-5
and 3-7
, so we merge these to get the single region 1-7
.
The regions of id4
that align to genome B are 1-164
, 163-166
, 165-1140
, and 1146-1183
, so we merge these to get the two regions 1-1140
and 1146-1183
.
I hope this clarifies things.
Fixed with 1f10a6c.