BioAlignments.jl
BioAlignments.jl copied to clipboard
LocalAlignment find lower score than biopython/scikit-bio implementations
While I was benchmarking Julia's pairalign
(318 μs) against BioPython (7310 µs) and scikit-bio (925 µs) implementation, I found that Julia's local alignment returns a solution with score 63 while python solutions have a score of 65. That makes me think that we have some hidden bug there, and we are not hitting the right solution.
Python 3.5.2
from Bio import pairwise2
from Bio.SubsMat import MatrixInfo as matlist
matrix = matlist.blosum50
pairwise2.align.localds("ETPRAHGALTSDNSGTTLFGKPEPMSSAEATPTASEIRNPVFSGKMDGNSLKQADSTSTR", "KEEAGSLRNEESMLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVDI", matrix, -10, -1)
[('ETPRAHGALTSDNSGTTLFGKPEPMSSAEATP--------TASEIRNPVFSGKMDGNSLKQADSTSTR',
'--KEEAGSLRNEES--MLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVDI----',
65.0,
6,
63),
('ETPRAHGALTSDNSGTTLFGKPEPMSSAEATP-TASEIRNPVFSGKMDGNSLKQADSTSTR----',
'--KEEAGSLRNEES--MLKGKAEPMIYGKGEPGTVGRV-DCTASGAENSGSLGKVDMPCSSKVDI',
65.0,
6,
56),
('ETPRAHGALTSDNSGTTLFGKPEPMSSAEATP-TASEIRNPVFSGKMDGNSLKQADSTSTR----',
'--KEEAGSLRNEES--MLKGKAEPMIYGKGEPGTVGRVDCTA-SGAENSGSLGKVDMPCSSKVDI',
65.0,
6,
56)]
from skbio.alignment import local_pairwise_align_ssw
from skbio.alignment._pairwise import blosum50
from skbio import Protein
local_pairwise_align_ssw(Protein("ETPRAHGALTSDNSGTTLFGKPEPMSSAEATPTASEIRNPVFSGKMDGNSLKQADSTSTR"), Protein("KEEAGSLRNEESMLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVDI"), gap_open_penalty=10, gap_extend_penalty=1, substitution_matrix=blosum50)
(TabularMSA[Protein]
--------------------------------------------------
Stats:
sequence count: 2
position count: 50
--------------------------------------------------
GALTSDNSGTTLFGKPEPMSSAEA-TPTASEIRNPVFSGKMDGNSLKQAD
GSLRNEES--MLKGKAEPMIYGKGEPGTVGR-VDCTASGAENSGSLGKVD, 65, [(6, 54), (4, 50)])
using BioAlignments
pairalign(LocalAlignment(), "ETPRAHGALTSDNSGTTLFGKPEPMSSAEATPTASEIRNPVFSGKMDGNSLKQADSTSTR", "KEEAGSLRNEESMLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVDI", AffineGapScoreModel(BLOSUM50, gap_open=-10, gap_extend=-1))
PairwiseAlignmentResult{Int64,String,String}:
score: 63
seq: 7 GALTSDNSGTTLFGKPEPMSSAEATP--------TASEIRNPVFSGKMDGNSLKQAD 55
| | | | || ||| | ||| | || | |
ref: 5 GSLRNEES--MLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVD 59
In this case, BioJulia's solution is part of BioPython's first solution.
julia> versioninfo()
Julia Version 1.0.2
Commit d789231e99 (2018-11-08 20:11 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, haswell)
(v1.0) pkg> st
Status `~/.julia/environments/v1.0/Project.toml`
[c7e460c6] ArgParse v0.6.1
[c52e3926] Atom v0.7.12
[6e4b80f9] BenchmarkTools v0.4.1
[e334fcbd] Bio3DView v0.0.0 #master (https://github.com/jgreener64/Bio3DView.jl)
[00701ae9] BioAlignments v1.0.0
[7e6ae17a] BioSequences v1.1.0
[3c28c6f8] BioSymbols v3.1.0
[336ed68f] CSV v0.4.3
[a93c6f00] DataFrames v0.16.0
[864edb3b] DataStructures v0.15.0
[ab62b9b5] DeepDiffs v1.1.0
[e30172f5] Documenter v0.21.0
[5789e2e9] FileIO v1.0.5
[1fa38f19] Format v0.7.2 [`~/.julia/dev/Format`]
[59287772] Formatting v0.3.5
[28b8d3ca] GR v0.37.0
[92cae0f8] GaussDCA v0.0.0 #master (https://github.com/carlobaldassi/GaussDCA.jl)
[c27321d9] Glob v1.2.0
[bd48cda9] GraphRecipes v0.4.0
[a2bd30eb] Graphics v0.4.0
[4c0ca9eb] Gtk v0.16.4
[7073ff75] IJulia v1.15.2
[18364772] IPython v0.4.0
[6218d12a] ImageMagick v0.7.1
[86fae568] ImageView v0.8.2
[916415d5] Images v0.17.0
[d0351b0e] InspectDR v0.3.3
[c601a237] Interact v0.9.0
[e5e0dc1b] Juno v0.5.3
[b964fa9f] LaTeXStrings v1.0.3
[51bafb47] MIToS v2.3.3+ [`~/.julia/dev/MIToS`]
[86f7a689] NamedArrays v0.9.2
[47be7bcc] ORCA v0.2.0
[3b7a836e] PGFPlots v3.0.1
[f9da4da7] PairwiseListMatrices v0.7.0+ #master (https://github.com/diegozea/PairwiseListMatrices.jl.git)
[0e44f1d2] PlotRecipes v0.3.0+ [`~/.julia/dev/PlotRecipes`]
[f0f68f2c] PlotlyJS v0.12.2
[91a5bcdd] Plots v0.22.5
[c46f51b8] ProfileView v0.4.0
[438e738f] PyCall v1.18.5
[d330b81b] PyPlot v2.7.0
[f535d66d] ROCAnalysis v0.3.0
[3cdcf5f2] RecipesBase v0.6.0
[60ddc479] StatPlots v0.8.2
[88034a9c] StringDistances v0.3.0
[37b6cedf] Traceur v0.2.0
[b8865327] UnicodePlots v1.0.1
How do we know it's BioJulia and not biopython that has the bug?
The key difference seems to be the BioJulia version inserts a bigger gap near the middle, to match more elements furthur along the sequence.
Is it possible to print out how the BioJulia and the biopython algorithms fill out the dynamic programming matrix? If they're the same, then theres something different in how traceback occurs, if they differ, then something is being scored differently.
The issue here is that BioJulia handles gap open and extend penalties a bit differently from BioPython. A gap of size k
costs gap_open + k * gap_extend
in BioJulia. The same gap costs gap_open + (k - 1) * gap_extend
in BioPython. So, in order to test them against each other, one would have to subtract one extension penalty from the open penalty in Python, i.e. use -11 in this example:
pairwise2.align.localds(str1, str2, matrix, -11, -1)
Is there a reason to prefer gap_open + k * gap_extend
over gap_open + (k - 1) * gap_extend
?
From what I've seen in the literature, original papers from the 80's and 90's mostly use the gap_open + k * gap_extend
notation. One could argue that gap_open + (k - 1) * gap_extend
is somewhat cleaner, because it shows that the penalty for the first character in the gap is gap_open
(as opposed to the other notation, where the penalty for the first character is gap_open + gap_extend
).
This is one of those cases though where once you've picked a notation you don't want to change it, because if you do everybody's code will silently stop working after upgrading.
We should keep this issue pinned, perhaps until we add a page to the docs answering FAQs like this: "Hey why does BioAlignments give a different result to XYZ?"