BioAlignments.jl LocalAlignment find lower score than biopython/scikit-bio implementations

LocalAlignment find lower score than biopython/scikit-bio implementations

Open diegozea opened this issue 5 years ago • 5 comments

While I was benchmarking Julia's pairalign (318 μs) against BioPython (7310 µs) and scikit-bio (925 µs) implementation, I found that Julia's local alignment returns a solution with score 63 while python solutions have a score of 65. That makes me think that we have some hidden bug there, and we are not hitting the right solution.

Python 3.5.2

from Bio import pairwise2                                                                                                                                                                                                            
from Bio.SubsMat import MatrixInfo as matlist                                                                                                                                                                                        
matrix = matlist.blosum50                                                                                                                                                                                                            
pairwise2.align.localds("ETPRAHGALTSDNSGTTLFGKPEPMSSAEATPTASEIRNPVFSGKMDGNSLKQADSTSTR", "KEEAGSLRNEESMLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVDI", matrix, -10, -1)

[('ETPRAHGALTSDNSGTTLFGKPEPMSSAEATP--------TASEIRNPVFSGKMDGNSLKQADSTSTR',
  '--KEEAGSLRNEES--MLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVDI----',
  65.0,
  6,
  63),
 ('ETPRAHGALTSDNSGTTLFGKPEPMSSAEATP-TASEIRNPVFSGKMDGNSLKQADSTSTR----',
  '--KEEAGSLRNEES--MLKGKAEPMIYGKGEPGTVGRV-DCTASGAENSGSLGKVDMPCSSKVDI',
  65.0,
  6,
  56),
 ('ETPRAHGALTSDNSGTTLFGKPEPMSSAEATP-TASEIRNPVFSGKMDGNSLKQADSTSTR----',
  '--KEEAGSLRNEES--MLKGKAEPMIYGKGEPGTVGRVDCTA-SGAENSGSLGKVDMPCSSKVDI',
  65.0,
  6,
  56)]

from skbio.alignment import local_pairwise_align_ssw                                                                                                                                                                                 
from skbio.alignment._pairwise import blosum50                                                                                                                                                                                       
from skbio import Protein                                                                                                                                                                                                            
local_pairwise_align_ssw(Protein("ETPRAHGALTSDNSGTTLFGKPEPMSSAEATPTASEIRNPVFSGKMDGNSLKQADSTSTR"), Protein("KEEAGSLRNEESMLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVDI"), gap_open_penalty=10, gap_extend_penalty=1, substitution_matrix=blosum50)

(TabularMSA[Protein]
 --------------------------------------------------
 Stats:
     sequence count: 2
     position count: 50
 --------------------------------------------------
 GALTSDNSGTTLFGKPEPMSSAEA-TPTASEIRNPVFSGKMDGNSLKQAD
 GSLRNEES--MLKGKAEPMIYGKGEPGTVGR-VDCTASGAENSGSLGKVD, 65, [(6, 54), (4, 50)])

using BioAlignments
pairalign(LocalAlignment(), "ETPRAHGALTSDNSGTTLFGKPEPMSSAEATPTASEIRNPVFSGKMDGNSLKQADSTSTR", "KEEAGSLRNEESMLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVDI", AffineGapScoreModel(BLOSUM50, gap_open=-10, gap_extend=-1))

PairwiseAlignmentResult{Int64,String,String}:
  score: 63
  seq:  7 GALTSDNSGTTLFGKPEPMSSAEATP--------TASEIRNPVFSGKMDGNSLKQAD 55
          | |    |   | || |||      |        |||   |    || |       |
  ref:  5 GSLRNEES--MLKGKAEPMIYGKGEPGTVGRVDCTASGAENSGSLGKVDMPCSSKVD 59

In this case, BioJulia's solution is part of BioPython's first solution.

julia> versioninfo()
Julia Version 1.0.2
Commit d789231e99 (2018-11-08 20:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)

(v1.0) pkg> st
    Status `~/.julia/environments/v1.0/Project.toml`
  [c7e460c6] ArgParse v0.6.1
  [c52e3926] Atom v0.7.12
  [6e4b80f9] BenchmarkTools v0.4.1
  [e334fcbd] Bio3DView v0.0.0 #master (https://github.com/jgreener64/Bio3DView.jl)
  [00701ae9] BioAlignments v1.0.0
  [7e6ae17a] BioSequences v1.1.0
  [3c28c6f8] BioSymbols v3.1.0
  [336ed68f] CSV v0.4.3
  [a93c6f00] DataFrames v0.16.0
  [864edb3b] DataStructures v0.15.0
  [ab62b9b5] DeepDiffs v1.1.0
  [e30172f5] Documenter v0.21.0
  [5789e2e9] FileIO v1.0.5
  [1fa38f19] Format v0.7.2 [`~/.julia/dev/Format`]
  [59287772] Formatting v0.3.5
  [28b8d3ca] GR v0.37.0
  [92cae0f8] GaussDCA v0.0.0 #master (https://github.com/carlobaldassi/GaussDCA.jl)
  [c27321d9] Glob v1.2.0
  [bd48cda9] GraphRecipes v0.4.0
  [a2bd30eb] Graphics v0.4.0
  [4c0ca9eb] Gtk v0.16.4
  [7073ff75] IJulia v1.15.2
  [18364772] IPython v0.4.0
  [6218d12a] ImageMagick v0.7.1
  [86fae568] ImageView v0.8.2
  [916415d5] Images v0.17.0
  [d0351b0e] InspectDR v0.3.3
  [c601a237] Interact v0.9.0
  [e5e0dc1b] Juno v0.5.3
  [b964fa9f] LaTeXStrings v1.0.3
  [51bafb47] MIToS v2.3.3+ [`~/.julia/dev/MIToS`]
  [86f7a689] NamedArrays v0.9.2
  [47be7bcc] ORCA v0.2.0
  [3b7a836e] PGFPlots v3.0.1
  [f9da4da7] PairwiseListMatrices v0.7.0+ #master (https://github.com/diegozea/PairwiseListMatrices.jl.git)
  [0e44f1d2] PlotRecipes v0.3.0+ [`~/.julia/dev/PlotRecipes`]
  [f0f68f2c] PlotlyJS v0.12.2
  [91a5bcdd] Plots v0.22.5
  [c46f51b8] ProfileView v0.4.0
  [438e738f] PyCall v1.18.5
  [d330b81b] PyPlot v2.7.0
  [f535d66d] ROCAnalysis v0.3.0
  [3cdcf5f2] RecipesBase v0.6.0
  [60ddc479] StatPlots v0.8.2
  [88034a9c] StringDistances v0.3.0
  [37b6cedf] Traceur v0.2.0
  [b8865327] UnicodePlots v1.0.1

Jan 17 '19 16:01 diegozea

How do we know it's BioJulia and not biopython that has the bug?

The key difference seems to be the BioJulia version inserts a bigger gap near the middle, to match more elements furthur along the sequence.

Is it possible to print out how the BioJulia and the biopython algorithms fill out the dynamic programming matrix? If they're the same, then theres something different in how traceback occurs, if they differ, then something is being scored differently.

Jan 18 '19 21:01 TransGirlCodes

The issue here is that BioJulia handles gap open and extend penalties a bit differently from BioPython. A gap of size k costs gap_open + k * gap_extend in BioJulia. The same gap costs gap_open + (k - 1) * gap_extend in BioPython. So, in order to test them against each other, one would have to subtract one extension penalty from the open penalty in Python, i.e. use -11 in this example:

pairwise2.align.localds(str1, str2, matrix, -11, -1)

Feb 04 '19 17:02 tanhevg

Is there a reason to prefer gap_open + k * gap_extend over gap_open + (k - 1) * gap_extend?

Feb 04 '19 18:02 diegozea

From what I've seen in the literature, original papers from the 80's and 90's mostly use the gap_open + k * gap_extend notation. One could argue that gap_open + (k - 1) * gap_extend is somewhat cleaner, because it shows that the penalty for the first character in the gap is gap_open (as opposed to the other notation, where the penalty for the first character is gap_open + gap_extend).

This is one of those cases though where once you've picked a notation you don't want to change it, because if you do everybody's code will silently stop working after upgrading.

Feb 04 '19 18:02 tanhevg

We should keep this issue pinned, perhaps until we add a page to the docs answering FAQs like this: "Hey why does BioAlignments give a different result to XYZ?"

Feb 09 '19 20:02 TransGirlCodes

BioAlignments.jl BioAlignments.jl copied to clipboard

LocalAlignment find lower score than biopython/scikit-bio implementations

BioAlignments.jl
BioAlignments.jl copied to clipboard