Biostrings
Biostrings copied to clipboard
trimLRPatterns with indels not trimming as intended in case of net insertions or deletions between subject and pattern?
In the following examples I'd expect trimLRPatterns to always leave ATCG:
library(Biostrings)
subject <- DNAStringSet(c("AAATCTCTCATCG"))
trimLRPatterns(subject = subject, Lpattern = "AAATCTCTC", Rpattern = "" , with.Lindels = TRUE, max.Lmismatch = 0.3 )
#> DNAStringSet object of length 1:
#> width seq
#> [1] 4 ATCG
trimLRPatterns(subject = subject, Lpattern = "AAAATCTCTC", Rpattern = "" , with.Lindels = TRUE, max.Lmismatch = 0.3 )
#> DNAStringSet object of length 1:
#> width seq
#> [1] 3 TCG
trimLRPatterns(subject = subject, Lpattern = "AATCTCTC", Rpattern = "" , with.Lindels = TRUE, max.Lmismatch = 0.3 )
#> DNAStringSet object of length 1:
#> width seq
#> [1] 5 CATCG
Created on 2022-11-18 by the reprex package (v2.0.1)
I think the issue is that Biostrings does not consider terminal gaps in alignments to be indels. See https://github.com/Bioconductor/pwalign/issues/2.
ATCG is retained if you specify with.Lindels = FALSE.
subject <- DNAStringSet(c("AAATCTCTCATCG"))
trimLRPatterns(subject = subject, Lpattern = "AAATCTCTC", Rpattern = "" , with.Lindels = FALSE, max.Lmismatch = 0.3)
#> DNAStringSet object of length 1:
#> width seq
#> [1] 4 ATCG
trimLRPatterns(subject = subject, Lpattern = "AAAATCTCTC", Rpattern = "" , with.Lindels = FALSE, max.Lmismatch = 0.3)
#> DNAStringSet object of length 1:
#> width seq
#> [1] 4 ATCG
trimLRPatterns(subject = subject, Lpattern = "AATCTCTC", Rpattern = "" , with.Lindels = FALSE, max.Lmismatch = 0.3)
#> DNAStringSet object of length 1:
#> width seq
#> [1] 6 TCATCG
Possibly. But while that issue might be considered odd behavior, this here is a bug, right?
Though by I am even more confused by your last with.Lindels = FALSE example right now.