Biostrings icon indicating copy to clipboard operation
Biostrings copied to clipboard

Allow returning after only first match of pattern

Open BlueDrink9 opened this issue 1 year ago • 3 comments

The *matchPattern* family of functions currently return as many matches as possible. However, sometimes you just want to know whether there is any match between two sequences, so it is inefficient to continue searching after a match is found.

When searching a long list of sequences that are slightly longer than the pattern, vmatchPattern is about 2x slower than grepl(..., fixed=True), despite grepl having to translate to characters.

I need to eke out as much performance as possible because of the volume I am working with. It would be good to have an option, similar to grepRaw, to return early as soon as a match is found.

BlueDrink9 avatar Oct 31 '24 03:10 BlueDrink9

Sounds like a reasonable request and a good feature to have IMO.

Putting some timings here for reference:

library(Biostrings)
library(hgu95av2probe)
probes <- DNAStringSet(hgu95av2probe)

system.time(ok1 <- grepl("CATATG", hgu95av2probe$sequence, fixed=TRUE))
#    user  system elapsed 
#   0.005   0.000   0.005 

system.time(ok2 <- vcountPattern("CATATG", probes) != 0L)
#    user  system elapsed 
#   0.094   0.000   0.094 

identical(ok1, ok2)
# [1] TRUE

length(which(ok1))
# [1] 1086

library(microbenchmark)
microbenchmark(grepl("CATATG", hgu95av2probe$sequence, fixed=TRUE), vcountPattern("CATATG", probes) != 0L)
# Unit: milliseconds
#                                                   expr       min        lq
#  grepl("CATATG", hgu95av2probe$sequence, fixed = TRUE)  4.380226  4.564893
#                  vcountPattern("CATATG", probes) != 0L 90.140803 92.999982
#       mean   median        uq      max neval
#   5.095201  4.91384  5.397511  12.7959   100
#  94.830566 93.89611 94.740594 116.4534   100

Could be controlled via a first.only argument, set to FALSE to preserve the current behavior. Could be added to matchPattern()/vmatchPattern() and countPattern()/vcountPattern().

Note that the *PDict() family already supports something similar to this via the whichPDict()/vwhichPDict() functions but with the important difference that these functions won't return any information about the exact location of the first match (or any other match for that matter) in the subject sequence(s).

Any thoughts @ahl27 ?

Finally, and FWIW, a word of caution about grepRaw(all=TRUE). It's important to realize that even with all=TRUE, grepRaw() does not return all the matches:

pattern <- as.raw(c(5, 1, 5, 1))
x <- as.raw(c(99, 99, 99, 5, 1, 5, 1, 5, 1, 199))
grepRaw(pattern, x, fixed=TRUE, all=TRUE)
# [1] 4

The fact that it discards matches that overlap with a previous match is not mentioned in the documentation, which is unfortunate because this makes it unfit for searching most biological sequences.

H.

sessionInfo()

R Under development (unstable) (2024-10-22 r87265)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 23.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.5.r87265/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] microbenchmark_1.5.0 hgu95av2probe_2.18.0 AnnotationDbi_1.69.0
 [4] Biobase_2.67.0       Biostrings_2.75.0    GenomeInfoDb_1.43.0 
 [7] XVector_0.47.0       IRanges_2.41.0       S4Vectors_0.45.0    
[10] BiocGenerics_0.53.1  generics_0.1.3      

loaded via a namespace (and not attached):
 [1] crayon_1.5.3            vctrs_0.6.5             httr_1.4.7             
 [4] cli_3.6.3               rlang_1.1.4             DBI_1.2.3              
 [7] png_0.1-8               UCSC.utils_1.3.0        jsonlite_1.8.9         
[10] bit_4.5.0               KEGGREST_1.47.0         fastmap_1.2.0          
[13] memoise_2.0.1           compiler_4.5.0          RSQLite_2.3.7          
[16] blob_1.2.4              R6_2.5.1                GenomeInfoDbData_1.2.13
[19] tools_4.5.0             bit64_4.5.2             zlibbioc_1.53.0        
[22] cachem_1.1.0 

hpages avatar Oct 31 '24 22:10 hpages

Seems like a good idea to me. I'm thinking it might just be better to add a limit argument (like nrec in readXStringSet or nlines in read.table) to future proof, with the default set to 0 for unlimited returns. I could see a request in the future asking for the first 2/3/4/n entries as well, and that would also address that.

I can add this to my todos for this cycle, should be fairly easy to implement.

ahl27 avatar Oct 31 '24 22:10 ahl27

Yep, limit sounds good. Could also be used in the future to request the last 2/3/4/n entries by letting the user specify a negative value if there's interest.

Thanks!

hpages avatar Oct 31 '24 23:10 hpages