Allow returning after only first match of pattern
The *matchPattern* family of functions currently return as many matches as possible. However, sometimes you just want to know whether there is any match between two sequences, so it is inefficient to continue searching after a match is found.
When searching a long list of sequences that are slightly longer than the pattern, vmatchPattern is about 2x slower than grepl(..., fixed=True), despite grepl having to translate to characters.
I need to eke out as much performance as possible because of the volume I am working with. It would be good to have an option, similar to grepRaw, to return early as soon as a match is found.
Sounds like a reasonable request and a good feature to have IMO.
Putting some timings here for reference:
library(Biostrings)
library(hgu95av2probe)
probes <- DNAStringSet(hgu95av2probe)
system.time(ok1 <- grepl("CATATG", hgu95av2probe$sequence, fixed=TRUE))
# user system elapsed
# 0.005 0.000 0.005
system.time(ok2 <- vcountPattern("CATATG", probes) != 0L)
# user system elapsed
# 0.094 0.000 0.094
identical(ok1, ok2)
# [1] TRUE
length(which(ok1))
# [1] 1086
library(microbenchmark)
microbenchmark(grepl("CATATG", hgu95av2probe$sequence, fixed=TRUE), vcountPattern("CATATG", probes) != 0L)
# Unit: milliseconds
# expr min lq
# grepl("CATATG", hgu95av2probe$sequence, fixed = TRUE) 4.380226 4.564893
# vcountPattern("CATATG", probes) != 0L 90.140803 92.999982
# mean median uq max neval
# 5.095201 4.91384 5.397511 12.7959 100
# 94.830566 93.89611 94.740594 116.4534 100
Could be controlled via a first.only argument, set to FALSE to preserve the current behavior. Could be added to matchPattern()/vmatchPattern() and countPattern()/vcountPattern().
Note that the *PDict() family already supports something similar to this via the whichPDict()/vwhichPDict() functions but with the important difference that these functions won't return any information about the exact location of the first match (or any other match for that matter) in the subject sequence(s).
Any thoughts @ahl27 ?
Finally, and FWIW, a word of caution about grepRaw(all=TRUE). It's important to realize that even with all=TRUE, grepRaw() does not return all the matches:
pattern <- as.raw(c(5, 1, 5, 1))
x <- as.raw(c(99, 99, 99, 5, 1, 5, 1, 5, 1, 199))
grepRaw(pattern, x, fixed=TRUE, all=TRUE)
# [1] 4
The fact that it discards matches that overlap with a previous match is not mentioned in the documentation, which is unfortunate because this makes it unfit for searching most biological sequences.
H.
sessionInfo()
R Under development (unstable) (2024-10-22 r87265)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 23.10
Matrix products: default
BLAS: /home/hpages/R/R-4.5.r87265/lib/libRblas.so
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/Los_Angeles
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] microbenchmark_1.5.0 hgu95av2probe_2.18.0 AnnotationDbi_1.69.0
[4] Biobase_2.67.0 Biostrings_2.75.0 GenomeInfoDb_1.43.0
[7] XVector_0.47.0 IRanges_2.41.0 S4Vectors_0.45.0
[10] BiocGenerics_0.53.1 generics_0.1.3
loaded via a namespace (and not attached):
[1] crayon_1.5.3 vctrs_0.6.5 httr_1.4.7
[4] cli_3.6.3 rlang_1.1.4 DBI_1.2.3
[7] png_0.1-8 UCSC.utils_1.3.0 jsonlite_1.8.9
[10] bit_4.5.0 KEGGREST_1.47.0 fastmap_1.2.0
[13] memoise_2.0.1 compiler_4.5.0 RSQLite_2.3.7
[16] blob_1.2.4 R6_2.5.1 GenomeInfoDbData_1.2.13
[19] tools_4.5.0 bit64_4.5.2 zlibbioc_1.53.0
[22] cachem_1.1.0
Seems like a good idea to me. I'm thinking it might just be better to add a limit argument (like nrec in readXStringSet or nlines in read.table) to future proof, with the default set to 0 for unlimited returns. I could see a request in the future asking for the first 2/3/4/n entries as well, and that would also address that.
I can add this to my todos for this cycle, should be fairly easy to implement.
Yep, limit sounds good. Could also be used in the future to request the last 2/3/4/n entries by letting the user specify a negative value if there's interest.
Thanks!