stringr
Unexpected behavior using `str_sub` when the start/end values fall below the minimum or above the maximum position of the string being subsetted
Dear tidyverse team,
I think I have found an unexpected behavior in `str_sub` that I want to report, because I didn't find anything like this in the issues section.
Imagine we have the following string:
library(stringr)
string_test <- "MEGUSTAJUGARBEISBOL"
I want to define a truncation site based on a substring (`"JUGAR"` in my example) and use that information to get the 5 letters before and after the truncation site. In this case, the truncation site would be just before the first "J", so I would expect the 5 letters after the truncation to be "JUGAR" and the 5 letters before it to be "GUSTA". This works properly in the first example, but not when the truncation site is closer to the beginning of `string_test`.
Hopefully I can illustrate this better with the two examples below.
Example 1: shows expected behavior (5 letters before and after properly extracted)
# example 1: truncation at the end of 'MEGUSTA'
peptide_test_1 <- "JUGAR"
str_locate(string_test, peptide_test_1)
     start end
[1,]     8  12
start_position <- str_locate(string_test, peptide_test_1)[, 1]
end_position <- str_locate(string_test, peptide_test_1)[, 2]
# 5 letters before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
[1] "GUSTA"
# 5 letters after truncation site
str_sub(string_test, start_position, start_position + 4)
[1] "JUGAR"
However, when the truncation site is at `start == 2` of `string_test`, I get an empty result instead of the letter at position 1. See the example code:
Example 2: truncation after first "M", shows unexpected behavior
# example 2: truncation after first "M"
peptide_test_2 <- "EGUSTA"
str_locate(string_test, peptide_test_2)
     start end
[1,]     2   7
start_position <- str_locate(string_test, peptide_test_2)[, 1]
end_position <- str_locate(string_test, peptide_test_2)[, 2]
# 5 letters before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
[1] ""
As you can see, I get `""` instead of `"M"`, which is the only letter before the truncation site. I would expect to get `"M"`, since it is the only letter available before that position.
I would define this as unexpected behavior, but please let me know if I am missing something.
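One possible explanation, offered as a sketch and not as a confirmed diagnosis: `str_sub` documents that negative positions count backwards from the last character. In Example 2, `start_position - 5` evaluates to `-3`, so the call may be read as "from the third-to-last character to position 1", which is an empty range. The snippet below reproduces that arithmetic and shows a workaround that clamps the start to 1. The helper name `flank_before` and its `width` argument are hypothetical and are not part of stringr.

```r
library(stringr)

string_test <- "MEGUSTAJUGARBEISBOL"

# Start of the match for "EGUSTA" (2 in this example).
start_position <- str_locate(string_test, "EGUSTA")[[1, "start"]]

# The unclamped start is 2 - 5 = -3. Assuming negative positions count from
# the end of the string, this requests a range that ends before it begins.
raw_start <- start_position - 5
before_raw <- str_sub(string_test, raw_start, start_position - 1)

# Hypothetical helper (not part of stringr): never let the start drop below 1,
# and return "" when there is nothing before the truncation site.
flank_before <- function(string, site, width = 5) {
  if (site <= 1) {
    return("")
  }
  str_sub(string, max(1, site - width), site - 1)
}

before_clamped <- flank_before(string_test, start_position)

print(raw_start)       # -3
print(before_raw)      # ""
print(before_clamped)  # "M"
```

With the clamp in place, the call for Example 1 (`flank_before(string_test, 8)`) still returns "GUSTA", so the workaround only changes results near the start of the string.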
Thank you very much in advance for taking the time to check this. I will be very happy to receive your feedback on this.
Best wishes, Miguel
Session info:
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tibble_3.2.1 tidyr_1.3.0 stringr_1.5.1 dplyr_1.1.2 purrr_1.0.1
[6] readr_2.1.4 here_1.0.1
loaded via a namespace (and not attached):
[1] crayon_1.5.2 vctrs_0.6.3 cli_3.6.1 rlang_1.1.1
[5] stringi_1.7.12 generics_0.1.3 seqinr_4.2-30 jsonlite_1.8.5
[9] glue_1.6.2 bit_4.0.5 rprojroot_2.0.4 hms_1.1.3
[13] fansi_1.0.4 MASS_7.3-60 tzdb_0.4.0 lifecycle_1.0.4
[17] compiler_4.3.2 Rcpp_1.0.11 pkgconfig_2.0.3 R6_2.5.1
[21] tidyselect_1.2.0 utf8_1.2.3 parallel_4.3.2 vroom_1.6.4
[25] pillar_1.9.0 magrittr_2.0.3 withr_2.5.2 tools_4.3.2
[29] bit64_4.0.5 ade4_1.7-22