Unexpected behavior in `str_sub` when the start/end values fall outside the bounds of the string being subsetted

Open MiguelCos opened this issue 10 months ago • 0 comments

Dear tidyverse team,

I think I have found an unexpected behavior in str_sub that I want to report, because I didn't find anything like this in the issue section.

Imagine we have the following string:

library(stringr)

string_test <- "MEGUSTAJUGARBEISBOL"

I want to define a truncation site based on a substring ("JUGAR" in my example) and use it to get the 5 letters before and after the truncation site. Here the truncation site is just before the first "J", so I would expect the 5 letters after the truncation to be "JUGAR" and the 5 letters before it to be "GUSTA". This works properly in the 1st example, but it doesn't when the truncation site is closer to the beginning of string_test.

Hopefully I can illustrate this better with the two examples below.

Example 1: shows expected behavior (5 letters before and after properly extracted)

# example 1: truncation at the end of 'MEGUSTA' 
peptide_test_1 <- "JUGAR"

str_locate(string_test, peptide_test_1)
     start end
[1,]     8  12

start_position <- str_locate(string_test, peptide_test_1)[, 1]
end_position <- str_locate(string_test, peptide_test_1)[, 2]


# 5 letters before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
[1] "GUSTA"

# 5 letters after truncation site
str_sub(string_test, start_position, start_position + 4)
[1] "JUGAR"

However, when the 'truncation site' is at start == 2 of string_test, I get an empty result instead of the expected letter at position 1. See the example code:

Example 2: truncation after first "M", shows unexpected behavior

# example 2: truncation after first "M"
peptide_test_2 <- "EGUSTA"

str_locate(string_test, peptide_test_2)
     start end
[1,]     2   7

start_position <- str_locate(string_test, peptide_test_2)[, 1]
end_position <- str_locate(string_test, peptide_test_2)[, 2]

# 5 letters before truncation site
str_sub(string_test, start_position - 5, start_position - 1)
[1] ""

As you can see, I get "" instead of "M", which is the only letter before the 'truncation site'. I would expect to get "M" if it is the only letter before my 'truncation site'.

I would define this as unexpected behavior, but please let me know if I am missing something.
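One possible explanation, going by the `str_sub` documentation: negative `start`/`end` values are counted backwards from the end of the string. In example 2, `start_position - 5` evaluates to -3, so the requested range would run from the third-to-last character to position 1, which is empty. The sketch below illustrates that reading and one workaround, clamping the window start to 1. The names `window_start` and `before` are introduced here for illustration only, and this is not necessarily how the maintainers would recommend handling it.

```r
library(stringr)

string_test <- "MEGUSTAJUGARBEISBOL"
start_position <- str_locate(string_test, "EGUSTA")[, "start"]  # 2

# Negative positions count from the end of the string,
# so -3 refers to the third-to-last character, not "before position 1".
str_sub(string_test, -3, -1)
#> [1] "BOL"

# Workaround (assumption: letters before position 1 should simply be dropped):
# clamp the window start so it never goes below 1.
window_start <- max(start_position - 5, 1)
before <- str_sub(string_test, window_start, start_position - 1)
before
#> [1] "M"
```

For a vector of start positions, `pmax(start_position - 5, 1)` would do the same clamping element-wise.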

Thank you very much in advance for taking the time to check this. I will be very happy to receive your feedback on this.

Best wishes, Miguel

Session info:

R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] tibble_3.2.1  tidyr_1.3.0   stringr_1.5.1 dplyr_1.1.2   purrr_1.0.1  
[6] readr_2.1.4   here_1.0.1

loaded via a namespace (and not attached):
 [1] crayon_1.5.2     vctrs_0.6.3      cli_3.6.1        rlang_1.1.1
 [5] stringi_1.7.12   generics_0.1.3   seqinr_4.2-30    jsonlite_1.8.5
 [9] glue_1.6.2       bit_4.0.5        rprojroot_2.0.4  hms_1.1.3
[13] fansi_1.0.4      MASS_7.3-60      tzdb_0.4.0       lifecycle_1.0.4
[17] compiler_4.3.2   Rcpp_1.0.11      pkgconfig_2.0.3  R6_2.5.1
[21] tidyselect_1.2.0 utf8_1.2.3       parallel_4.3.2   vroom_1.6.4
[25] pillar_1.9.0     magrittr_2.0.3   withr_2.5.2      tools_4.3.2
[29] bit64_4.0.5      ade4_1.7-22

MiguelCos, Apr 17 '24 09:04