stringi icon indicating copy to clipboard operation
stringi copied to clipboard

vectorize_all for stri_detect_*

Open zerweck opened this issue 3 years ago • 6 comments

I very much like that option in stri_replace_* and am wondering why stri_detect_* does not have it.

I have built a function that does it for me and adds the functionality to combine matches with a logical operator, but It would be great to access this with full C++ speed.

str_detect_multi <- function (string, multi_pattern, empty_is_true = T, logical_combine_infix = `&`) 
{
  if (is.null(multi_pattern) & empty_is_true) {
    return(rep(T, length(string)))
  }
  Reduce(logical_combine_infix, x = lapply(multi_pattern, 
    function(this_pattern) {
      stri_detect_regex(string, this_pattern)
    }))
}

zerweck avatar Sep 14 '20 12:09 zerweck

Good idea, but this can also be easily implemented with the outer() function + apply(). Hence, I'm not super convinced if it's really necessary to include yet another function. I'd wish to keep the stringi API simple and avoid duplication.

gagolews avatar Sep 15 '20 01:09 gagolews

To clarify: I was not suggesting using something like my hacky function but wondering if faster implementation in RCPP would make sense. I am also not sure what outer() might add to this, since it just combines all elements of two arrays (via tcrossprod), meaning it applies all patterns on each string and then you can combine them by a function - not too different from applying the patterns on the whole character vector and then running an aggregation function like my example. I would think that my variant is faster since it uses the fast vectorized approach for each single pattern on the whole string - or I might misunderstand your example. The inefficiency in both cases stems from using slow R loops to iterate over the patterns (or strings, whatever is shorter).

Let me state my argument a bit different: My use case is text processing of large character vectors to use in ML Models. If certain words of large wordlists appear in a text column, I can modify a feature column in my data.table. Therefore, using stri_detect_* in the i part of a data.table call combined with in-place modification := is extremely efficient even if you have billions of rows.

Since I use stri_replace_all_regex in a pre-processing step for a similar task but with the vectorize_all argument, I was wondering why that API is not used in the family of stri_detect_* functions - I would argue that this is even somewhat of an inconsistency.

zerweck avatar Sep 15 '20 07:09 zerweck

So in other words, you're advocating for a set of functions for:

  1. testing whether, for each string, there is a match to 1 of the patterns from a set of patterns?
  2. testing whether, each string matches all the patterns from a set of patterns?

Let's call them stri_match_any() and stri_match_all(), respectively. Anything else?

Could you provide some "emulated" examples - like virtual calls on some specific inputs and the desired outputs you'd like to see?

gagolews avatar Sep 16 '20 00:09 gagolews

Sure. I have added examples with word boundaries and stri_detect_regex calls that achieve the same results. For the second case (str_match_all) the regex needs additional operators to work independent of the order in which the patterns appear. As you know, chaining regex patterns like this gets very slow very fast.

fruit <- c("banana pineapple", "apple banana pear", "applebanana pear")

# Case 1
stri_match_any(str = fruit, patterns = c("\\bbanana\\","\\bapple\\b"))
[1]  TRUE  TRUE FALSE
# Same as:
stri_detect_regex(fruit, "(\\bbanana\\b)|(\\bapple\\b)")
[1]  TRUE  TRUE FALSE

# Case 2
stri_match_all(str = fruit, patterns = c("\\bpear\\b","\\bapple\\b"))
[1] FALSE  TRUE FALSE
# Same as
stri_detect_regex(fruit, "(?=.*\\bpear\\b)(?=.*\\bapple\\b)")
[1] FALSE  TRUE FALSE

zerweck avatar Sep 16 '20 07:09 zerweck

If this is only about searching for fixed patterns, possibly a Trie-like data structure could do the trick, especially if the number of patterns was large. Matching of whole words could be done using ICU's BreakIterator, internally.

At a first glance, I'm afraid that any other implementation will not be significantly more efficient than running stri_detect() as many as length(patterns) times and then combining the results with a cascade of &s or |s..., at least with regards to worst-case time complexity?

gagolews avatar Sep 16 '20 09:09 gagolews

I mean, I kind of like the idea of these functions, generally.

gagolews avatar Sep 16 '20 09:09 gagolews