stringi
stringi copied to clipboard
vectorize_all for stri_detect_*
I very much like that option in stri_replace_*
and am wondering why stri_detect_*
does not have it.
I have built a function that does it for me and adds the functionality to combine matches with a logical operator, but It would be great to access this with full C++ speed.
str_detect_multi <- function (string, multi_pattern, empty_is_true = T, logical_combine_infix = `&`)
{
if (is.null(multi_pattern) & empty_is_true) {
return(rep(T, length(string)))
}
Reduce(logical_combine_infix, x = lapply(multi_pattern,
function(this_pattern) {
stri_detect_regex(string, this_pattern)
}))
}
Good idea, but this can also be easily implemented with the outer()
function + apply()
.
Hence, I'm not super convinced if it's really necessary to include yet another function.
I'd wish to keep the stringi API simple and avoid duplication.
To clarify: I was not suggesting using something like my hacky function but wondering if faster implementation in RCPP would make sense. I am also not sure what outer()
might add to this, since it just combines all elements of two arrays (via tcrossprod
), meaning it applies all patterns on each string and then you can combine them by a function - not too different from applying the patterns on the whole character vector and then running an aggregation function like my example. I would think that my variant is faster since it uses the fast vectorized approach for each single pattern on the whole string - or I might misunderstand your example. The inefficiency in both cases stems from using slow R loops to iterate over the patterns (or strings, whatever is shorter).
Let me state my argument a bit different: My use case is text processing of large character vectors to use in ML Models. If certain words of large wordlists appear in a text column, I can modify a feature column in my data.table. Therefore, using stri_detect_*
in the i
part of a data.table call combined with in-place modification :=
is extremely efficient even if you have billions of rows.
Since I use stri_replace_all_regex
in a pre-processing step for a similar task but with the vectorize_all
argument, I was wondering why that API is not used in the family of stri_detect_*
functions - I would argue that this is even somewhat of an inconsistency.
So in other words, you're advocating for a set of functions for:
- testing whether, for each string, there is a match to 1 of the patterns from a set of patterns?
- testing whether, each string matches all the patterns from a set of patterns?
Let's call them stri_match_any()
and stri_match_all()
, respectively.
Anything else?
Could you provide some "emulated" examples - like virtual calls on some specific inputs and the desired outputs you'd like to see?
Sure. I have added examples with word boundaries and stri_detect_regex
calls that achieve the same results. For the second case (str_match_all
) the regex needs additional operators to work independent of the order in which the patterns appear.
As you know, chaining regex patterns like this gets very slow very fast.
fruit <- c("banana pineapple", "apple banana pear", "applebanana pear")
# Case 1
stri_match_any(str = fruit, patterns = c("\\bbanana\\","\\bapple\\b"))
[1] TRUE TRUE FALSE
# Same as:
stri_detect_regex(fruit, "(\\bbanana\\b)|(\\bapple\\b)")
[1] TRUE TRUE FALSE
# Case 2
stri_match_all(str = fruit, patterns = c("\\bpear\\b","\\bapple\\b"))
[1] FALSE TRUE FALSE
# Same as
stri_detect_regex(fruit, "(?=.*\\bpear\\b)(?=.*\\bapple\\b)")
[1] FALSE TRUE FALSE
If this is only about searching for fixed patterns, possibly a Trie-like data structure could do the trick, especially if the number of patterns was large. Matching of whole words could be done using ICU's BreakIterator, internally.
At a first glance, I'm afraid that any other implementation will not be significantly more efficient than running stri_detect()
as many as length(patterns)
times and then combining the results with a cascade of &
s or |
s..., at least with regards to worst-case time complexity?
I mean, I kind of like the idea of these functions, generally.