quanteda
Make ngram wrapper for patterns?
It seems that users face problems when they work with ngrams: https://stackoverflow.com/questions/46685498/remove-ngrams-with-leading-and-trailing-stopwords
Then how about making an ngram wrapper, similar to phrase(), that basically does this for users:
pattern = c(paste0("^", stopwords("english"), "_"), paste0("_", stopwords("english"), "$"))
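To make the effect of that pattern concrete, here is a minimal base R sketch. The short `stops` vector is a stand-in for stopwords("english"), and the ngram strings are made up for illustration; it only assumes "_" as the concatenator.

```r
# Stand-in for stopwords("english"); assumed here so the sketch needs no packages
stops <- c("the", "of", "a", "in")

# Same construction as above: leading stopword, or trailing stopword
pattern <- c(paste0("^", stops, "_"), paste0("_", stops, "$"))

# Made-up ngrams joined with "_"
ngrams <- c("the_united_states", "united_states", "states_of", "bank_of_england")

# Flag an ngram if any of the regular expressions matches it
drop <- grepl(paste(pattern, collapse = "|"), ngrams)
kept <- ngrams[!drop]
print(kept)
```

Only ngrams that start or end with a stopword are flagged; a stopword in the middle (as in "bank_of_england") is left alone.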
We have something similar as an internal function, currently used only by dfm_select, but it could be exported so that users can use it on tokens too.
## convert patterns (remove and select) to ngram regular expressions
make_ngram_pattern <- function(features, valuetype, concatenator) {
    if (valuetype == "glob") {
        # translate glob wildcards to their regex equivalents
        features <- stri_replace_all_regex(features, "\\*", ".*")
        features <- stri_replace_all_regex(features, "\\?", ".{1}")
    }
    # allow any number of concatenated grams before and after the feature
    features <- paste0("(\\b|(\\w+", concatenator, ")+)",
                       features, "(\\b|(", concatenator, "\\w+)+)")
    features
}
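For reference, a rough sketch of what such a pattern matches. `make_ngram_pattern_base` is a hypothetical base R re-implementation (gsub in place of the stringi calls) so that it runs without any packages; the feature strings are made up.

```r
# Hypothetical base R variant of the internal function above (not quanteda API)
make_ngram_pattern_base <- function(features, valuetype, concatenator) {
    if (valuetype == "glob") {
        features <- gsub("*", ".*", features, fixed = TRUE)
        features <- gsub("?", ".{1}", features, fixed = TRUE)
    }
    paste0("(\\b|(\\w+", concatenator, ")+)",
           features, "(\\b|(", concatenator, "\\w+)+)")
}

pat <- make_ngram_pattern_base("united", "fixed", "_")
feats <- c("united", "the_united_states", "reunited", "states_of")

# perl = TRUE so that \b and \w behave as in PCRE
matched <- grepl(pat, feats, perl = TRUE)
print(feats[matched])
```

The feature matches on its own or as one gram inside a longer ngram, but not as a substring of another word such as "reunited".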
I'm thinking a way to implement this would be through some modifier to pattern(), where we have a "mask" of logicals corresponding to the positions within each ngram. So in the above example, we would have
dfm_remove(x, pattern = phrase(stopwords("english"), mask = c(TRUE, FALSE, TRUE)))
or something like that. Since we store the concatenator in the attributes of x, we do not need the user to supply the character used to demarcate the individual "grams" of the ngram tokens or features.
This would also be very useful in filtering collocations.
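One possible reading of the mask idea, sketched in base R. Everything here is an assumption for illustration: `mask_to_regex` is a hypothetical helper, not an existing quanteda function, and it assumes a TRUE position means "flag the ngram if this gram is one of the words". It also assumes a single plain-character concatenator and words without regex metacharacters.

```r
# Hypothetical helper: turn a word list and a logical mask into regexes,
# one per TRUE position, for ngrams of exactly length(mask) grams
mask_to_regex <- function(words, mask, concatenator = "_") {
    any_gram <- paste0("[^", concatenator, "]+")          # any single gram
    alt <- paste0("(", paste(words, collapse = "|"), ")") # one of the words
    vapply(which(mask), function(i) {
        grams <- rep(any_gram, length(mask))
        grams[i] <- alt
        paste0("^", paste(grams, collapse = concatenator), "$")
    }, character(1))
}

stops <- c("the", "of", "a")  # stand-in for stopwords("english")
pats <- mask_to_regex(stops, c(TRUE, FALSE, TRUE))

ngrams <- c("the_united_states", "united_states_senate",
            "state_of_the", "member_of_congress")

# An ngram is flagged if any masked position holds one of the words
hit <- Reduce(`|`, lapply(pats, grepl, x = ngrams))
print(ngrams[!hit])
```

Because each regex is tied to length(mask), this version only handles ngrams of one fixed length.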
I like the mask idea, but we need to generalize it a bit more to allow selection of ngrams and collocations of different lengths. I will also think about it.