quanteda Make ngram wrapper for patterns?

It seems that users face problems when they work with ngrams: https://stackoverflow.com/questions/46685498/remove-ngrams-with-leading-and-trailing-stopwords

How about making an ngram wrapper, similar to phrase(), that basically does this for users:

pattern = c(paste0("^", stopwords("english"), "_"),  paste0("_", stopwords("english"), "$"))
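To illustrate what these patterns do, here is a minimal base R sketch. It assumes a tiny hand-written stopword list (`stops`) in place of stopwords("english"), a made-up vector of ngrams, and "_" as the concatenator; none of these names come from quanteda.

```r
# Hypothetical stand-in for stopwords("english")
stops <- c("the", "of", "a")

# Made-up ngrams joined with "_"
ngrams <- c("the_united", "united_states", "states_of",
            "of_america", "bank_of_england")

# Same construction as above: leading and trailing stopword patterns
pattern <- c(paste0("^", stops, "_"), paste0("_", stops, "$"))

# Collapse into one alternation and drop every ngram that matches
regex <- paste(pattern, collapse = "|")
kept <- ngrams[!grepl(regex, ngrams)]
print(kept)
```

Only ngrams that start or end with a stopword are dropped; an ngram with a stopword in the middle, such as "bank_of_england", is kept.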

We have something similar as an internal function, used only by dfm_select, but it could be exported so that users can use it on tokens too.

## convert patterns (remove and select) to ngram regular expressions
make_ngram_pattern <- function(features, valuetype, concatenator) {
    if (valuetype == "glob") {
        features <- stri_replace_all_regex(features, "\\*", ".*")
        features <- stri_replace_all_regex(features, "\\?", ".{1}")
    }
    features <- paste0("(\\b|(\\w+", concatenator, ")+)", 
                       features, "(\\b|(", concatenator, "\\w+)+)")
    features
}
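To show what kind of regular expression this produces, here is a sketch that re-expresses the function with base R's gsub() instead of stringi. The name `make_ngram_pattern_base` and the test strings are assumptions for illustration only, and the matching is done with an unanchored grepl(), which may differ from how dfm_select applies the pattern internally.

```r
# Base R re-expression of the function above (hypothetical name)
make_ngram_pattern_base <- function(features, valuetype, concatenator) {
    if (valuetype == "glob") {
        # translate glob wildcards into regex equivalents
        features <- gsub("\\*", ".*", features)
        features <- gsub("\\?", ".{1}", features)
    }
    # allow the feature to be preceded or followed by other grams
    paste0("(\\b|(\\w+", concatenator, ")+)",
           features, "(\\b|(", concatenator, "\\w+)+)")
}

pat <- make_ngram_pattern_base("of", "fixed", "_")
print(pat)

# "of" as a whole gram matches; "of" inside "office" does not
hits <- grepl(pat, c("bank_of_england", "office_space", "of"), perl = TRUE)
print(hits)
```

The wrapping groups are what stop a feature from matching inside a longer word: the feature must sit at a word boundary or be directly adjacent to the concatenator.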

Oct 12 '17 08:10 koheiw

I'm thinking that one way to implement this would be through a modifier to pattern(), with a "mask" of logicals corresponding to the positions within each ngram. So in the above example, we would have

dfm_remove(x, pattern = phrase(stopwords("english"), mask = c(TRUE, FALSE, TRUE)))

or something like that. Since we store the concatenator in the attributes of x, the user does not need to supply the character used to demarcate the individual "grams" of the ngram tokens or features.

This would also be very useful in filtering collocations.
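As a rough sketch of the mask semantics, assuming the mask marks the positions at which a stopword disqualifies an ngram: the helper `match_masked` below is hypothetical and not part of quanteda, the stopword list and trigrams are made up, and the sketch only handles ngrams whose length equals the length of the mask.

```r
# Hypothetical helper: TRUE where a word occurs at a masked position
match_masked <- function(ngrams, words, mask, concatenator = "_") {
    parts <- strsplit(ngrams, concatenator, fixed = TRUE)
    vapply(parts, function(p) {
        # ngrams of a different length than the mask are never matched
        length(p) == length(mask) && any(p[mask] %in% words)
    }, logical(1))
}

stops <- c("the", "of", "a")  # stand-in for stopwords("english")
trigrams <- c("the_bank_of", "bank_of_england",
              "of_the_people", "house_of_cards")

# Check only the first and last gram of each trigram
hit <- match_masked(trigrams, stops, mask = c(TRUE, FALSE, TRUE))
kept <- trigrams[!hit]
print(kept)
```

Because ngrams of other lengths are simply ignored here, a single fixed-length mask does not cover mixed-length ngrams or collocations.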

Oct 12 '17 09:10 kbenoit

I like the mask idea, but we need to generalize it a bit more to allow selection of ngrams and collocations of different lengths. I will also think about it.

Oct 12 '17 12:10 koheiw
