Make ngram wrapper for patterns?

Open koheiw opened this issue 7 years ago • 2 comments

It seems that users face problems when they work with ngrams: https://stackoverflow.com/questions/46685498/remove-ngrams-with-leading-and-trailing-stopwords

So how about making an ngram wrapper, similar to phrase(), that basically does this for users:

pattern = c(paste0("^", stopwords("english"), "_"),  paste0("_", stopwords("english"), "$"))
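
To make the effect of that pattern concrete, here is a minimal base R sketch. It assumes "_" as the concatenator and uses a tiny hypothetical stopword list (`stops`) in place of stopwords("english"); the example ngrams are made up for illustration.

```r
# Hypothetical mini stopword list standing in for stopwords("english")
stops <- c("the", "of", "a")

# Same construction as above: stopword at the start or at the end of an ngram
pattern <- c(paste0("^", stops, "_"), paste0("_", stops, "$"))

# Made-up ngrams joined with "_"
ngrams <- c("the_united_states", "united_states", "states_of", "house_of_commons")

# TRUE where an ngram begins or ends with one of the stopwords
hit <- grepl(paste(pattern, collapse = "|"), ngrams)

# Keep only ngrams without a leading or trailing stopword
kept <- ngrams[!hit]
print(kept)
```

Note that "house_of_commons" is kept, because the stopword sits in the middle. On a real tokens object, the same character vector could presumably be passed to tokens_remove() with valuetype = "regex".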

We have something similar as an internal function for dfm_select only, but it could be exported so that users can use it on tokens too.

## convert patterns (remove and select) to ngram regular expressions
make_ngram_pattern <- function(features, valuetype, concatenator) {
    if (valuetype == "glob") {
        # translate glob wildcards into their regex equivalents
        features <- stringi::stri_replace_all_regex(features, "\\*", ".*")
        features <- stringi::stri_replace_all_regex(features, "\\?", ".{1}")
    }
    features <- paste0("(\\b|(\\w+", concatenator, ")+)", 
                       features, "(\\b|(", concatenator, "\\w+)+)")
    features
}
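
For readers who want to see what this helper generates, here is a base R sketch of the same logic. The name `make_ngram_pattern_sketch` is hypothetical, gsub() stands in for the stringi calls, and matching is done with perl = TRUE, which is an assumption about how the resulting expression would be applied.

```r
# Base R approximation of the internal helper above (not the quanteda code itself)
make_ngram_pattern_sketch <- function(features, valuetype, concatenator) {
    if (valuetype == "glob") {
        # translate glob wildcards literally: * -> .*  and  ? -> .{1}
        features <- gsub("*", ".*", features, fixed = TRUE)
        features <- gsub("?", ".{1}", features, fixed = TRUE)
    }
    # allow any number of grams before and after the feature
    paste0("(\\b|(\\w+", concatenator, ")+)",
           features, "(\\b|(", concatenator, "\\w+)+)")
}

rx <- make_ngram_pattern_sketch("of", "glob", "_")
print(rx)

# Made-up features: the pattern matches "of" as a whole gram in any position
feats <- c("of", "house_of", "house_of_commons", "office", "roof_top")
matched <- grepl(rx, feats, perl = TRUE)
print(feats[matched])
```

In this sketch "office" and "roof_top" are not matched, since "of" only appears there as part of a longer gram. The concatenator is pasted in unescaped, so a concatenator that is a regex metacharacter would need escaping first.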

koheiw, Oct 12 '17 08:10

I'm thinking that one way to implement this would be through some modifier to pattern(), where we have a "mask" of logicals, one for each gram in the ngram. So in the above example, we would have

dfm_remove(x, pattern = phrase(stopwords("english"), mask = c(TRUE, FALSE, TRUE)))

or something like that. Since we store concatenator in the attributes of x, we do not need the user to supply the character used to demarcate the individual "grams" of the ngram tokens or features.

This would also be very useful in filtering collocations.
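
A rough base R sketch of how such a mask could be interpreted, purely as an illustration: the function `match_masked` is hypothetical, the `mask` argument to phrase() does not exist yet, and "_" is assumed as the concatenator. Here an ngram matches when any of its masked positions holds one of the given words.

```r
# Hypothetical helper: TRUE for ngrams that have one of `words`
# in a position where `mask` is TRUE
match_masked <- function(ngrams, words, mask, concatenator = "_") {
    parts <- strsplit(ngrams, concatenator, fixed = TRUE)
    vapply(parts, function(p) {
        # only ngrams with as many grams as the mask are considered
        length(p) == length(mask) && any(p[mask] %in% words)
    }, logical(1))
}

# Made-up trigrams and one bigram
ngrams <- c("the_united_states", "united_states_of", "bank_of_england", "united_states")
drop <- match_masked(ngrams, words = c("the", "of"), mask = c(TRUE, FALSE, TRUE))
print(ngrams[!drop])
```

With mask = c(TRUE, FALSE, TRUE), "bank_of_england" survives because the stopword is in the unmasked middle position. The sketch simply ignores ngrams whose length differs from the mask, so it does not address ngrams or collocations of mixed lengths.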

kbenoit, Oct 12 '17 09:10

I like the mask idea, but we need to generalize it a bit more to allow selection of ngrams and collocations of different lengths. I will also think about it.

koheiw, Oct 12 '17 12:10