quanteda icon indicating copy to clipboard operation
quanteda copied to clipboard

Raise error for invalid filtering using corpus_subset()/tokens_subset()/dfm_subset()

Open stefan-mueller opened this issue 7 years ago • 3 comments

Describe the bug

When using one = for subsetting a corpus with corpus_subset, only a warning occurs. I think an error (including a comprehensible message) would be more appropriate (see the behaviour of filter() from dplyr below).

Reproducible code

library(quanteda)
#> Package version: 1.3.13

corp <- data_corpus_inaugural %>% 
    corpus_subset(President = "Obama")
#> Warning: President argument is not used in tryCatchList()

ndoc(corp)
#> [1] 58

corp2 <- data_corpus_inaugural %>% 
    corpus_subset(President == "Obama")

ndoc(corp2)
#> [1] 2

mtcars %>% 
    dplyr::filter(cyl = 2)
#> Error: `cyl` (`cyl = 2`) must not be named, do you need `==`?

Edit: of course, this problem also occurs with tokens_subset() and dfm_subset().

mydfm <- dfm(data_corpus_inaugural) %>% 
    dfm_subset(President = "Obama")
#> Warning: President argument is not used in tryCatchList()
nrow(mydfm)
#> [1] 58

toks <- tokens(data_corpus_inaugural) %>% 
    tokens_subset(President = "Obama")
#> Warning: President argument is not used in tryCatchList()

ndoc(toks)
#> [1] 58

stefan-mueller avatar Nov 04 '18 21:11 stefan-mueller

Agreed, the error could be better. I think we should mimic base::subset() here, e.g. test the value for a logical, and issue an error if not:

> set.seed(1)
> df <- data.frame(id = paste0("item", 1:10), 
+                  var = sample(1:2, size = 10, replace = TRUE))
> df
       id var
1   item1   1
2   item2   1
3   item3   2
4   item4   2
5   item5   1
6   item6   2
7   item7   2
8   item8   2
9   item9   2
10 item10   1
> subset(df, x = 1)
Error in subset.default(df, x = 1) : 'subset' must be logical

kbenoit avatar Nov 08 '18 11:11 kbenoit

This also affects, in different ways, tokens_subset() and dfm_subset().

thetoks <- tokens(data_corpus_inaugural)
thedfm <- dfm(thetoks)

ndoc(tokens_subset(thetoks, President = "Obama"))
# [1] 58
# Warning message:
# President argument is not used in tokens_subset.tokens() 
docnames(tokens_subset(thetoks, President == "Obama"))
# [1] "2009-Obama" "2013-Obama"

dfm_subset(thedfm, President = "Obama")
# Document-feature matrix of: 58 documents, 9,357 features (91.8% sparse).
# Warning message:
# President argument is not used in dfm_subset.dfm() 
dfm_subset(thedfm, President == "Obama")
# Document-feature matrix of: 2 documents, 9,357 features (91% sparse).

kbenoit avatar Nov 08 '18 11:11 kbenoit

I saw this issue and tried a really scrappy, rudimentary edit:

corpus_subset.corpus <- function(x, subset, select, ...) {
    unused_dots(...)

#### I added this
    e_char <- as.character(substitute(subset))
    if (!("==" %in% e_char)) stop("subset cannot be an assigned variable do you need ==?")
####

    r <- if (missing(subset)) {
        rep_len(TRUE, nrow(documents(x)))
    } else {
        e <- substitute(subset)
        r <- eval(e, documents(x), parent.frame())
        r & !is.na(r)
    }
    vars <- if (missing(select)) 
        TRUE
    else {
        nl <- as.list(seq_along(documents(x)))
        names(nl) <- names(documents(x))
        c(1, eval(substitute(select), nl, parent.frame()))
    }
    
    documents(x) <- documents(x)[r, vars, drop = FALSE]
    if (is.corpuszip(x)) {
        texts(x) <- texts(x)[r]
        x$docnames <- rownames(documents(x))
    }
    
    x
}

Ran it and it gave me the following output, but I don't think this is a super robust fix.

> corp <- data_corpus_inaugural %>% 
+     corpus_subset(President = "Obama")
Error in corpus_subset.corpus(., President = "Obama") :  subset cannot be an assigned variable do you need ==? 

Tried to go through the dplyr error handling code but its a maze. Happy to try help fix this with more direction though.

jiongweilua avatar Jan 09 '19 19:01 jiongweilua