Raise error for invalid filtering using corpus_subset()/tokens_subset()/dfm_subset()
Describe the bug
When using one = for subsetting a corpus with corpus_subset, only a warning occurs. I think an error (including a comprehensible message) would be more appropriate (see the behaviour of filter() from dplyr below).
Reproducible code
library(quanteda)
#> Package version: 1.3.13
corp <- data_corpus_inaugural %>%
corpus_subset(President = "Obama")
#> Warning: President argument is not used in tryCatchList()
ndoc(corp)
#> [1] 58
corp2 <- data_corpus_inaugural %>%
corpus_subset(President == "Obama")
ndoc(corp2)
#> [1] 2
mtcars %>%
dplyr::filter(cyl = 2)
#> Error: `cyl` (`cyl = 2`) must not be named, do you need `==`?
Edit: of course, this problem also occurs with tokens_subset() and dfm_subset().
mydfm <- dfm(data_corpus_inaugural) %>%
dfm_subset(President = "Obama")
#> Warning: President argument is not used in tryCatchList()
nrow(mydfm)
#> [1] 58
toks <- tokens(data_corpus_inaugural) %>%
tokens_subset(President = "Obama")
#> Warning: President argument is not used in tryCatchList()
ndoc(toks)
#> [1] 58
Agreed, the error could be better. I think we should mimic base::subset() here, e.g. test the value for a logical, and issue an error if not:
> set.seed(1)
> df <- data.frame(id = paste0("item", 1:10),
+ var = sample(1:2, size = 10, replace = TRUE))
> df
id var
1 item1 1
2 item2 1
3 item3 2
4 item4 2
5 item5 1
6 item6 2
7 item7 2
8 item8 2
9 item9 2
10 item10 1
> subset(df, x = 1)
Error in subset.default(df, x = 1) : 'subset' must be logical
This also affects, in different ways, tokens_subset() and dfm_subset().
thetoks <- tokens(data_corpus_inaugural)
thedfm <- dfm(thetoks)
ndoc(tokens_subset(thetoks, President = "Obama"))
# [1] 58
# Warning message:
# President argument is not used in tokens_subset.tokens()
docnames(tokens_subset(thetoks, President == "Obama"))
# [1] "2009-Obama" "2013-Obama"
dfm_subset(thedfm, President = "Obama")
# Document-feature matrix of: 58 documents, 9,357 features (91.8% sparse).
# Warning message:
# President argument is not used in dfm_subset.dfm()
dfm_subset(thedfm, President == "Obama")
# Document-feature matrix of: 2 documents, 9,357 features (91% sparse).
I saw this issue and tried a really scrappy, rudimentary edit:
corpus_subset.corpus <- function(x, subset, select, ...) {
unused_dots(...)
#### I added this
e_char <- as.character(substitute(subset))
if (!("==" %in% e_char)) stop("subset cannot be an assigned variable do you need ==?")
####
r <- if (missing(subset)) {
rep_len(TRUE, nrow(documents(x)))
} else {
e <- substitute(subset)
r <- eval(e, documents(x), parent.frame())
r & !is.na(r)
}
vars <- if (missing(select))
TRUE
else {
nl <- as.list(seq_along(documents(x)))
names(nl) <- names(documents(x))
c(1, eval(substitute(select), nl, parent.frame()))
}
documents(x) <- documents(x)[r, vars, drop = FALSE]
if (is.corpuszip(x)) {
texts(x) <- texts(x)[r]
x$docnames <- rownames(documents(x))
}
x
}
Ran it and it gave me the following output, but I don't think this is a super robust fix.
> corp <- data_corpus_inaugural %>%
+ corpus_subset(President = "Obama")
Error in corpus_subset.corpus(., President = "Obama") : subset cannot be an assigned variable do you need ==?
Tried to go through the dplyr error handling code but its a maze. Happy to try help fix this with more direction though.