datawizard
Programming with `datawizard::data_filter` and character vectors
In the context of my easystats/performance#443 PR, I ran into a difficulty using `datawizard::data_filter`, so I am moving the discussion here.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(datawizard)
With the tidyverse, the recommendation is to use double curly brackets when passing a variable name as an argument, e.g.,
fun.dp <- function(df, var) {
  head(filter(df, {{var}} > 0.5))
}
fun.dp(mtcars, am)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
In `datawizard`, a naively copy-pasted strategy does not seem to work:
fun.dw <- function(df, var) {
  head(data_filter(df, filter = {{var}} > 0.5))
}
fun.dw(mtcars, am)
#> Error in {: comparison (6) is possible only for atomic and list types
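For context on why the curly-curly syntax does not carry over: as far as I understand, `{{ }}` only has a special meaning inside tidy-eval functions. In plain R it parses as two nested `{` calls. A minimal base R sketch (the names `e` and `plain_braces` are only for illustration):

```r
# Outside of tidy evaluation, `{{ am }}` is just two nested `{` calls.
e <- quote({{ am }} > 0.5)
outer_fun <- as.character(e[[1]])       # the comparison operator
inner_fun <- as.character(e[[2]][[1]])  # the brace function

# In an ordinary function, the double braces simply return the value of `var`.
plain_braces <- function(var) {{ var }}
plain_braces(3)
```

So a function built on base R evaluation has no embracing operator to rely on and will evaluate or capture `var` like any other expression.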
In any case, my needs are more about passing a character vector. In `dplyr`, we can use double square brackets with the `.data` pronoun:
fun.dp <- function(df, var) {
  head(filter(df, .data[[var]] > 0.5))
}
fun.dp(mtcars, "am")
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
In `datawizard`, is there any way to achieve the same result?
fun.dw <- function(df, var) {
  head(data_filter(df, filter = var > 0.5))
}
fun.dw(mtcars, "am")
#> Error in var > 0.5: comparison (6) is possible only for atomic and list types
fun.dw <- function(df, var) {
  head(data_filter(df, filter = deparse(substitute(var)) > 0.5))
}
fun.dw(mtcars, "am")
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
#> Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
#> Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
# This doesn't throw an error but doesn't filter correctly
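One plausible reading of why the `deparse(substitute(var))` attempt runs but does not filter: the expression yields a string containing the name, not the column, so the comparison is a single string-versus-number comparison instead of a row-wise one. A small base R sketch, independent of `data_filter` internals (`show_deparse` is a hypothetical helper):

```r
# Hypothetical helper that only returns what deparse(substitute()) produces.
show_deparse <- function(var) deparse(substitute(var))

label <- show_deparse("am")
is.character(label)   # a string holding the quoted name, not the column values
length(label > 0.5)   # a single comparison result, not one per row
```

A length-one logical gets recycled over all rows, which would match the unfiltered output above.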
fun.dw <- function(df, var) {
  head(data_filter(df, filter = !!var > 0.5))
}
fun.dw(mtcars, "am")
#> Error in var > 0.5: comparison (6) is possible only for atomic and list types
fun.dw <- function(df, var) {
  head(data_filter(df, filter = .data[[var]] > 0.5))
}
fun.dw(mtcars, "am")
#> Error:
#> ! Can't subset `.data` outside of a data mask context.
fun.dw <- function(df, var) {
  head(data_filter(df, filter = !!!var > 0.5))
}
fun.dw(mtcars, "am")
#> Error in var > 0.5: comparison (6) is possible only for atomic and list types
Is there a way to avoid having to do this?
fun.dw <- function(df, var) {
  df$x <- df[[var]]
  df <- head(data_filter(df, filter = x > 0.5))
  data_remove(df, "x")
}
fun.dw(mtcars, "am")
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
From the documentation I saw that `data_filter` is based on `subset`, which itself has the following warning:
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like `[`, and in particular the non-standard evaluation of argument `subset` can have unanticipated consequences.
Is there a workaround or some way to convert from string to an unevaluated promised expression (or something like that, I don’t really know the language)?
Created on 2022-08-12 by the reprex package (v2.0.1)
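On the question of converting a string into an unevaluated expression: in base R this can be sketched with `as.name()` and `bquote()`, then handed to `subset()` through `do.call()`. This is only a sketch against base `subset()`, not against `data_filter()`; the helper name `filter_by_string` and the `cutoff` argument are made up for illustration.

```r
# Hypothetical helper: filter `df` on a column whose name is given as a string.
filter_by_string <- function(df, var, cutoff = 0.5) {
  # as.name() turns the string into a symbol; bquote() splices the symbol and
  # the cutoff into an unevaluated call such as `am > 0.5`.
  condition <- bquote(.(as.name(var)) > .(cutoff))
  # do.call() builds the call subset(df, subset = am > 0.5), so subset()'s
  # non-standard evaluation sees the constructed expression.
  do.call(subset, list(df, subset = condition))
}

res <- filter_by_string(mtcars, "am")
head(res)
```

The same `condition` object could presumably be passed on to other functions that capture their argument with `substitute()`, but I have not checked that against `data_filter()`.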
I'm not sure, maybe we could convert the argument to a string, and then replace any variable inside `{}` with the value of the specified variable. Something like this:
var <- "vs"
x <- "{var} != 0 & am != 1"
if (grepl("{", x, fixed = TRUE)) {
  variable <- gsub("(.*)\\{(.*)\\}(.*)", "\\2", x)
  # eval(variable) = "var"
  # str2lang("var") = var (a symbol)
  # eval(var) = "vs"
  x <- gsub("\\{(.*)\\}", eval(str2lang(eval(variable))), x)
  str(x)
}
#> chr "vs != 0 & am != 1"
do.call(subset, list(mtcars, subset = str2lang(x)))
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
The regexp needs to be fixed, because it doesn't correctly deal with more than one `{}` placeholder, like:
var1 <- "vs"
var2 <- "am"
x <- "{var1} != 0 & {var2} != 1"
Does anyone have an idea how to extract each string between `{}` without matching greedily up to the last `}`? @bwiernik?
# correct
x <- "{var1} != 0 & am != 1"
gsub("(.*)\\{(.*)\\}(.*)", "\\2", x)
#> [1] "var1"
# expected: var1 and var2
x <- "{var1} != 0 & {var2} != 1"
gsub("(.*)\\{(.*)\\}(.*)", "\\2", x)
#> [1] "var2"
Created on 2022-08-13 by the reprex package (v2.0.1)
Not sure if this helps
Is this the most elegant way?
x <- "{var1} != 0 & {var2} != 1"
vars <- gregexpr("[^{\\}]+(?=\\})", x, perl = TRUE)
l <- attributes(vars[[1]])$match.length
vars <- unlist(vars)
sapply(seq_along(vars), function(i) {
  substr(x, vars[i], vars[i] + l[i] - 1)
})
#> [1] "var1" "var2"
Created on 2022-08-13 by the reprex package (v2.0.1)
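A possibly shorter route for the extraction above is `regmatches()`, which returns the matched substrings directly. The sketch below also fills in each placeholder and passes the result to `subset()`; `fill_placeholders` is a hypothetical helper, and it assumes each `{name}` refers to a character variable visible in the calling environment.

```r
# Hypothetical helper: replace every {name} in `x` with the value of `name`.
fill_placeholders <- function(x, env = parent.frame()) {
  # [^{}]+ cannot run past a closing brace, so each placeholder matches separately.
  placeholders <- regmatches(x, gregexpr("\\{[^{}]+\\}", x, perl = TRUE))[[1]]
  for (p in placeholders) {
    name <- trimws(substr(p, 2, nchar(p) - 1))            # strip the braces
    x <- sub(p, get(name, envir = env), x, fixed = TRUE)  # insert the value
  }
  x
}

var1 <- "vs"
var2 <- "am"
x <- fill_placeholders("{var1} != 0 & {var2} != 1")
x
#> [1] "vs != 0 & am != 1"
filtered <- do.call(subset, list(mtcars, subset = str2lang(x)))
```

The negated character class does the work that a non-greedy quantifier would otherwise do, so no lookarounds are needed here.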
I'm not sure what your use case is, but maybe you could switch to `data_match()`, if you need a filter function for internal easystats use?
Or use `eval()` and `get()`:
mtcars[with(mtcars, eval(get("cyl") > 5)), ]
`data_match` seems to work well for an exactly equal value, but less so for greater than:
library(datawizard)
head(data_match(mtcars[c("mpg", "vs", "am")],
                data.frame(am = 1),
                match = "or"))
#> mpg vs am
#> Mazda RX4 21.0 0 1
#> Mazda RX4 Wag 21.0 0 1
#> Datsun 710 22.8 1 1
#> Fiat 128 32.4 1 1
#> Honda Civic 30.4 1 1
#> Toyota Corolla 33.9 1 1
head(data_match(mtcars[c("mpg", "vs", "am")],
                data.frame(am > 0),
                match = "or"))
#> Error in data.frame(am > 0): object 'am' not found
Created on 2022-08-13 by the reprex package (v2.0.1)
Or I'm not using it correctly. But I'm fine with base R because I had a simple use case (keeping outliers with a score greater than 0.5); I was just making an extra effort to integrate `datawizard` as much as possible. Here is what I've actually been using so far, e.g.:
x[x[[Outlier_method]] >= 0.5,]
# Translated would be:
head(mtcars[mtcars[["am"]] >= 0.5, ])
I agree that this issue is low priority though since base R works well.
Ok, I see. Then probably stick to base R.
> Or I'm not using it correctly.

Yes, `data_match` requires a "real" data frame with values that should be matched against, e.g.
head(data_match(mtcars[c("mpg", "vs", "am")],
                data.frame(mpg = 50:max(mtcars$mpg), am = 1),
                match = "or"))
or similar.