Programming with `datawizard::data_filter` and character vectors

Open rempsyc opened this issue 1 year ago • 7 comments

In the context of my easystats/performance#443 PR, I’ve experienced a difficulty using datawizard::data_filter, so I am moving the discussion here.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(datawizard)

With the tidyverse, the recommendation is to use double curly brackets when passing a variable name as argument, e.g.,

fun.dp <- function(df, var) {
  head(filter(df, {{var}} > 0.5))
}
fun.dp(mtcars, am)
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

In datawizard, a naively copy-pasted strategy does not seem to work:

fun.dw <- function(df, var) {
  head(data_filter(df, filter = {{var}} > 0.5))
}
fun.dw(mtcars, am)
#> Error in {: comparison (6) is possible only for atomic and list types

In any case, what I actually need is to pass the variable name as a character string. In dplyr, this works with double square brackets on the .data pronoun:

fun.dp <- function(df, var) {
  head(filter(df, .data[[var]] > 0.5))
}
fun.dp(mtcars, "am")
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

In datawizard, is there any way to achieve the same result?

fun.dw <- function(df, var) {
  head(data_filter(df, filter = var > 0.5))
}
fun.dw(mtcars, "am")
#> Error in var > 0.5: comparison (6) is possible only for atomic and list types

fun.dw <- function(df, var) {
  head(data_filter(df, filter = deparse(substitute(var)) > 0.5))
}
fun.dw(mtcars, "am")
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# This doesn't throw an error, but it doesn't filter either: rows with am == 0 are still returned

fun.dw <- function(df, var) {
  head(data_filter(df, filter = !!var > 0.5))
}
fun.dw(mtcars, "am")
#> Error in var > 0.5: comparison (6) is possible only for atomic and list types

fun.dw <- function(df, var) {
  head(data_filter(df, filter = .data[[var]] > 0.5))
}
fun.dw(mtcars, "am")
#> Error:
#> ! Can't subset `.data` outside of a data mask context.

fun.dw <- function(df, var) {
  head(data_filter(df, filter = !!!var > 0.5))
}
fun.dw(mtcars, "am")
#> Error in var > 0.5: comparison (6) is possible only for atomic and list types

Is there a way to avoid having to do this?

fun.dw <- function(df, var) {
  df$x <- df[[var]]
  df <- head(data_filter(df, filter = x > 0.5))
  data_remove(df, "x")
}
fun.dw(mtcars, "am")
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

From the documentation I saw that data_filter is based on subset, which itself has the following warning:

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
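To make that warning concrete, here is a minimal base-R sketch of one such "unanticipated consequence". The helper names `keep_above_nse` and `keep_above_std` are hypothetical and chosen only for illustration; the point is that an argument sharing its name with a column is silently shadowed by that column inside `subset()`.

```r
# Hypothetical helper: the caller wants rows whose mpg exceeds a cutoff,
# and the cutoff argument happens to be called `hp`.
keep_above_nse <- function(df, hp) {
  # subset() looks up `hp` in df first, so this compares mpg with the hp COLUMN
  subset(df, mpg > hp)
}

# Same intent with standard subsetting: `hp` can only mean the argument
keep_above_std <- function(df, hp) {
  df[df[["mpg"]] > hp, , drop = FALSE]
}

nrow(keep_above_nse(mtcars, 20))
#> [1] 0
nrow(keep_above_std(mtcars, 20))
#> [1] 14
```

No error is raised in the first case, which is what makes this kind of bug easy to miss.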

Is there a workaround or some way to convert from string to an unevaluated promised expression (or something like that, I don’t really know the language)?
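For reference, one base-R way to go from a string to an unevaluated expression is `as.name()` plus `bquote()`. The sketch below is not a datawizard feature; `filter_gt` is a hypothetical helper that builds the call `am > 0.5` from the string `"am"` and evaluates it inside the data frame.

```r
# Hypothetical helper: filter rows where the column named by `var` exceeds `cutoff`
filter_gt <- function(df, var, cutoff = 0.5) {
  stopifnot(is.character(var), length(var) == 1L, var %in% names(df))
  # as.name() turns the string into a symbol; bquote() splices it into a call
  cond <- bquote(.(as.name(var)) > .(cutoff))  # e.g. am > 0.5
  # Evaluate the call with the data frame's columns in scope
  keep <- eval(cond, envir = df, enclos = baseenv())
  # which() drops NA results instead of producing all-NA rows
  df[which(keep), , drop = FALSE]
}

head(filter_gt(mtcars, "am"))
```

Because the symbol is built from the string before evaluation, the function's own argument names (`var`, `cutoff`) never take part in the lookup.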

Created on 2022-08-12 by the reprex package (v2.0.1)

rempsyc avatar Aug 13 '22 00:08 rempsyc

I'm not sure. Maybe we could convert the argument to a string and then replace any variable inside {} with the value of that variable. Something like this:

var <- "vs"
x <- "{var} != 0 & am != 1"
if (grepl("{", x, fixed = TRUE)) {
  variable <- gsub("(.*)\\{(.*)\\}(.*)", "\\2", x)
  # eval(variable) = "var"
  # str2lang("var") = var (type language)
  # eval(var) = "vs"
  x <- gsub("\\{(.*)\\}", eval(str2lang(eval(variable))), x)
  str(x)
}
#>  chr "vs != 0 & am != 1"

do.call(subset, list(mtcars, subset = str2lang(x)))
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1

The regexp still needs to be fixed, because it doesn't correctly handle more than one {} placeholder, like:

var1 <- "vs"
var2 <- "am"
x <- "{var1} != 0 & {var2} != 1"

Does anyone have an idea how to extract each string between {} without matching all the way to the last }? @bwiernik?

# correct
x <- "{var1} != 0 & am != 1"
gsub("(.*)\\{(.*)\\}(.*)", "\\2", x)
#> [1] "var1"

# expected: var1 and var2
x <- "{var1} != 0 & {var2} != 1"
gsub("(.*)\\{(.*)\\}(.*)", "\\2", x)
#> [1] "var2"

Created on 2022-08-13 by the reprex package (v2.0.1)

strengejacke avatar Aug 13 '22 09:08 strengejacke

Not sure if this helps

[Image: Screenshot 2022-08-13 at 12 55 47]

IndrajeetPatil avatar Aug 13 '22 10:08 IndrajeetPatil

Is this the most elegant way?

x <- "{var1} != 0 & {var2} != 1"
vars <- gregexpr("[^{\\}]+(?=\\})", x, perl = TRUE)
l <- attributes(vars[[1]])$match.length
vars <- unlist(vars)
sapply(seq_along(vars), function(i) {
  substr(x, vars[i], vars[i] + l[i] - 1)
})
#> [1] "var1" "var2"

Created on 2022-08-13 by the reprex package (v2.0.1)
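As a possible simplification of the `gregexpr()` approach above, base R's `regmatches()` can both extract and replace the matches. This is only a sketch under the assumption that placeholders never nest and that each name inside braces refers to an existing character variable (`var1` and `var2` here).

```r
var1 <- "vs"
var2 <- "am"
x <- "{var1} != 0 & {var2} != 1"

# "[^{}]+" cannot cross a brace, so each placeholder is matched on its own
m <- gregexpr("\\{[^{}]+\\}", x)
placeholders <- regmatches(x, m)[[1]]        # "{var1}" "{var2}"
var_names <- gsub("[{}]", "", placeholders)  # "var1" "var2"

# Look up the value of each variable and substitute it for its placeholder
values <- vapply(var_names, function(v) get(v), character(1), USE.NAMES = FALSE)
regmatches(x, m) <- list(values)
x
#> [1] "vs != 0 & am != 1"

# The resulting string can then be parsed and passed to subset(), as before
res <- do.call(subset, list(mtcars, subset = str2lang(x)))
```

The negated character class avoids the greedy `.*` problem without needing `perl = TRUE` or a lookahead.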

strengejacke avatar Aug 13 '22 11:08 strengejacke

I'm not sure what your use case is, but maybe you could switch to data_match(), if you need a filter-function for internal easystats-use?

strengejacke avatar Aug 13 '22 20:08 strengejacke

Or use eval() and get()

# with() returns the logical vector; within() would return a data frame
mtcars[with(mtcars, eval(get("cyl") > 5)),]

bwiernik avatar Aug 13 '22 21:08 bwiernik

data_match seems to work well for an exactly equal value, but less so for greater than:

library(datawizard)
head(data_match(mtcars[c("mpg", "vs", "am")],
                data.frame(am = 1),
                match = "or"))
#>                 mpg vs am
#> Mazda RX4      21.0  0  1
#> Mazda RX4 Wag  21.0  0  1
#> Datsun 710     22.8  1  1
#> Fiat 128       32.4  1  1
#> Honda Civic    30.4  1  1
#> Toyota Corolla 33.9  1  1

head(data_match(mtcars[c("mpg", "vs", "am")],
                data.frame(am > 0),
                match = "or"))
#> Error in data.frame(am > 0): object 'am' not found

Created on 2022-08-13 by the reprex package (v2.0.1)

Or I'm not using it correctly. Either way, I'm fine with base R because my use case is simple (keeping outliers with a score greater than 0.5); I was just making an extra effort to use datawizard wherever possible. Here is what I've actually been using so far:

x[x[[Outlier_method]] >= 0.5,]
# Translated would be:
head(mtcars[mtcars[["am"]] >= 0.5,])

I agree that this issue is low priority though since base R works well.
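For completeness, a small sketch of that base-R pattern wrapped in a function. `keep_at_least` and the toy `scores` data are hypothetical; the only change from the one-liner is `which()`, which assumes rows with a missing score should be dropped rather than returned as all-NA rows.

```r
# Hypothetical helper: keep rows where the column named by `var` is >= cutoff
keep_at_least <- function(df, var, cutoff = 0.5) {
  stopifnot(is.character(var), length(var) == 1L, var %in% names(df))
  # which() converts the logical vector to row indices and skips NA
  df[which(df[[var]] >= cutoff), , drop = FALSE]
}

# Toy data with one missing score
scores <- data.frame(id = 1:4, outlier_score = c(0.9, NA, 0.2, 0.5))
kept <- keep_at_least(scores, "outlier_score")
kept$id  # rows 1 and 4 are kept; the NA row is dropped
```

With plain logical indexing, `scores[scores[["outlier_score"]] >= 0.5, ]` would instead include an extra row of `NA`s for the missing score.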

rempsyc avatar Aug 13 '22 21:08 rempsyc

Ok, I see. Then probably stick to base R.

> Or I'm not using it correctly.

Yes, data_match requires a "real" data frame with values that should be matched against, e.g.

head(data_match(mtcars[c("mpg", "vs", "am")],
                data.frame(mpg = 50:max(mtcars$mpg), am = 1),
                match = "or"))

or similar.

strengejacke avatar Aug 13 '22 22:08 strengejacke