datawizard `data_filter`: Add support for loop indices within functions?

Still within #301, I wonder if it would make sense to add support for loop indices within functions for data_filter, @etiennebacher?

library(datawizard)

df1 <- data.frame(
  id = c(1, 2, 3, 1, 3),
  item1 = c(NA, 1, 1, 2, 3),
  item2 = c(NA, 1, 1, 2, 3),
  item3 = c(NA, 1, 1, 2, 3)
)

# Attempt 1
fun <- function(data, id) {
  min.index <- NULL
  for (i in unique(data[[id]])) {
    min.index[i] <- 2
    x <- data_filter(data, item3 == min.index[i])
  }
  x
}
fun(df1, id = "id")
#> Error: Filtering did not work. Please check the syntax of your `filter`
#>   argument.

# Attempt 2, using quotes
fun <- function(data, id) {
  min.index <- NULL
  for (i in unique(data[[id]])) {
    min.index[i] <- 2
    x <- data_filter(data, "item3 == min.index[i]")
  }
  x
}
fun(df1, id = "id")
#> Error: Filtering did not work. Please check the syntax of your `filter`
#>   argument.

# Attempt 3, using curly brackets
fun <- function(data, id) {
  min.index <- NULL
  for (i in unique(data[[id]])) {
    min.index[i] <- 2
    x <- data_filter(data, item3 == min.index[{i}])
  }
  x
}
fun(df1, id = "id")
#> Error: Filtering did not work. Please check the syntax of your `filter`
#>   argument.

# Workaround is to create the index manually first
fun <- function(data, id) {
  min.index <- NULL
  for (i in unique(data[[id]])) {
    min.index[i] <- 2
    index <- which(data$item3 == min.index[i])
    x <- data_filter(data, index)
  }
  x
}
fun(df1, id = "id")
#>   id item1 item2 item3
#> 4  1     2     2     2

Created on 2022-11-05 with reprex v2.0.2

Nov 06 '22 01:11 rempsyc

The problem is that data_filter() tries to evaluate the condition directly, whereas here we would like to first evaluate min.index[i] to get its value, and then filter based on this value.
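
To make the "evaluate the value first, then filter" idea concrete, here is a minimal base-R sketch. It is not datawizard code: `filter_by_value()` and its arguments are hypothetical names used only for illustration, and the data is a reduced version of `df1`.

```r
df1 <- data.frame(
  id = c(1, 2, 3, 1, 3),
  item3 = c(NA, 1, 1, 2, 3)
)

# Hypothetical helper (an assumption for illustration, not part of datawizard).
filter_by_value <- function(data, values, i) {
  # Step 1: resolve values[i] to a plain number in this function's frame.
  target <- values[i]
  # Step 2: splice that number into the condition, giving `item3 == 2`.
  condition <- bquote(item3 == .(target))
  # Step 3: evaluate the condition with the data frame's columns in scope.
  keep <- eval(condition, envir = data)
  # which() drops rows where the comparison is NA.
  data[which(keep), , drop = FALSE]
}

res <- filter_by_value(df1, values = c(2, 2, 2), i = 1)
```

Because the index is resolved before the condition is built, the filter step only ever sees a column name and a constant.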

Currently, if the evaluation fails in data_filter(), we check whether the expression contains curly brackets, and if it doesn't, we throw an error. Supporting this case would mean evaluating the RHS of the condition before evaluating the condition itself. There could be a solution, but I think we could end up with very messy code, as in .select_nse().

@strengejacke what do you think?

Nov 06 '22 09:11 etiennebacher

Yeah, .select_nse() works fine, but it looks somewhat "unmaintainable" due to its confusing complexity...

I'm not sure, but in this particular case, data_filter(data, "item3 == min.index[i]"), could the issue be that we evaluate the string in the wrong environment? If so, there could be an "easy" solution, but this environment stuff, especially in combination with NSE, is still somewhat opaque to me.
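
The environment hypothesis can be illustrated with a small base-R sketch. It does not reproduce the internals of data_filter(): `toy_filter()` is a hypothetical stand-in that parses a string condition and evaluates it against the data, with an explicit enclosing environment for anything that is not a column.

```r
df1 <- data.frame(
  id = c(1, 2, 3, 1, 3),
  item3 = c(NA, 1, 1, 2, 3)
)

# Hypothetical string-based filter (an assumption, not datawizard code).
# `env` is where names that are not columns of `data` get looked up.
toy_filter <- function(data, condition, env) {
  expr <- parse(text = condition)[[1]]
  keep <- eval(expr, envir = data, enclos = env)
  data[which(keep), , drop = FALSE]
}

fun <- function(data, id, use_local_env = TRUE) {
  min.index <- NULL
  for (i in unique(data[[id]])) {
    min.index[i] <- 2
    # environment() is this function's frame, where `min.index` and `i` live;
    # baseenv() stands in for an environment that cannot see them.
    env <- if (use_local_env) environment() else baseenv()
    x <- toy_filter(data, "item3 == min.index[i]", env = env)
  }
  x
}

res_ok <- fun(df1, id = "id")
res_bad <- tryCatch(
  fun(df1, id = "id", use_local_env = FALSE),
  error = function(e) conditionMessage(e)
)
```

In this sketch the same string succeeds or fails depending only on the enclosing environment, which is the kind of behaviour the question above is asking about.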

Nov 06 '22 09:11 strengejacke

I tried to debug this issue. I saw that at this line:

https://github.com/easystats/datawizard/blob/9b2e2b5d49b15dd73a73f3b4aadbc081bc91921b/R/data_match.R#L209

.dynEval() returns NULL for the expression item3 == min.index[i].

When it comes to subsetting:

https://github.com/easystats/datawizard/blob/9b2e2b5d49b15dd73a73f3b4aadbc081bc91921b/R/data_match.R#L228-L233

symbol is item3 == min.index[i], and subset() errors at this point. Simpler variants of the example function do not work either, for example:

library(datawizard)

df1 <- data.frame(
  id = c(1, 2, 3, 1, 3),
  item1 = c(NA, 1, 1, 2, 3),
  item2 = c(NA, 1, 1, 2, 3),
  item3 = c(NA, 1, 1, 2, 3)
)

# Attempt 1
fun <- function(data, id) {
  min.index <- NULL
  for (i in unique(data[[id]])) {
    min.index <- 2
    x <- data_filter(data, item3 == min.index)
  }
  x
}
fun(df1, id = "id")
#> Error: Variable "min.index" was not found in the dataset.
#>   Possibly misspelled?

Created on 2023-06-16 with reprex v2.0.2

Not sure how/if we can solve this?
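
For reference, one generic base-R pattern for this kind of lookup is to capture the unquoted condition with substitute() and evaluate it with the data as the first scope and the caller's frame as the fallback. The sketch below is only an illustration of that pattern under those assumptions: `toy_filter_nse()` is a hypothetical name, and it says nothing about how .dynEval() or data_filter() would need to change.

```r
df1 <- data.frame(
  id = c(1, 2, 3, 1, 3),
  item1 = c(NA, 1, 1, 2, 3),
  item2 = c(NA, 1, 1, 2, 3),
  item3 = c(NA, 1, 1, 2, 3)
)

# Hypothetical NSE filter (an assumption, not datawizard code).
toy_filter_nse <- function(data, condition) {
  # Capture `item3 == min.index` without evaluating it.
  expr <- substitute(condition)
  # Frame of whoever called toy_filter_nse(), e.g. the body of fun().
  caller <- parent.frame()
  # Columns are looked up in `data` first, everything else in `caller`.
  keep <- eval(expr, envir = data, enclos = caller)
  data[which(keep), , drop = FALSE]
}

fun <- function(data, id) {
  min.index <- NULL
  for (i in unique(data[[id]])) {
    min.index <- 2
    x <- toy_filter_nse(data, item3 == min.index)
  }
  x
}

res <- fun(df1, id = "id")
```

Note that this only works when the filter is called directly from the function that defines `min.index`; if another wrapper sits in between, parent.frame() points at the wrapper instead.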

Jun 16 '23 06:06 strengejacke