`has_role()` does not select columns for imputation in `step_impute_knn()`

Open andreranza opened this issue 9 months ago • 2 comments

The problem

I'm having trouble selecting columns to impute within step_impute_knn() using has_role(). Thanks!

Reproducible example

library(recipes)

df <- tibble::tibble(
  country_code = c("AGO", "BGD", "BRA", "CHN", "PRK"),
  GDP = c(6930.7687, 35263.802, 8159000.64, 8485748, 9868.7669),
  D = c(32353588, 165516222, 211782878, 1407745000, 25755441),
  A = c(167, 1136, 2463, 2951, 367),
  B = c(3, NA, 5, NA, 7),
  C = c(13, NA, 5, NA, 4)
)

# imputation works
recipe(GDP ~ ., data = df) |>
  step_impute_knn(
    c("B", "C"),
    neighbors = 2,
    impute_with = c("D", "A")
  ) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 5 × 6
#>   country_code          D     A     B     C      GDP
#>   <fct>             <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 AGO            32353588   167     3  13      6931.
#> 2 BGD           165516222  1136     5   8.5   35264.
#> 3 BRA           211782878  2463     5   5   8159001.
#> 4 CHN          1407745000  2951     6   4.5 8485748 
#> 5 PRK            25755441   367     7   4      9869.

# imputation does not work
recipe(GDP ~ ., data = df) |>
  add_role(D, new_role = "impute") |>
  add_role(A, new_role = "impute") |>
  step_impute_knn(
    c("B", "C"),
    neighbors = 2,
    impute_with = has_role("impute")
  ) |>
  prep() |>
  bake(new_data = NULL)
#> Warning: All predictors are missing; cannot impute
#> All predictors are missing; cannot impute
#> # A tibble: 5 × 6
#>   country_code          D     A     B     C      GDP
#>   <fct>             <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 AGO            32353588   167     3    13    6931.
#> 2 BGD           165516222  1136    NA    NA   35264.
#> 3 BRA           211782878  2463     5     5 8159001.
#> 4 CHN          1407745000  2951    NA    NA 8485748 
#> 5 PRK            25755441   367     7     4    9869.

Session info
andreranza avatar Sep 07 '23 21:09 andreranza

Hello @andreranza :wave: Thanks for the wonderful reprex!

As per the documentation for step_impute_knn(), you need to wrap selector functions such as has_role() in imp_vars() when passing them to impute_with. I want to be able to use has_role() directly in cases like this, but it is not yet implemented.


library(recipes)

df <- tibble::tibble(
  country_code = c("AGO", "BGD", "BRA", "CHN", "PRK"),
  GDP = c(6930.7687, 35263.802, 8159000.64, 8485748, 9868.7669),
  D = c(32353588, 165516222, 211782878, 1407745000, 25755441),
  A = c(167, 1136, 2463, 2951, 367),
  B = c(3, NA, 5, NA, 7),
  C = c(13, NA, 5, NA, 4)
)

recipe(GDP ~ ., data = df) |>
  add_role(D, new_role = "impute") |>
  add_role(A, new_role = "impute") |>
  step_impute_knn(
    c("B", "C"),
    neighbors = 2,
    impute_with = imp_vars(has_role("impute"))
  ) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 5 × 6
#>   country_code          D     A     B     C      GDP
#>   <fct>             <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 AGO            32353588   167     3  13      6931.
#> 2 BGD           165516222  1136     5   8.5   35264.
#> 3 BRA           211782878  2463     5   5   8159001.
#> 4 CHN          1407745000  2951     6   4.5 8485748 
#> 5 PRK            25755441   367     7   4      9869.
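
To check which columns a role selector such as has_role("impute") would pick up, one option is to inspect the recipe's variable table with summary(). This is a minimal sketch on a trimmed-down version of the data above; the object names rec, role_info and impute_cols are illustrative only.

```r
library(recipes)

# Trimmed-down version of the example data (illustrative subset of columns)
df <- tibble::tibble(
  GDP = c(6930.7687, 35263.802, 8159000.64, 8485748, 9868.7669),
  D = c(32353588, 165516222, 211782878, 1407745000, 25755441),
  A = c(167, 1136, 2463, 2951, 367),
  B = c(3, NA, 5, NA, 7)
)

# add_role() keeps the existing "predictor" role and adds a second one
rec <- recipe(GDP ~ ., data = df) |>
  add_role(D, new_role = "impute") |>
  add_role(A, new_role = "impute")

# summary() on a recipe returns one row per variable/role combination
role_info <- summary(rec)

# Variables that carry the "impute" role
impute_cols <- role_info$variable[role_info$role %in% "impute"]
print(impute_cols)
```

Because add_role() adds a role rather than replacing one, D and A should each appear twice in role_info, once as "predictor" and once as "impute".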

EmilHvitfeldt avatar Sep 07 '23 22:09 EmilHvitfeldt

Wow, I definitely saw imp_vars(). Unsure why I didn't try that out 😅 I guess using has_role() without it felt so natural that I assumed it would work, despite what the documentation says. Sorry, and thanks a lot for pointing me in the right direction!

andreranza avatar Sep 08 '23 04:09 andreranza