`has_role()` does not select columns for imputation in `step_impute_knn()`

Open andreranza opened this issue 9 months ago • 2 comments

The problem

I'm having trouble selecting columns to impute within step_impute_knn() using has_role(). Thanks!

Reproducible example

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

df <- tibble::tibble(
  country_code = c("AGO", "BGD", "BRA", "CHN", "PRK"),
  GDP = c(6930.7687, 35263.802, 8159000.64, 8485748, 9868.7669),
  D = c(32353588, 165516222, 211782878, 1407745000, 25755441),
  A = c(167, 1136, 2463, 2951, 367),
  B = c(3, NA, 5, NA, 7),
  C = c(13, NA, 5, NA, 4)
)

# imputation works
recipe(GDP ~ ., data = df) |>
  step_impute_knn(
    c("B", "C"), 
    neighbors = 2, 
    impute_with = c("D", "A")
  ) |> 
  prep() |> 
  juice()
#> # A tibble: 5 × 6
#>   country_code          D     A     B     C      GDP
#>   <fct>             <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 AGO            32353588   167     3  13      6931.
#> 2 BGD           165516222  1136     5   8.5   35264.
#> 3 BRA           211782878  2463     5   5   8159001.
#> 4 CHN          1407745000  2951     6   4.5 8485748 
#> 5 PRK            25755441   367     7   4      9869.

# imputation does not work
recipe(GDP ~ ., data = df) |>
  add_role(D, new_role = "impute") |> 
  add_role(A, new_role = "impute") |> 
  step_impute_knn(
    c("B", "C"), 
    neighbors = 2, 
    impute_with = has_role("impute")
  ) |> 
  prep() |> 
  juice()
#> Warning: All predictors are missing; cannot impute
#> All predictors are missing; cannot impute
#> # A tibble: 5 × 6
#>   country_code          D     A     B     C      GDP
#>   <fct>             <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 AGO            32353588   167     3    13    6931.
#> 2 BGD           165516222  1136    NA    NA   35264.
#> 3 BRA           211782878  2463     5     5 8159001.
#> 4 CHN          1407745000  2951    NA    NA 8485748 
#> 5 PRK            25755441   367     7     4    9869.

Created on 2023-09-07 with reprex v2.0.2

Session info
sessionInfo()
#> R version 4.2.3 (2023-03-15)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] recipes_1.0.8 dplyr_1.1.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.7.0        tidyselect_1.2.0    xfun_0.39          
#>  [4] purrr_1.0.1         listenv_0.9.0       splines_4.2.3      
#>  [7] lattice_0.20-45     vctrs_0.6.3         generics_0.1.3     
#> [10] htmltools_0.5.4     yaml_2.3.7          utf8_1.2.3         
#> [13] survival_3.5-3      prodlim_2023.08.28  rlang_1.1.1        
#> [16] R.oo_1.25.0         pillar_1.9.0        glue_1.6.2         
#> [19] withr_2.5.0         R.utils_2.12.0      R.cache_0.16.0     
#> [22] lifecycle_1.0.3     lava_1.7.2.1        timeDate_4022.108  
#> [25] R.methodsS3_1.8.2   future_1.33.0       codetools_0.2-19   
#> [28] evaluate_0.21       knitr_1.43          fastmap_1.1.1      
#> [31] parallel_4.2.3      class_7.3-21        fansi_1.0.4        
#> [34] Rcpp_1.0.10         ipred_0.9-14        parallelly_1.36.0  
#> [37] fs_1.6.2            digest_0.6.33       grid_4.2.3         
#> [40] hardhat_1.3.0       cli_3.6.1           tools_4.2.3        
#> [43] magrittr_2.0.3      tibble_3.2.1        future.apply_1.11.0
#> [46] pkgconfig_2.0.3     ellipsis_0.3.2      MASS_7.3-58.2      
#> [49] Matrix_1.5-3        data.table_1.14.8   timechange_0.2.0   
#> [52] lubridate_1.9.2     reprex_2.0.2        gower_1.0.1        
#> [55] rmarkdown_2.23      rstudioapi_0.15.0   R6_2.5.1           
#> [58] globals_0.16.2      rpart_4.1.19        nnet_7.3-18        
#> [61] compiler_4.2.3

andreranza, Sep 07 '23 21:09

Hello @andreranza :wave: Thanks for the wonderful reprex!

As per the documentation for step_impute_knn(), you need to wrap selector functions such as has_role() in imp_vars() when passing them to impute_with. I would like has_role() to work directly in cases like this, but that is not yet implemented.

library(recipes)

df <- tibble::tibble(
  country_code = c("AGO", "BGD", "BRA", "CHN", "PRK"),
  GDP = c(6930.7687, 35263.802, 8159000.64, 8485748, 9868.7669),
  D = c(32353588, 165516222, 211782878, 1407745000, 25755441),
  A = c(167, 1136, 2463, 2951, 367),
  B = c(3, NA, 5, NA, 7),
  C = c(13, NA, 5, NA, 4)
)

recipe(GDP ~ ., data = df) |>
  add_role(D, new_role = "impute") |> 
  add_role(A, new_role = "impute") |> 
  step_impute_knn(
    c("B", "C"), 
    neighbors = 2, 
    impute_with = imp_vars(has_role("impute"))
  ) |> 
  prep() |> 
  juice()
#> # A tibble: 5 × 6
#>   country_code          D     A     B     C      GDP
#>   <fct>             <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 AGO            32353588   167     3  13      6931.
#> 2 BGD           165516222  1136     5   8.5   35264.
#> 3 BRA           211782878  2463     5   5   8159001.
#> 4 CHN          1407745000  2951     6   4.5 8485748 
#> 5 PRK            25755441   367     7   4      9869.

Created on 2023-09-07 with reprex v2.0.2
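
If it helps, one way to double-check which columns has_role("impute") will match is to look at the recipe's role table with summary() before adding the step. This is only a minimal sketch on the same data; the object names rec, role_info, impute_vars and imputed are illustrative, and bake(new_data = NULL) is used here in place of juice().

```r
library(recipes)

df <- tibble::tibble(
  country_code = c("AGO", "BGD", "BRA", "CHN", "PRK"),
  GDP = c(6930.7687, 35263.802, 8159000.64, 8485748, 9868.7669),
  D = c(32353588, 165516222, 211782878, 1407745000, 25755441),
  A = c(167, 1136, 2463, 2951, 367),
  B = c(3, NA, 5, NA, 7),
  C = c(13, NA, 5, NA, 4)
)

# Add the extra "impute" role to D and A (they keep their predictor role too)
rec <- recipe(GDP ~ ., data = df) |>
  add_role(D, new_role = "impute") |>
  add_role(A, new_role = "impute")

# summary() returns one row per variable/role pair
role_info <- summary(rec)
impute_vars <- role_info$variable[role_info$role %in% "impute"]
impute_vars

# Wrap the selector in imp_vars() so step_impute_knn() can resolve it
imputed <- rec |>
  step_impute_knn(
    c("B", "C"),
    neighbors = 2,
    impute_with = imp_vars(has_role("impute"))
  ) |>
  prep() |>
  bake(new_data = NULL)

# Check whether any missing values remain in the imputed columns
anyNA(imputed[c("B", "C")])
```

If impute_vars comes back empty, the role was probably not attached where expected, which would be worth ruling out before looking at the step itself.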

EmilHvitfeldt, Sep 07 '23 22:09

Wow, I definitely saw imp_vars(). Not sure why I didn't try it 😅 I guess using has_role() without it felt so natural that I assumed it would work, despite what the documentation says. Sorry, and thanks a lot for pointing me in the right direction!

andreranza, Sep 08 '23 04:09