recipes
`has_role()` does not select columns for imputation in `step_impute_knn()`
The problem
I'm having trouble selecting columns to impute within `step_impute_knn()` using `has_role()`. Thanks!
Reproducible example
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
df <- tibble::tibble(
  country_code = c("AGO", "BGD", "BRA", "CHN", "PRK"),
  GDP = c(6930.7687, 35263.802, 8159000.64, 8485748, 9868.7669),
  D = c(32353588, 165516222, 211782878, 1407745000, 25755441),
  A = c(167, 1136, 2463, 2951, 367),
  B = c(3, NA, 5, NA, 7),
  C = c(13, NA, 5, NA, 4)
)
# imputation works
recipe(GDP ~ ., data = df) |>
  step_impute_knn(
    c("B", "C"),
    neighbors = 2,
    impute_with = c("D", "A")
  ) |>
  prep() |>
  juice()
#> # A tibble: 5 × 6
#>   country_code          D     A     B     C      GDP
#>   <fct>             <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 AGO            32353588   167     3  13      6931.
#> 2 BGD           165516222  1136     5   8.5   35264.
#> 3 BRA           211782878  2463     5   5   8159001.
#> 4 CHN          1407745000  2951     6   4.5 8485748
#> 5 PRK            25755441   367     7   4      9869.
# imputation does not work
recipe(GDP ~ ., data = df) |>
  add_role(D, new_role = "impute") |>
  add_role(A, new_role = "impute") |>
  step_impute_knn(
    c("B", "C"),
    neighbors = 2,
    impute_with = has_role("impute")
  ) |>
  prep() |>
  juice()
#> Warning: All predictors are missing; cannot impute
#> All predictors are missing; cannot impute
#> # A tibble: 5 × 6
#>   country_code          D     A     B     C      GDP
#>   <fct>             <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 AGO            32353588   167     3    13    6931.
#> 2 BGD           165516222  1136    NA    NA   35264.
#> 3 BRA           211782878  2463     5     5 8159001.
#> 4 CHN          1407745000  2951    NA    NA 8485748
#> 5 PRK            25755441   367     7     4    9869.
Created on 2023-09-07 with reprex v2.0.2
Session info
sessionInfo()
#> R version 4.2.3 (2023-03-15)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] recipes_1.0.8 dplyr_1.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] styler_1.7.0 tidyselect_1.2.0 xfun_0.39
#> [4] purrr_1.0.1 listenv_0.9.0 splines_4.2.3
#> [7] lattice_0.20-45 vctrs_0.6.3 generics_0.1.3
#> [10] htmltools_0.5.4 yaml_2.3.7 utf8_1.2.3
#> [13] survival_3.5-3 prodlim_2023.08.28 rlang_1.1.1
#> [16] R.oo_1.25.0 pillar_1.9.0 glue_1.6.2
#> [19] withr_2.5.0 R.utils_2.12.0 R.cache_0.16.0
#> [22] lifecycle_1.0.3 lava_1.7.2.1 timeDate_4022.108
#> [25] R.methodsS3_1.8.2 future_1.33.0 codetools_0.2-19
#> [28] evaluate_0.21 knitr_1.43 fastmap_1.1.1
#> [31] parallel_4.2.3 class_7.3-21 fansi_1.0.4
#> [34] Rcpp_1.0.10 ipred_0.9-14 parallelly_1.36.0
#> [37] fs_1.6.2 digest_0.6.33 grid_4.2.3
#> [40] hardhat_1.3.0 cli_3.6.1 tools_4.2.3
#> [43] magrittr_2.0.3 tibble_3.2.1 future.apply_1.11.0
#> [46] pkgconfig_2.0.3 ellipsis_0.3.2 MASS_7.3-58.2
#> [49] Matrix_1.5-3 data.table_1.14.8 timechange_0.2.0
#> [52] lubridate_1.9.2 reprex_2.0.2 gower_1.0.1
#> [55] rmarkdown_2.23 rstudioapi_0.15.0 R6_2.5.1
#> [58] globals_0.16.2 rpart_4.1.19 nnet_7.3-18
#> [61] compiler_4.2.3
Hello @andreranza 👋 Thanks for the wonderful reprex!
As per the documentation for `step_impute_knn()`, you need to wrap selector functions such as `has_role()` in `imp_vars()` when passing them to `impute_with`. I want to be able to use `has_role()` directly in cases like this, but it is not yet implemented.
library(recipes)
df <- tibble::tibble(
  country_code = c("AGO", "BGD", "BRA", "CHN", "PRK"),
  GDP = c(6930.7687, 35263.802, 8159000.64, 8485748, 9868.7669),
  D = c(32353588, 165516222, 211782878, 1407745000, 25755441),
  A = c(167, 1136, 2463, 2951, 367),
  B = c(3, NA, 5, NA, 7),
  C = c(13, NA, 5, NA, 4)
)
recipe(GDP ~ ., data = df) |>
  add_role(D, new_role = "impute") |>
  add_role(A, new_role = "impute") |>
  step_impute_knn(
    c("B", "C"),
    neighbors = 2,
    impute_with = imp_vars(has_role("impute"))
  ) |>
  prep() |>
  juice()
#> # A tibble: 5 × 6
#>   country_code          D     A     B     C      GDP
#>   <fct>             <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 AGO            32353588   167     3  13      6931.
#> 2 BGD           165516222  1136     5   8.5   35264.
#> 3 BRA           211782878  2463     5   5   8159001.
#> 4 CHN          1407745000  2951     6   4.5 8485748
#> 5 PRK            25755441   367     7   4      9869.
Created on 2023-09-07 with reprex v2.0.2
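To check which columns `has_role("impute")` should resolve to before running the imputation, one option is to inspect the recipe's role table. The sketch below reuses the `df` from the reprex and calls `summary()` on the unprepped recipe; the names `rec`, `roles`, and `impute_cols` are introduced here for illustration only. Because `add_role()` adds a role on top of the existing one, `D` and `A` should each appear under both `"predictor"` and `"impute"`.

```r
library(recipes)

# Same data as in the reprex above
df <- tibble::tibble(
  country_code = c("AGO", "BGD", "BRA", "CHN", "PRK"),
  GDP = c(6930.7687, 35263.802, 8159000.64, 8485748, 9868.7669),
  D = c(32353588, 165516222, 211782878, 1407745000, 25755441),
  A = c(167, 1136, 2463, 2951, 367),
  B = c(3, NA, 5, NA, 7),
  C = c(13, NA, 5, NA, 4)
)

# Build the recipe with the extra "impute" role, without any steps yet
rec <- recipe(GDP ~ ., data = df) |>
  add_role(D, new_role = "impute") |>
  add_role(A, new_role = "impute")

# summary() returns one row per variable/role pair
roles <- summary(rec)

# Variables that carry the "impute" role (%in% keeps the filter NA-safe)
impute_cols <- roles$variable[roles$role %in% "impute"]
print(impute_cols)
```

If the printed variables are the ones you expect, the role assignment itself is fine, and the remaining step is wrapping the selector as `imp_vars(has_role("impute"))`.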
Wow, I definitely saw `imp_vars()`. Unsure why I didn't try that out 😅 I guess using `has_role()` without it felt so natural that I assumed it would work, despite what the documentation says. Sorry, and thanks a lot for pointing me in the right direction!