`CppMethod` error when applying prepped UMAP recipe after saving/reading as `.rds`

Open juliasilge opened this issue 2 years ago • 2 comments

Seems like there is a bug 🐛 in step_umap() when trying to save a prepped recipe as .rds and then reading it back to apply it to new data.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(tidyverse)
library(embed)

split <- seq.int(1, 150, by = 9)
tr <- iris[-split, ]
te <- iris[ split, ]

set.seed(11)
supervised <- 
   recipe(Species ~ ., data = tr) %>%
   step_center(all_predictors()) %>% 
   step_scale(all_predictors()) %>% 
   step_umap(all_predictors(), outcome = vars(Species), num_comp = 2) %>% 
   prep(training = tr)

write_rds(supervised, here::here(tempdir(), "umap.rds"))
saved_rec <- read_rds(here::here(tempdir(), "umap.rds"))
saved_rec %>% bake(new_data = te)
#> Error in .External(structure(list(name = "CppMethod__invoke_notvoid", : NULL value passed as symbol address

Created on 2021-08-02 by the reprex package (v2.0.0)

I'm sure this is not us (i.e. not the embed package) but I wonder if there is anything we can do about this.

The recipe is fine if you don't save as .rds and then read it back.

juliasilge • Aug 02 '21 21:08

I am very late to discovering this, but yes this is almost certainly because of the underlying UMAP package (uwot), which uses RcppAnnoy, which itself wraps the C++ library Annoy to find approximate nearest neighbors. The RcppAnnoy objects have save and load methods that must be called and just using saveRDS with them won't work (at least I couldn't get it to work). In turn uwot needs to provide special functions to save and load its state but it's all very unsatisfactory. Sorry about that. I was unable to think of a workaround.

I do intend to fix this, but my current solution involves writing an entirely new approximate nearest neighbors package. As that and maintaining uwot are entirely a spare-time endeavor, it's taking quite a long time (3 years and counting for the nearest neighbor package). I'll get there in the end. Probably.

jlmelville • Mar 16 '22 06:03
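
For reference, here is a minimal sketch of the native save/load route described above, using uwot directly rather than through a recipe. It assumes the uwot package is installed. `save_uwot()` and `load_uwot()` are uwot's own helpers; the object names (`x_train`, `x_new`, `model_file`) are purely illustrative. This is not the fix adopted in embed, only an illustration of why a plain `saveRDS()` is not enough for a model that holds a nearest-neighbor index.

```r
library(uwot)

# Same train/test split as in the reprex above (illustrative only)
split   <- seq.int(1, 150, by = 9)
x_train <- as.matrix(iris[-split, 1:4])
x_new   <- as.matrix(iris[ split, 1:4])

set.seed(11)
# ret_model = TRUE keeps the pieces (including the nearest-neighbor index)
# that are needed later to embed new observations
model <- umap(x_train, n_components = 2, ret_model = TRUE)

# Use uwot's own serializer instead of saveRDS()/write_rds()
model_file <- tempfile(fileext = ".uwot")
save_uwot(model, file = model_file)

# Restore the model and project the held-out rows
restored      <- load_uwot(model_file)
new_embedding <- umap_transform(x_new, restored)
dim(new_embedding)
```

The restored model can be passed to `umap_transform()` because `load_uwot()` rebuilds the index from the saved file, which a generic R serializer does not do.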

Thanks for the message @jlmelville and for your work on uwot! 🙌 We also are thinking about serialization for trained model objects like xgboost, torch, etc, that have native methods for saving/loading. Definitely an area that needs some attention from all of us!

juliasilge • Mar 17 '22 23:03

This has now been solved with the new bundle package:

library(tidymodels)
library(tidyverse)
library(embed)

split <- seq.int(1, 150, by = 9)
tr <- iris[-split, ]
te <- iris[ split, ]

set.seed(11)
supervised <- 
  recipe(Species ~ ., data = tr) %>%
  step_center(all_predictors()) %>% 
  step_scale(all_predictors()) %>% 
  step_umap(all_predictors(), outcome = vars(Species), num_comp = 2) %>% 
  prep(training = tr)

library(bundle)
temp_file <- fs::file_temp(pattern = "umap", ext = "rds")
bundle(supervised) %>% write_rds(temp_file)

saved_rec <- read_rds(temp_file)
unbundle(saved_rec) %>% bake(new_data = te)
#> # A tibble: 17 × 3
#>    Species     UMAP1  UMAP2
#>    <fct>       <dbl>  <dbl>
#>  1 setosa      13.3    2.93
#>  2 setosa      12.0    4.69
#>  3 setosa      14.5    3.12
#>  4 setosa      13.5    3.07
#>  5 setosa      13.4    2.99
#>  6 setosa      12.0    4.86
#>  7 versicolor -10.1    8.80
#>  8 versicolor  -9.79   8.28
#>  9 versicolor  -4.91 -11.6 
#> 10 versicolor  -9.66   6.12
#> 11 versicolor -10.1    6.61
#> 12 versicolor -10.3    6.98
#> 13 virginica   -4.14 -11.6 
#> 14 virginica   -2.69 -12.1 
#> 15 virginica   -4.06 -10.3 
#> 16 virginica   -1.73 -11.5 
#> 17 virginica   -2.33 -10.9

Created on 2022-09-16 with reprex v2.0.2

We should document somewhere that this step needs to be bundled for use in a new session. How do you all want to do that?

juliasilge • Sep 16 '22 22:09

Looks like I need to get in on this bundle thing...

jlmelville • Sep 17 '22 00:09

I think we should document it as a section, like we do with Tidying and Case weights. That way it will be easier to link to the documentation when the question pops up.

EmilHvitfeldt • Sep 17 '22 00:09

Agreed. We just did this for the parsnip engine docs.

topepo • Sep 22 '22 13:09

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions[bot] • Oct 08 '22 02:10