juliasilge.com
PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes | Julia Silge
Use tidymodels for unsupervised dimensionality reduction.
Thank you so much, Julia. I think this video and content are great as an intuitive explanation of PCA and how to implement and visualize it well in RStudio.
Hi Julia, this tidy workflow is very interesting and I am using it more and more. I also tried the UMAP workflow, but how do I predict UMAP coordinates on a new set of data? In your example, if I bake umap_prep on a different dataset (with the same variables), it does not work, and neither does the standard 'predict' function. Am I doing something wrong, or is it not possible to predict/bake on a new set?
@portolan75 Is it this problem that you are seeing? Or something else?
If it is something else, then I suggest that you create a reprex (a minimal reproducible example) for the problem you are observing, and post it on RStudio Community. The goal of a reprex is to make it easier for us to recreate your problem so that others can understand it.
If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:
install.packages("reprex")
Thanks! 🙌
Hi @juliasilge, thanks for your answer. After your comment I tried again and realised I had done something wrong with my dataset, which is why I was not able to 'predict' (bake) on the test set. I was getting good results on the training set but could not bake the UMAP coordinates for the test set. Anyway, it worked. Thanks for the attention and also for redirecting me to the other 'CppMethod' problem, which turned out to be useful as well.
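For anyone landing on this thread with the same question, here is a minimal sketch of baking a prepped UMAP recipe on new data. It assumes the embed package (which provides step_umap() and relies on uwot) is installed. The iris data, the 100/50 split, and the object names are illustrative choices, not taken from the original post.

```r
library(recipes)
library(embed)  # provides step_umap()

# Illustrative split: 100 rows to estimate the embedding, 50 held out
set.seed(123)
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
new_obs <- iris[-idx, ]

umap_rec <- recipe(~ ., data = train) %>%
  update_role(Species, new_role = "id") %>%   # keep Species, but not as a predictor
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), num_comp = 2)

# prep() estimates the normalization and the UMAP embedding on `train` only
umap_prep <- prep(umap_rec)

# Coordinates for the training rows
train_coords <- bake(umap_prep, new_data = NULL)

# Coordinates for unseen rows, projected into the already-fitted embedding
new_coords <- bake(umap_prep, new_data = new_obs)
head(new_coords)
```

The new data needs the same predictor columns (names and types) as the data the recipe was prepped on; a mismatch there is one way for bake() to fail.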
All this work is so brilliant, @juliasilge. Is there any literature (book chapters, articles, videos) on PCA interpretation that you can recommend?
- One blog post + conference talk that I personally did is this one, using Stack Overflow data.
- I like this Cross Validated answer.
- This interactive explanation from setosa.io is one I often come back to.
Thank you for the fantastic tutorial, but I have a question: how can we change the rotation method applied in step_pca()?
@Kasramhdz The step_pca() function uses stats::prcomp() under the hood, which I don't believe supports that, but you can get out the loadings using tidy() and the type = "coef" argument and then apply a rotation yourself. See this Cross Validated answer for more info.
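A minimal sketch of the "apply a rotation yourself" route, assuming a varimax rotation via stats::varimax() is what is wanted (the thread does not name a rotation method). The USArrests data, the id "pca", and the choice to rotate only the first two components are illustrative assumptions. Whether to rescale the loadings before rotating is a separate decision that this sketch does not address.

```r
library(recipes)
library(tidyr)

rec <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2, id = "pca")

# Pull the loadings out of the prepped recipe and keep the first two components
loadings <- prep(rec) %>%
  tidy(id = "pca", type = "coef") %>%
  filter(component %in% c("PC1", "PC2")) %>%
  pivot_wider(id_cols = terms, names_from = component, values_from = value)

# varimax() expects a plain numeric matrix: variables in rows, components in columns
loading_mat <- as.matrix(loadings[, c("PC1", "PC2")])
rownames(loading_mat) <- loadings$terms

rotated <- stats::varimax(loading_mat)
rotated$loadings  # rotated loadings
rotated$rotmat    # the orthogonal rotation matrix that was applied
```

The rotated loadings live outside the recipe, so bake() still returns the unrotated components; rotated scores would have to be computed separately from the normalized data.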
I have another question. I'm new to tidymodels, but apparently step_pca() arguments such as num_comp or threshold are not being applied when the recipe is trained. In the example below, I'm still getting 4 components despite setting num_comp = 2.
library(recipes)
library(tidyr)  # for pivot_wider()
rec <- recipe( ~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2)
prep(rec) %>%
  tidy(number = 2, type = "coef") %>%
  pivot_wider(names_from = component, values_from = value, id_cols = terms)
@Kasramhdz The full PCA is determined (so you can still compute the variances of each term) and num_comp specifies how many of the components are retained as predictors. If you want to specify the maximal rank, you can pass that through options:
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
rec <- recipe( ~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2, options = list(rank. = 2))
prep(rec) %>% tidy(number = 2, type = "coef")
#> # A tibble: 8 × 4
#>   terms     value component id
#>   <chr>     <dbl> <chr>     <chr>
#> 1 Murder   -0.536 PC1       pca_T11OM
#> 2 Assault  -0.583 PC1       pca_T11OM
#> 3 UrbanPop -0.278 PC1       pca_T11OM
#> 4 Rape     -0.543 PC1       pca_T11OM
#> 5 Murder    0.418 PC2       pca_T11OM
#> 6 Assault   0.188 PC2       pca_T11OM
#> 7 UrbanPop -0.873 PC2       pca_T11OM
#> 8 Rape     -0.167 PC2       pca_T11OM
Created on 2022-01-12 by the reprex package (v2.0.1)
You could also control this via the tol argument.
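A minimal sketch of the tol route, assuming it is passed through options in the same way as rank. in the reprex above. In stats::prcomp(), tol drops components whose standard deviation is less than or equal to tol times that of the first component. The value 0.5 and the id "pca" are illustrative choices.

```r
library(recipes)

rec_tol <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2, options = list(tol = 0.5), id = "pca")

prepped <- prep(rec_tol)

# Components that prcomp() kept after applying the tol cutoff
coefs <- tidy(prepped, id = "pca", type = "coef")
unique(coefs$component)

# Columns actually retained as predictors, governed by num_comp
names(bake(prepped, new_data = NULL))
```

Unlike rank., which fixes the number of components directly, tol is a relative cutoff, so the number of components that survive depends on the data.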