Feature Request: Recipe Step(s) to compute group-meaned and de-meaned variables
Now that {multilevelmod} is on CRAN, it would be great to have one or more recipe steps that compute group-meaned and de-meaned variables, as the new {datawizard} functions demean(), degroup(), and detrend() do.
This would be useful when we want to model between-subject and within-subject effects.
For more information about this functionality, see the documentation for demean() by running ?datawizard::demean.
Thank you for all your work!
I was looking through those docs and am wondering if you can lay out in a bit more detail how you might expect this to behave. For example, a new step to compute the mean per group? Would you want this to behave differently than embed::step_lencode_glm() does with, say, one nominal predictor and using the variable you want to "group-mean" as the outcome? And would a change to step_center() to allow passing in a grouping variable give you the "de-meaned" version?
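For reference, a minimal sketch of what step_center() does today: it subtracts the overall (grand) mean estimated by prep() and takes no grouping variable. Using iris and Petal.Length here is an assumption made purely for illustration.

```r
library(recipes)

# step_center() subtracts the overall training-set mean of each selected column.
rec <- recipe(~ Petal.Length + Species, data = iris)
rec <- step_center(rec, Petal.Length)

centered <- bake(prep(rec), new_data = NULL)

# The overall mean is (numerically) zero after centering ...
overall_mean <- mean(centered$Petal.Length)

# ... but the per-Species means are not, so this is grand-mean centering,
# not the within-group "de-meaning" discussed in this thread.
group_means <- tapply(centered$Petal.Length, centered$Species, mean)
```

A grouped variant would have to subtract a different mean per Species, which is what the request asks for.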
So this is how it would be done with dplyr for one variable. I think there could be one recipe step for each of the mutate functions below.
Yes, step_center() with a grouping variable could replace the 2nd mutate(). We would need to introduce the mean (creating the "_between" variable(s)) in a previous step (the 1st mutate()) before step_center() removes the original variable(s).
Also, the recipe step replacing the 1st mutate() would need a name suffix argument (in addition to a grouping argument), as we might want to name the variable(s) "_contextual" instead of "_between" if we won't be centering within clusters in the next step (e.g., if we use grand-mean centering later or don't center at all).
library(dplyr)

iris_demeaned <- iris %>%
  select(Petal.Length, Species) %>%
  group_by(Species) %>%
  # recipe step 1 with grouping & name suffix arguments
  mutate(Petal.Length_between = mean(Petal.Length)) %>%
  # recipe step 2 with grouping & name suffix arguments
  mutate(Petal.Length_within = Petal.Length - mean(Petal.Length), .keep = "unused") %>%
  ungroup()

iris_demeaned
#> # A tibble: 150 × 3
#>    Species Petal.Length_between Petal.Length_within
#>    <fct>                  <dbl>               <dbl>
#>  1 setosa                  1.46             -0.0620
#>  2 setosa                  1.46             -0.0620
#>  3 setosa                  1.46             -0.162
#>  4 setosa                  1.46              0.0380
#>  5 setosa                  1.46             -0.0620
#>  6 setosa                  1.46              0.238
#>  7 setosa                  1.46             -0.0620
#>  8 setosa                  1.46              0.0380
#>  9 setosa                  1.46             -0.0620
#> 10 setosa                  1.46              0.0380
#> # … with 140 more rows
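To make the proposed interface concrete, here is a rough sketch of the two operations as plain dplyr helpers, each taking a grouping argument and a name suffix argument. These helpers are hypothetical: the function names and the `vars`, `group`, and `suffix` arguments are assumptions for illustration, not part of recipes or any existing package.

```r
library(dplyr)

# Hypothetical helpers, not an existing API.

# "Step 1": add the per-group mean of each selected column under a new name.
add_group_mean <- function(data, vars, group, suffix = "_between") {
  data %>%
    group_by({{ group }}) %>%
    mutate(across({{ vars }}, mean, .names = paste0("{.col}", suffix))) %>%
    ungroup()
}

# "Step 2": replace each selected column by its deviation from the group mean.
center_within_group <- function(data, vars, group, suffix = "_within") {
  data %>%
    group_by({{ group }}) %>%
    mutate(across({{ vars }}, ~ .x - mean(.x), .names = paste0("{.col}", suffix))) %>%
    ungroup() %>%
    select(-{{ vars }})  # drop the original, uncentered column(s)
}

result <- iris %>%
  select(Petal.Length, Species) %>%
  add_group_mean(Petal.Length, Species) %>%
  center_within_group(Petal.Length, Species)
```

Passing suffix = "_contextual" to the first helper would cover the case where no within-cluster centering follows.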
Out of curiosity, would there be a need for two steps, or could they be combined into one? Put differently, would there ever be a time when you would want to calculate the _between variable without the _within variable?
Yes, there would need to be two steps, because there are several scenarios you may want to cover:
- You may want to keep the variable uncentered and introduce the per-context/cluster mean. In that case I would rename the variable itself to _within and call the per-context/cluster mean _contextual, since it would represent the contextual effect.
iris %>%
  select(Petal.Length, Species) %>%
  rename(Petal.Length_within = Petal.Length) %>%
  group_by(Species) %>%
  mutate(Petal.Length_contextual = mean(Petal.Length_within)) %>%
  ungroup()
- You may want to center the variable on the grand mean and introduce the per-context/cluster mean. In that case I would call the grand-mean-centered variable _within and the per-context/cluster mean _contextual, since it would represent the contextual effect.
iris %>%
  select(Petal.Length, Species) %>%
  # Uses the grand mean for centering
  mutate(Petal.Length_within = Petal.Length - mean(Petal.Length), .keep = "unused") %>%
  group_by(Species) %>%
  mutate(Petal.Length_contextual = mean(Petal.Length_within)) %>%
  ungroup()
- The scenario from my earlier post, which centers within context/cluster and introduces the per-context/cluster mean. I slightly reordered the steps below to match the scenarios above. In this case the variable centered within context/cluster would be called _within and the per-context/cluster mean would be called _between, since it would represent the between effect. Note that in this scenario the per-context/cluster mean is computed from the original, uncentered variable; if it were computed from the variable already centered within context/cluster, every such mean would be 0.
iris %>%
  select(Petal.Length, Species) %>%
  group_by(Species) %>%
  mutate(Petal.Length_within = Petal.Length - mean(Petal.Length)) %>%
  mutate(Petal.Length_between = mean(Petal.Length), .keep = "unused") %>%
  ungroup()
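The note in the last scenario can be checked quickly. This is a small base R sketch using ave(); the variable names `x_within`, `means_of_centered`, and `means_of_original` are made up for illustration.

```r
# Center Petal.Length within each Species. ave() returns the group mean
# (its default FUN) aligned with every row.
x_within <- iris$Petal.Length - ave(iris$Petal.Length, iris$Species)

# Group means of the already-centered variable: all (numerically) zero,
# so they carry no information as a predictor.
means_of_centered <- tapply(x_within, iris$Species, mean)

# Group means of the original variable: these are the "_between" values.
means_of_original <- tapply(iris$Petal.Length, iris$Species, mean)
```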
Basically, the interpretation of the introduced per context/cluster mean depends on the type of centering done to the variable. If the variable was uncentered or centered using the grand-mean, then the introduced per context/cluster mean represents the contextual effect. If the variable will be centered within context/cluster, then the introduced per context/cluster mean represents the between effect.
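A small sketch of why the label changes with the centering. It uses plain lm() instead of a multilevel model to stay self-contained, and treating Sepal.Length as the outcome is an assumption for illustration only. Because the uncentered variable equals its within part plus its cluster mean, the two models below are reparameterizations of the same fit, and the contextual coefficient equals the between coefficient minus the within coefficient.

```r
d <- iris[, c("Sepal.Length", "Petal.Length", "Species")]
d$x_mean   <- ave(d$Petal.Length, d$Species)  # per-cluster mean
d$x_within <- d$Petal.Length - d$x_mean       # centered within cluster

# Centered within cluster: the coefficient on x_mean is the between effect.
fit_cwc <- lm(Sepal.Length ~ x_within + x_mean, data = d)

# Uncentered: the coefficient on x_mean is the contextual effect.
fit_raw <- lm(Sepal.Length ~ Petal.Length + x_mean, data = d)

within_effect     <- unname(coef(fit_cwc)["x_within"])
between_effect    <- unname(coef(fit_cwc)["x_mean"])
contextual_effect <- unname(coef(fit_raw)["x_mean"])

# Same fit, different parameterization:
# contextual = between - within (up to floating-point error).
same_fit <- isTRUE(all.equal(contextual_effect, between_effect - within_effect))
```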
Here is a good paper on the topic: http://quantpsy.org/pubs/yaremych_preacher_hedeker_(in.press).pdf
Something similar has been implemented in the tft package (step_group_normalize).
TBH I'm worried that this could be misused, for example by grouping on the outcome class.
I think that it's a good idea that should live in an extension package.