infer
rep_slice_sample on groups with multiple n values
Hello package maintainers! I am building confidence intervals for groups with bootstrapped values and I'm having trouble creating multiple re-sampled datasets from which to build my confidence intervals.
Using the palmerpenguins library as an example:
library(tidyverse)
library(infer)
library(palmerpenguins)
There are 344 total observations and each species has a different number of observations:
nrow(penguins)
# [1] 344
penguins %>% group_by(species) %>% count()
# A tibble: 3 × 2
# Groups: species [3]
# species n
# <fct> <int>
#1 Adelie 152
#2 Chinstrap 68
#3 Gentoo 124
I want to group by species and, for each species, draw multiple samples that each keep that group's original number of observations.
set.seed(100)
slices <- penguins %>%
group_by(species) %>%
rep_slice_sample(prop = 1, replace = TRUE, reps = 10)
That should give me 344 * 10 = 3440 rows in the full new data set, and it does. But when you look at the data, each replicate has a different number of observations per species. For every replicate, n should be 152 for Adelie, 68 for Chinstrap, and 124 for Gentoo. Instead we find this:
slices %>% group_by(species, replicate) %>% count()
# A tibble: 30 × 3
# Groups: species, replicate [30]
# species replicate n
# <fct> <int> <int>
# 1 Adelie 1 148
# 2 Adelie 2 147
# 3 Adelie 3 148
# 4 Adelie 4 151
# 5 Adelie 5 138
# 6 Adelie 6 157
# 7 Adelie 7 161
# 8 Adelie 8 157
# 9 Adelie 9 151
#10 Adelie 10 138
# ℹ 20 more rows
# ℹ Use `print(n = ...)` to see more rows
What am I missing? Thanks for your insight.
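For reference, here is a minimal sketch of the behaviour I was expecting, written with plain dplyr rather than infer. It relies on `slice_sample()` sampling within each group of a grouped data frame. `stratified_bootstrap()` is a hypothetical helper I made up for illustration; it is not part of infer, and I am not claiming this is how `rep_slice_sample()` should be implemented.

```r
library(dplyr)
library(palmerpenguins)

# Hypothetical helper (not an infer function): draw `reps` bootstrap
# replicates from a grouped data frame, resampling within each group so
# that every replicate keeps the original group sizes.
stratified_bootstrap <- function(grouped_data, reps) {
  replicates <- lapply(seq_len(reps), function(i) {
    grouped_data %>%
      slice_sample(prop = 1, replace = TRUE) %>%  # sampled per group
      ungroup() %>%
      mutate(replicate = i)
  })
  bind_rows(replicates)
}

set.seed(100)
slices_stratified <- penguins %>%
  group_by(species) %>%
  stratified_bootstrap(reps = 10)

# Every species/replicate combination should now have the original n
slices_stratified %>% count(species, replicate)
```

With this approach each of the 30 species/replicate combinations keeps its original group size (152, 68, or 124), which is what I hoped the grouped `rep_slice_sample()` call would do.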
Hello adrie-stclair,
I'm not one of the package maintainers, but your question links to a question I was considering this weekend to put up here.
Let's say I have a dataset that is rather unbalanced with regard to the explanatory variable, and I draw bootstrap samples from it. I could end up with many bootstrap samples that contain no cases from the minority class. If I then want to calculate, for example, a diff in props statistic from these samples, I end up with many NaN values. I can easily drop these NaN samples from my analyses (in fact, the get_ci and visualise functions do this automatically), but it makes me wonder whether a stratified argument would be useful for the generate function.
I hope the package maintainers or authors could weigh in on the question above and on my related question.
I added a code example below.
library(dplyr)
library(moderndive)
library(infer)
set.seed(123)
promo_fem <- promotions |>
filter(gender == "female") |>
slice_sample(n = 3)
promo <- promotions |>
mutate(gender = "male")
promo <- bind_rows(promo, promo_fem)
table(promo$gender, promo$decision)
promo_bootstrap <- promo |>
specify(decision ~ gender, success = "promoted") |>
generate(5000, type = "bootstrap") |>
calculate("diff in props", order = c("male", "female"))
promo_bootstrap |>
filter(is.nan(stat)) |>
nrow()
promo_bootstrap_ci <- promo_bootstrap |>
get_confidence_interval()
visualise(promo_bootstrap) +
shade_ci(promo_bootstrap_ci)
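To make the idea of a stratified bootstrap concrete, here is a rough sketch done by hand with dplyr instead of `generate()`. It resamples within each gender, so the minority class is always present and the statistic is never NaN. `one_stratified_stat()` is a hypothetical helper for illustration only, and the result is a plain numeric vector, not an infer object, so I use `quantile()` for a percentile interval.

```r
library(dplyr)
library(moderndive)

set.seed(123)

# Rebuild the unbalanced data from the example above: 3 "female" rows
# plus all original rows relabelled as "male"
promo_fem <- promotions |>
  filter(gender == "female") |>
  slice_sample(n = 3)
promo <- promotions |>
  mutate(gender = "male") |>
  bind_rows(promo_fem)

# Hypothetical helper (not part of infer): one stratified bootstrap
# replicate of the difference in promotion proportions, male minus female
one_stratified_stat <- function(data) {
  props <- data |>
    group_by(gender) |>
    slice_sample(prop = 1, replace = TRUE) |>  # resample within each gender
    summarise(prop = mean(decision == "promoted"), .groups = "drop")
  props$prop[props$gender == "male"] - props$prop[props$gender == "female"]
}

stats <- vapply(seq_len(1000), function(i) one_stratified_stat(promo), numeric(1))

# Both groups are present in every replicate, so no NaN values occur
sum(is.nan(stats))

# Percentile-based 95% interval computed directly from the replicates
quantile(stats, c(0.025, 0.975))
```

With only three cases in the minority class the interval is still very wide, but every replicate contributes a usable statistic.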
Related to #503, #197. :)