rep_slice_sample on groups with multiple n values

Open adrie-stclair opened this issue 1 year ago • 2 comments

Hello package maintainers! I am building confidence intervals for groups with bootstrapped values and I'm having trouble creating multiple re-sampled datasets from which to build my confidence intervals.

Using the palmerpenguins library as an example:

library(tidyverse)
library(infer)
library(palmerpenguins)

There are 344 total observations and each species has a different number of observations:

nrow(penguins)
# [1] 344

penguins %>% group_by(species) %>% count()

# A tibble: 3 × 2
# Groups:   species [3]
#   species       n
#   <fct>     <int>
# 1 Adelie      152
# 2 Chinstrap    68
# 3 Gentoo      124

I want to group by species and, for each species, draw multiple resamples that keep that group's original number of observations.

set.seed(100)

slices <- penguins %>% 
    group_by(species) %>% 
    rep_slice_sample(prop = 1, replace = TRUE, reps = 10)

That should give me 344 * 10 = 3440 rows in the full resampled data set. That total is correct, but when you look at the data, the number of observations per species varies from replicate to replicate. In every replicate, n should be 152 for Adelie, 68 for Chinstrap, and 124 for Gentoo. Instead we find this:

slices %>% group_by(species, replicate) %>% count()

# A tibble: 30 × 3
# Groups:   species, replicate [30]
#    species replicate     n
#    <fct>       <int> <int>
#  1 Adelie          1   148
#  2 Adelie          2   147
#  3 Adelie          3   148
#  4 Adelie          4   151
#  5 Adelie          5   138
#  6 Adelie          6   157
#  7 Adelie          7   161
#  8 Adelie          8   157
#  9 Adelie          9   151
# 10 Adelie         10   138
# ℹ 20 more rows
# ℹ Use `print(n = ...)` to see more rows

What am I missing? Thanks for your insight.

adrie-stclair, Mar 25 '24 04:03
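The counts above are consistent with each replicate being drawn from all 344 rows at once rather than within each species, although the thread does not confirm how rep_slice_sample treats grouped data. As a possible workaround, here is a minimal sketch that assumes dplyr's slice_sample, which samples within groups on a grouped data frame. The helper stratified_replicate and the object slices_stratified are hypothetical names introduced for illustration and are not part of infer.

```r
library(dplyr)
library(palmerpenguins)

set.seed(100)

# Hypothetical helper (not part of infer): one stratified bootstrap replicate.
# dplyr::slice_sample() samples within each group of a grouped data frame,
# so prop = 1 with replace = TRUE keeps every species at its original size.
stratified_replicate <- function(data, i) {
  data %>%
    group_by(species) %>%
    slice_sample(prop = 1, replace = TRUE) %>%
    ungroup() %>%
    mutate(replicate = i)
}

# Stack 10 replicates into one data frame
slices_stratified <- bind_rows(
  lapply(1:10, function(i) stratified_replicate(penguins, i))
)

# Each species should now have a constant n across replicates
slices_stratified %>% count(species, replicate)
```

The result is a plain tibble with a replicate column, so it can be grouped by species and replicate for further summaries. It does not carry the attributes that infer attaches to its own resampled objects.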

Hello adrie-stclair,

I'm not one of the package maintainers, but your question is related to one I was considering posting here this weekend.

Let's say I have a dataset that is rather unbalanced with regard to the explanatory variable, and I draw bootstrap samples from it. I could end up with many bootstrap samples that contain no cases from the minority class. If I then calculate, for example, a diff in props statistic from these samples, I end up with many NaN values. I can easily drop these NaN samples from my analysis (in fact, the get_ci and visualise functions do this automatically), but it makes me wonder whether a stratified argument would be useful for the generate function.

I hope the package maintainers or authors can weigh in on the question above and on my related question.

I have added a code example below.

library(dplyr)
library(moderndive)
library(infer)

set.seed(123)

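# Keep only 3 randomly chosen female rows as the minority class
promo_fem <- promotions |> 
  filter(gender == "female") |> 
  slice_sample(n = 3)

# Relabel every original row as male to make the data set unbalanced
promo <- promotions |> 
  mutate(gender = "male")

# Combine: all original rows labelled male, plus the 3 female rows
promo <- bind_rows(promo, promo_fem)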

table(promo$gender, promo$decision)

promo_bootstrap <- promo |> 
  specify(decision ~ gender, success = "promoted") |> 
  generate(5000, type = "bootstrap") |> 
  calculate("diff in props", order = c("male", "female"))

# Count the bootstrap replicates whose statistic is NaN
promo_bootstrap |> 
  filter(is.nan(stat)) |> 
  nrow()

promo_bootstrap_ci <- promo_bootstrap |> 
  get_confidence_interval()

visualise(promo_bootstrap) +
  shade_ci(promo_bootstrap_ci)

pietervreeburg, Mar 25 '24 10:03
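To illustrate what stratified resampling would change in the example above, here is a rough sketch done by hand with dplyr rather than with generate. It assumes that resampling within each gender is appropriate for the study design, which is a statistical judgment the thread does not settle. The helper stratified_diff and the object boot_stats are hypothetical names, and the number of replicates is reduced to 1000 to keep the run short.

```r
library(dplyr)
library(moderndive)

set.seed(123)

# Rebuild the unbalanced data set from the example above
promo_fem <- promotions |>
  filter(gender == "female") |>
  slice_sample(n = 3)

promo <- promotions |>
  mutate(gender = "male") |>
  bind_rows(promo_fem)

# Hypothetical helper (not part of infer): resample within each gender,
# then compute the difference in promotion proportions (male - female).
stratified_diff <- function(data) {
  props <- data |>
    group_by(gender) |>
    slice_sample(prop = 1, replace = TRUE) |>
    summarise(p = mean(decision == "promoted"))
  props$p[props$gender == "male"] - props$p[props$gender == "female"]
}

n_reps <- 1000
boot_stats <- tibble(
  replicate = seq_len(n_reps),
  stat = vapply(seq_len(n_reps), function(i) stratified_diff(promo), numeric(1))
)

# Every replicate keeps all 3 female rows' worth of draws,
# so no replicate can produce a NaN statistic
sum(is.nan(boot_stats$stat))
```

Because each replicate always contains draws from both genders, both proportions are defined in every replicate. The trade-off is that the group sizes are held fixed, so the resulting distribution no longer reflects variation in group sizes.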

Related to #503, #197. :)

simonpcouch, Mar 25 '24 15:03