
Chapter 2 about the stratified sampling example

Open panghucui opened this issue 4 years ago • 1 comments

In chapter 2, there is a stratified sampling: For example, the US population is composed of 51.3% female and 48.7% male, so a well-conducted survey in the US would try to maintain this ratio in the sample: 513 female and 487 male. This is called stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population. If they used purely random sampling, there would be about 12% chance of sampling a skewed test set with either less than 49% female or more than 54% female. Either way, the survey results would be significantly biased.

The question is: could you explain how the 12%, 49%, and 54% figures were calculated? Thanks.

panghucui, Nov 04 '21 15:11

Hi @panghucui ,

Thanks for your question.

I chose 49% and 54% because they were the first round numbers more than 2 percentage points away from the real female ratio. I could have chosen 48% and 55%, or anything else, but an error of more than 2 percentage points on a survey where gender matters seemed bad enough.

So the question is: how to compute the 12%?

The simplest way, if you're not a fan of statistics, is to run a quick simulation:

import numpy as np

true_female_ratio = 0.513
sample_size = 1000

samples = np.random.rand(100_000, sample_size)
females_per_sample = (samples <= true_female_ratio).mean(axis=1)
bad_samples = (females_per_sample <= 0.49) | (females_per_sample >= 0.54)
print(bad_samples.mean())

This code generates 100,000 random samples, each of size 1,000. Then for each sample it computes the ratio of females. Then it computes the ratio of samples where the female ratio is abnormal (≤49% or ≥54%). If you run this code, it will generally output a probability close to 12.4%. Of course this is just an approximation, and the result varies slightly from run to run.
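One practical caveat: the simulation above allocates a 100,000 × 1,000 array of 64-bit floats, which is roughly 800 MB. If that is too much memory, a lighter variant is sketched below. It is not the code from the reply: it assumes NumPy's `Generator.binomial` to draw the number of females in each sample directly, and the names `n_surveys` and `skewed_ratio` and the seed value are arbitrary choices made for illustration.

```python
import numpy as np

true_female_ratio = 0.513
sample_size = 1000
n_surveys = 100_000  # hypothetical name; same number of simulated samples as above

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

# Draw the number of females in each sample directly, instead of
# materializing one uniform random number per surveyed person.
females_per_sample = rng.binomial(sample_size, true_female_ratio, size=n_surveys)

# A sample is "bad" if it has at most 490 or at least 540 females (<=49% or >=54%).
bad_samples = (females_per_sample <= 490) | (females_per_sample >= 540)
skewed_ratio = bad_samples.mean()
print(skewed_ratio)
```

Comparing integer counts (490 and 540) rather than ratios (0.49 and 0.54) also sidesteps any floating-point comparison at the thresholds.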

Now if you want to compute this mathematically, you first need to know that when you sample 1,000 people from a population with a female ratio of 51.3%, the number N of females in that sample follows a binomial distribution. We can use scipy.stats.binom(1000, 0.513) to get this distribution, and the cdf(n) method gives the probability that the number of females in the sample will be ≤n. So the following code gives essentially the same answer as above, but computed exactly instead of estimated via a simulation:

from scipy.stats import binom

true_female_ratio = 0.513
sample_size = 1000
distrib = binom(sample_size, true_female_ratio)
proba_low = distrib.cdf(490)  # P(N <= 490): at most 49% female
proba_high = 1 - distrib.cdf(539)  # P(N >= 540): at least 54% female
print(proba_low + proba_high)
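As a sanity check, the binomial result can also be approximated by hand with a normal distribution, since N has mean n·p and standard deviation sqrt(n·p·(1−p)). The sketch below is an addition, not part of the original answer: it assumes `scipy.stats.norm` and applies a continuity correction of 0.5 on each side. The names `mean`, `std`, and `approx_proba` are illustrative.

```python
from math import sqrt

from scipy.stats import norm

true_female_ratio = 0.513
sample_size = 1000

# Mean and standard deviation of the binomial count N
mean = sample_size * true_female_ratio  # 513 expected females
std = sqrt(sample_size * true_female_ratio * (1 - true_female_ratio))

# Continuity correction: N <= 490 becomes X <= 490.5, N >= 540 becomes X >= 539.5
proba_low = norm.cdf(490.5, loc=mean, scale=std)
proba_high = norm.sf(539.5, loc=mean, scale=std)  # sf(x) = 1 - cdf(x)
approx_proba = proba_low + proba_high
print(approx_proba)
```

With a sample this large the approximation should land close to the exact binomial value, which is a quick way to confirm that the 12% figure is in the right range.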

Hope this helps.

ageron, Nov 06 '21 04:11