Vanity should not encourage "peeking" at experiment results
Based on this article on A/B testing (and other similar articles), for a traditional A/B experiment:
- sample sizes should be decided before starting an experiment
- significance of the result should be calculated only after the sample size is reached
- stopping an experiment if significance is reached, but not the required sample size, is bad statistics
Some things Vanity should do better:
- document what method is used to determine significance (a Z-test; other methods like the G-test are possible; see the sketch after this list)
- don't facilitate ending an experiment early when the sample size hasn't been reached, as #146 might
- warn users about the statistical issues with ending experiments early, with a link to the blog post above
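For reference, a minimal sketch of the kind of pooled two-proportion z-test the z-score strategy is based on (Python with SciPy; the function name and signature are illustrative, not Vanity's actual implementation):

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that both alternatives convert equally.
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value
```

For example, `two_proportion_z_test(120, 1000, 150, 1000)` gives roughly z ≈ 1.96 and p ≈ 0.05, right at the conventional threshold, which is exactly the kind of result that tempts people into stopping early.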
Potential features that might help:
- make the Bayesian strategy the default instead of z-score, with a fixed time to run the experiment (a sketch of what a Bayesian strategy reports follows this list)
- have the target sample size be part of the experiment definition (or define baseline rates and calculate the required sample size from them, as sketched after this list)
- maybe this includes helping to determine the baseline conversion rate (run the experiment without alternatives until the baseline rate can be estimated with a calculated confidence interval: http://stats.stackexchange.com/a/38737)
- on the dashboard, use this sample size calculation to disallow completing the experiment until the sample size is reached
- report how large an effect can be detected given the current sample size: delta = (t_{α/2} + t_β) * sigma * sqrt(2/n), where t_{α/2} and t_β are the t-statistics for significance level α (two-sided) and power 1−β (see the sketch after this list)
- offer an alternative to watching for significance: a gauge of how complete the experiment is (also sketched after this list)
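To make the Bayesian suggestion above concrete, here is a generic sketch of the kind of quantity a Bayesian strategy reports: the posterior probability that an alternative beats the baseline under independent Beta(1, 1) priors. This is an illustration, not Vanity's actual Bayesian implementation:

```python
import numpy as np

def probability_b_beats_a(conv_a, visitors_a, conv_b, visitors_b, samples=100_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent
    Beta(1, 1) priors on each alternative's conversion rate."""
    rng = np.random.default_rng()
    posterior_a = rng.beta(1 + conv_a, 1 + visitors_a - conv_a, samples)
    posterior_b = rng.beta(1 + conv_b, 1 + visitors_b - conv_b, samples)
    return (posterior_b > posterior_a).mean()
```

Note that the analytics-toolkit paper linked below argues that naive Bayesian approaches do not automatically fix the peeking problem either, so a fixed run time still matters.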
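A rough sketch of the sample-size calculation the experiment definition (and the dashboard check) could rely on, assuming a two-proportion z-test with the usual normal approximation; the function names, defaults, and the baseline confidence-interval helper are illustrative, not a proposed Vanity API:

```python
from math import ceil, sqrt
from scipy.stats import norm

def required_sample_size(baseline_rate, minimum_detectable_effect, alpha=0.05, power=0.8):
    """Visitors needed per alternative to detect an absolute lift of
    `minimum_detectable_effect` over `baseline_rate` at the given alpha and power."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance level
    z_beta = norm.ppf(power)
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return ceil(n)

def baseline_confidence_interval(conversions, visitors, alpha=0.05):
    """Normal-approximation confidence interval for the baseline conversion rate,
    for deciding when the baseline estimate is good enough to start the experiment."""
    p = conversions / visitors
    margin = norm.ppf(1 - alpha / 2) * sqrt(p * (1 - p) / visitors)
    return p - margin, p + margin
```

For example, `required_sample_size(0.10, 0.02)` comes out to roughly 3,800 visitors per alternative to detect a lift from 10% to 12%.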
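And, inverting the same relationship, a sketch of the minimum detectable effect at the current sample size and a simple completeness gauge, using z-scores in place of the t-statistics in the delta formula above (a close approximation at A/B-test sample sizes); again, the names are illustrative:

```python
from math import sqrt
from scipy.stats import norm

def minimum_detectable_effect(baseline_rate, n_per_alternative, alpha=0.05, power=0.8):
    """Smallest absolute lift detectable with n visitors per alternative:
    delta = (z_{alpha/2} + z_beta) * sigma * sqrt(2 / n)."""
    sigma = sqrt(baseline_rate * (1 - baseline_rate))
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return (z_alpha + z_beta) * sigma * sqrt(2 / n_per_alternative)

def completeness(n_per_alternative, required_n):
    """Fraction of the planned sample size collected so far, capped at 100%,
    suitable for showing as a progress gauge instead of a live p-value."""
    return min(1.0, n_per_alternative / required_n)
```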
Also see http://nerds.airbnb.com/experiments-at-airbnb/, the section "How long do you need to run an experiment?", which compares two graphs over time (the effect, e.g. conversion rate delta, vs. the p-value) and argues that the experiment should not be stopped until the effect delta has stabilized.
Also see http://pages.optimizely.com/rs/optimizely/images/stats_engine_technical_paper.pdf for what Optimizely is doing to combat some of these problems.
Also worth looking at:
- https://github.com/auduno/seglir/ (blog post: auduno.com/post/106141177173/rapid-ab-testing-with-sequential-analysis).
- https://www.chrisstucchio.com/pubs/slides/gilt_bayesian_ab_2015/slides.html
- https://www.analytics-toolkit.com/pdf/Issues%20_with_Current_Bayesian_Approaches_to_AB_Testing_in_Conversion_Rate_Optimization_2017.pdf
- https://www.google.com/patents/US9760471