Vanity should not encourage "peeking" at experiment results
Based on this article on A/B testing (and other similar articles), for a traditional A/B experiment:
- sample sizes should be decided before starting an experiment
- significance of the result should be calculated only after the sample size is reached
- stopping an experiment if significance is reached, but not the required sample size, is bad statistics
Some things Vanity should do better:
- document what method is used to determine significance (a Z-test; other methods like the G-test are possible; see the sketch after this list)
- don't facilitate ending an experiment early when the sample size hasn't been reached, as #146 might
- warn users about the statistical issues with ending experiments early, with a link to the blog post above
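For reference, a minimal sketch of the kind of pooled two-proportion z-test the z-score strategy is based on (Python with SciPy; the function name and signature are illustrative, not Vanity's actual implementation):

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pooled rate under the null hypothesis that both alternatives convert equally.
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value
```

For example, `two_proportion_z_test(120, 1000, 150, 1000)` gives roughly z ≈ 1.96 and p ≈ 0.05, right at the conventional threshold, which is exactly the kind of result that tempts people into stopping early.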
Potential features that might help:
- make the Bayesian strategy the default instead of z-score, with a fixed time to run the experiment (a sketch of what a Bayesian strategy reports follows this list)
- have the target sample size be part of the experiment definition (or define baseline rates and calculate the required sample size from them, as sketched after this list)
- maybe this includes helping to determine the baseline conversion rate (run the experiment without alternatives until the baseline rate can be estimated with a calculated confidence interval: http://stats.stackexchange.com/a/38737)
- on the dashboard, use this sample size calculation to disallow completing the experiment until the sample size is reached
- report how large an effect can be detected given the current sample size: delta = (t_{α/2} + t_β) * sigma * sqrt(2/n), where t_{α/2} and t_β are the t-statistics for significance level α (two-sided) and power 1−β (see the sketch after this list)
- offer an alternative to watching for significance: a gauge of how complete the experiment is (also sketched after this list)
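To make the Bayesian suggestion above concrete, here is a generic sketch of the kind of quantity a Bayesian strategy reports: the posterior probability that an alternative beats the baseline under independent Beta(1, 1) priors. This is an illustration, not Vanity's actual Bayesian implementation:

```python
import numpy as np

def probability_b_beats_a(conv_a, visitors_a, conv_b, visitors_b, samples=100_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent
    Beta(1, 1) priors on each alternative's conversion rate."""
    rng = np.random.default_rng()
    posterior_a = rng.beta(1 + conv_a, 1 + visitors_a - conv_a, samples)
    posterior_b = rng.beta(1 + conv_b, 1 + visitors_b - conv_b, samples)
    return (posterior_b > posterior_a).mean()
```

Note that the analytics-toolkit paper linked below argues that naive Bayesian approaches do not automatically fix the peeking problem either, so a fixed run time still matters.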
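A rough sketch of the sample-size calculation the experiment definition (and the dashboard check) could rely on, assuming a two-proportion z-test with the usual normal approximation; the function names, defaults, and the baseline confidence-interval helper are illustrative, not a proposed Vanity API:

```python
from math import ceil, sqrt
from scipy.stats import norm

def required_sample_size(baseline_rate, minimum_detectable_effect, alpha=0.05, power=0.8):
    """Visitors needed per alternative to detect an absolute lift of
    `minimum_detectable_effect` over `baseline_rate` at the given alpha and power."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    p_bar = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance level
    z_beta = norm.ppf(power)
    n = ((z_alpha * sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return ceil(n)

def baseline_confidence_interval(conversions, visitors, alpha=0.05):
    """Normal-approximation confidence interval for the baseline conversion rate,
    for deciding when the baseline estimate is good enough to start the experiment."""
    p = conversions / visitors
    margin = norm.ppf(1 - alpha / 2) * sqrt(p * (1 - p) / visitors)
    return p - margin, p + margin
```

For example, `required_sample_size(0.10, 0.02)` comes out to roughly 3,800 visitors per alternative to detect a lift from 10% to 12%.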
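And, inverting the same relationship, a sketch of the minimum detectable effect at the current sample size and a simple completeness gauge, using z-scores in place of the t-statistics in the delta formula above (a close approximation at A/B-test sample sizes); again, the names are illustrative:

```python
from math import sqrt
from scipy.stats import norm

def minimum_detectable_effect(baseline_rate, n_per_alternative, alpha=0.05, power=0.8):
    """Smallest absolute lift detectable with n visitors per alternative:
    delta = (z_{alpha/2} + z_beta) * sigma * sqrt(2 / n)."""
    sigma = sqrt(baseline_rate * (1 - baseline_rate))
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return (z_alpha + z_beta) * sigma * sqrt(2 / n_per_alternative)

def completeness(n_per_alternative, required_n):
    """Fraction of the planned sample size collected so far, capped at 100%,
    suitable for showing as a progress gauge instead of a live p-value."""
    return min(1.0, n_per_alternative / required_n)
```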
Also see http://nerds.airbnb.com/experiments-at-airbnb/, the section "How long do you need to run an experiment?", which compares two graphs over time (the effect, e.g. conversion rate delta, vs. the p-value) and argues that the experiment should not be stopped until the effect delta has stabilized.
Also see http://pages.optimizely.com/rs/optimizely/images/stats_engine_technical_paper.pdf for what Optimizely is doing to combat some of these problems.
Also worth looking at:
- https://github.com/auduno/seglir/ (blog post: auduno.com/post/106141177173/rapid-ab-testing-with-sequential-analysis).
- https://www.chrisstucchio.com/pubs/slides/gilt_bayesian_ab_2015/slides.html
- https://www.analytics-toolkit.com/pdf/Issues%20_with_Current_Bayesian_Approaches_to_AB_Testing_in_Conversion_Rate_Optimization_2017.pdf
- https://www.google.com/patents/US9760471