
Vanity should not encourage "peeking" at experiment results

Open phillbaker opened this issue 12 years ago • 3 comments

Based on this article on A/B testing (and other similar articles), for a traditional A/B experiment:

  • sample sizes should be decided before starting an experiment
  • significance of the result should be calculated only after the sample size is reached
  • stopping an experiment once significance is reached, but before the required sample size, is bad statistics (the simulation sketch after this list illustrates why)
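
To make the peeking problem concrete, here is a small, self-contained Ruby simulation (illustrative only, not Vanity code; the rates, batch size, and number of peeks are arbitrary assumptions). Both alternatives share the same true conversion rate, so any "significant" result is a false positive; peeking at a z-test after every batch and stopping at the first significant result inflates the false-positive rate well above the nominal 5%, while a single check at the planned sample size stays close to 5%.

```ruby
# A/A test: both arms have the same true conversion rate, so every
# "significant" result is a false positive. Compare peeking after every
# batch against a single test at the planned sample size.
def z_score(conv_a, n_a, conv_b, n_b)
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  se     = Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
  return 0.0 if se.zero?
  (conv_b.to_f / n_b - conv_a.to_f / n_a) / se
end

TRUE_RATE  = 0.10   # identical in both arms
BATCH      = 200    # visitors added to each arm between peeks
PEEKS      = 20     # planned sample size is BATCH * PEEKS per arm
TRIALS     = 2_000
CRITICAL_Z = 1.96   # two-sided 5% significance

peeking_fp = 0
single_fp  = 0

TRIALS.times do
  conv_a = conv_b = 0
  n_a = n_b = 0
  stopped_early = false

  PEEKS.times do
    BATCH.times do
      n_a += 1
      conv_a += 1 if rand < TRUE_RATE
      n_b += 1
      conv_b += 1 if rand < TRUE_RATE
    end
    stopped_early ||= z_score(conv_a, n_a, conv_b, n_b).abs > CRITICAL_Z
  end

  peeking_fp += 1 if stopped_early
  single_fp  += 1 if z_score(conv_a, n_a, conv_b, n_b).abs > CRITICAL_Z
end

puts format("false positive rate, peeking every batch:    %.1f%%", 100.0 * peeking_fp / TRIALS)
puts format("false positive rate, single check at the end: %.1f%%", 100.0 * single_fp / TRIALS)
```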

Some things Vanity should do better:

  • document which method is used to determine significance (a Z-test, though other methods such as the G-test are possible; a sketch of the test follows this list)
  • don't facilitate ending an experiment early when the sample size hasn't been reached, as #146 might
  • warn users about the statistical issues with ending an experiment early, with a link to the article above
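
For reference, a minimal Ruby sketch of the two-proportion z-test referred to above (an illustration of the statistic, not Vanity's actual implementation; the function name and the example counts are made up):

```ruby
# Compare the conversion rates of a baseline and an alternative using a
# pooled standard error, then turn the z-score into a two-sided p-value.
def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b)
  rate_a = conversions_a.to_f / visitors_a
  rate_b = conversions_b.to_f / visitors_b
  pooled = (conversions_a + conversions_b).to_f / (visitors_a + visitors_b)
  se     = Math.sqrt(pooled * (1 - pooled) * (1.0 / visitors_a + 1.0 / visitors_b))
  z      = (rate_b - rate_a) / se
  p      = Math.erfc(z.abs / Math.sqrt(2)) # two-sided p-value from the normal CDF
  { z: z, p_value: p }
end

# Example: baseline 120/1000 vs. alternative 150/1000
result = two_proportion_z_test(120, 1_000, 150, 1_000)
puts format("z = %.2f, p = %.3f", result[:z], result[:p_value])
```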

Potential features that might help:

  • make the Bayesian strategy the default instead of the z-score strategy, with a fixed time to run the experiment
  • have a target sample size as part of the experiment definition (or define baseline rates and calculate the required sample size from them; see the sketch after this list)
    • maybe this includes helping to determine the baseline conversion rate (run the experiment without alternatives until the baseline rate can be estimated with a calculated confidence interval: http://stats.stackexchange.com/a/38737)
  • on the dashboard, use this sample size calculation to disallow experiment completion until sample size is reached
  • report how large an effect can be detected given the current sample size: delta = (t_{alpha/2} + t_beta) * sigma * sqrt(2 / n), where the t-statistics correspond to significance level α/2 and power (1 − β)
  • offer an alternative to watching for significance: a gauge of how complete the experiment is
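
A minimal Ruby sketch of the sample-size and detectable-effect arithmetic suggested in the list above (illustrative only, not a Vanity API; the function names are made up, and the hard-coded normal quantiles for a two-sided 5% test at 80% power stand in for the t-statistics in the formula, which is a reasonable large-sample approximation):

```ruby
# Normal (large-sample) quantiles for a two-sided alpha = 0.05 test at 80% power.
Z_ALPHA_HALF = 1.96
Z_BETA       = 0.84

# Visitors needed per alternative to detect an absolute lift `delta`
# over a baseline conversion rate `baseline`.
def required_sample_size(baseline, delta)
  sigma = Math.sqrt(baseline * (1 - baseline))
  (2.0 * ((Z_ALPHA_HALF + Z_BETA) * sigma / delta)**2).ceil
end

# Smallest absolute lift detectable with `n` visitors per alternative,
# i.e. delta = (z_{alpha/2} + z_beta) * sigma * sqrt(2 / n).
def detectable_effect(baseline, n)
  sigma = Math.sqrt(baseline * (1 - baseline))
  (Z_ALPHA_HALF + Z_BETA) * sigma * Math.sqrt(2.0 / n)
end

puts required_sample_size(0.10, 0.02) # 10% baseline, 2-point lift: roughly 3,500 visitors per arm
puts detectable_effect(0.10, 5_000)   # smallest lift detectable with 5,000 visitors per arm
```

The same two functions would back both dashboard features above: blocking completion until `required_sample_size` is reached, and reporting `detectable_effect` as the experiment fills up.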

phillbaker commented Nov 04 '13

Also see http://nerds.airbnb.com/experiments-at-airbnb/, specifically the section "How long do you need to run an experiment?", which plots two graphs over time, the effect (e.g. the conversion rate delta) and the p-value, and argues that the experiment should not be stopped until the effect delta has stabilized.

phillbaker commented Jul 08 '14

Also see http://pages.optimizely.com/rs/optimizely/images/stats_engine_technical_paper.pdf for what Optimizely is doing to combat some of these problems.

phillbaker commented Jan 13 '16

Also worth looking at:

  • https://github.com/auduno/seglir/ (blog post: auduno.com/post/106141177173/rapid-ab-testing-with-sequential-analysis).
  • https://www.chrisstucchio.com/pubs/slides/gilt_bayesian_ab_2015/slides.html
  • https://www.analytics-toolkit.com/pdf/Issues%20_with_Current_Bayesian_Approaches_to_AB_Testing_in_Conversion_Rate_Optimization_2017.pdf
  • https://www.google.com/patents/US9760471

phillbaker commented Jan 18 '16