
Test differences date by date

Open alexisrosuel opened this issue 6 years ago • 6 comments

Context

It is very useful when running an A/B test to see the evolution of the difference / p-values / credible intervals / etc. over time. For instance, if I start an experiment on 2018-04-01 and finish it on 2018-04-30, I would like to know what the state was (in terms of p-value, etc.) each day. It helps to visualize whether the test has "converged" or not, as in this Airbnb example (source: https://medium.com/airbnb-engineering/experiments-at-airbnb-e2db3abf39e7).

Proposition

Would it be possible to apply the statistical analysis sequentially, date by date? It could apply the analysis to the sequence [df[df.date <= dt.datetime(2018, 4, 1) + dt.timedelta(days=i)] for i in range(30)] and then report the same JSON, but with a date level at the top. (Maybe there is a much cleaner architecture than this!)
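For illustration, here is a minimal sketch of that sequential slicing, with a plain Welch t-test standing in for ExpAn's actual analysis; the column names ('date', 'variant', 'kpi') are hypothetical:

```python
import datetime as dt

import pandas as pd
from scipy import stats

def daily_pvalues(df: pd.DataFrame, start: dt.date, n_days: int) -> dict:
    """Re-run a two-sample test on a growing, date-limited slice of the
    data and report one p-value per cut-off date. A Welch t-test stands
    in for the real analysis; the column names are hypothetical."""
    out = {}
    for i in range(n_days):
        cutoff = pd.Timestamp(start + dt.timedelta(days=i))
        s = df[df["date"] <= cutoff]
        a = s.loc[s["variant"] == "A", "kpi"]
        b = s.loc[s["variant"] == "B", "kpi"]
        _, p = stats.ttest_ind(a, b, equal_var=False)
        out[cutoff.date().isoformat()] = p
    return out
```

The resulting dict could then be nested into the usual JSON report under a top-level date key, as proposed above.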

Thanks

alexisrosuel avatar Mar 19 '18 19:03 alexisrosuel

Dear Alexis @alexisrosuel,

thanks a lot for the suggestion. What you're talking about here seems to me to be an instance of the 'early stopping' problem, and it is subject to multiple hypothesis testing issues: the more often you look at your p-value, the higher the probability of seeing spurious significance by chance.

ExpAn kind of supports early stopping in a highly experimental mode and tries to mitigate the risk of spurious early stopping by applying a stricter p-value threshold when there is less data than expected. But it always consumes all the data that is present in the dataframe.

Let me know if I understood your question correctly.

Best, Grisha

gbordyugov avatar Mar 21 '18 08:03 gbordyugov

Hi Grisha,

In fact the idea behind this chart (and the whole Airbnb Medium article) is the opposite. They wanted to point out that the p-value can fluctuate over time, go below the significance threshold, and then either stay there forever or not.

The chart shows this: if you stop the experiment shown there around day 10, you commit a type I error. But if you let the experiment run for a few more days, you see that the p-value in fact "converges" around its true value.

To recap, this does not provide an early stopping criterion. It helps to monitor whether the p-value still behaves erratically (so we can't stop the experiment at that moment), or whether it hasn't changed for a "long time" (to be defined). For me the ideal criterion is:

  • look at the true statistical early stopping criterion (the aim of this package)
  • accept these results iff the p-value graph has converged (a rough sketch of such a check follows below)
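To make the second point concrete, a naive convergence check could require the last few daily p-values to stay within a narrow band; both the window and the tolerance below are arbitrary illustrative choices, not a recommendation:

```python
def pvalue_converged(pvalues: list[float], window: int = 5, tol: float = 0.01) -> bool:
    """Heuristic: report convergence when the last `window` daily
    p-values all lie within `tol` of each other. Window and tolerance
    are arbitrary illustrative values."""
    if len(pvalues) < window:
        return False
    tail = pvalues[-window:]
    return max(tail) - min(tail) <= tol
```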

What do you think of it?

alexisrosuel avatar Mar 21 '18 08:03 alexisrosuel

Please pardon my poor wording: what I meant in my first reply is exactly what you're talking about:

"The more often you look at your p-value, the higher the probability of seeing spurious significance by chance."

Our early stopping logic counteracts effects like this by reducing the alpha threshold at the beginning of the experiment (where you've got less data), so it's not 0.05 but much smaller for small quantities of data in the first days.

gbordyugov avatar Mar 21 '18 08:03 gbordyugov

Oh indeed I see your point now too :)

Yes, ExpAn uses some kind of "dynamic p-value threshold", so could we plot this value day by day, along with the observed p-value?

alexisrosuel avatar Mar 21 '18 10:03 alexisrosuel

Yes, the "dynamic threshold" is based on the information fraction, which is the ratio of the current sample size to the estimated sample size for the experiment.

Here is the method we use: https://github.com/zalando/expan/blob/master/expan/core/early_stopping.py#L24-L36
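For context, the information-fraction idea follows the classic O'Brien-Fleming-style boundary: the two-sided z-boundary at full information is inflated by 1/sqrt(t), which makes the effective alpha much smaller early on. Below is the textbook approximation, not necessarily byte-for-byte what ExpAn implements:

```python
from math import sqrt

from scipy.stats import norm

def obrien_fleming_alpha(information_fraction: float, alpha: float = 0.05) -> float:
    """Textbook O'Brien-Fleming-style approximation: inflate the z-boundary
    by 1/sqrt(t), where t = current sample size / planned sample size,
    then back-transform to an alpha level."""
    z_full = norm.ppf(1 - alpha / 2)            # boundary at full information
    z_t = z_full / sqrt(information_fraction)   # stricter boundary early on
    return 2 * (1 - norm.cdf(z_t))

for t in (0.1, 0.25, 0.5, 1.0):
    print(f"t={t:.2f}: alpha={obrien_fleming_alpha(t):.5f}")
```

At t = 1.0 this recovers the nominal 0.05; at t = 0.1 the threshold is on the order of 10^-10, which matches the "much smaller for small quantities of data" behaviour described above.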

shansfolder avatar Mar 21 '18 10:03 shansfolder

Whether it is a day-by-day analysis or some other period depends on how your code calls ExpAn.

shansfolder avatar Mar 21 '18 10:03 shansfolder