
Change detection improvements

Open rvansa opened this issue 4 years ago • 10 comments

A pseudo-issue tracking ideas for improvements in regression monitoring

rvansa avatar Sep 18 '20 12:09 rvansa

When a test produces sequential data (e.g. throughput per second), rather than just averaging it for comparison with other runs we could treat the sequence as a multi-dimensional vector and compute the distance between those vectors.

When the sequences are not of equal length we could pad the missing dimensions with the sequence's average value. Truncating the longer vector is another option, but that would lose information.
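A minimal sketch of this idea (function and variable names are mine, not anything Horreum implements): pad the shorter series with its own average, then take the Euclidean distance between the two vectors.

```python
from math import sqrt

def pad_with_average(seq, length):
    """Pad a shorter sequence to `length` using its own average value."""
    avg = sum(seq) / len(seq)
    return list(seq) + [avg] * (length - len(seq))

def series_distance(a, b):
    """Treat two per-second series as vectors and compute their
    Euclidean distance, padding the shorter one with its average."""
    n = max(len(a), len(b))
    a, b = pad_with_average(a, n), pad_with_average(b, n)
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two throughput-per-second series of unequal length:
run_a = [100.0, 102.0, 98.0, 101.0]
run_b = [99.0, 103.0, 97.0]   # shorter: padded with its average before comparing
distance = series_distance(run_a, run_b)
```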

rvansa avatar Sep 18 '20 12:09 rvansa

The critical problem with the current approach, a t-test per variable, is that as the number of non-independent variables grows, so does the likelihood of a false positive in at least one of them. It is useful to incorporate more performance counters into the dataset, but clustering them by covariance could have better properties.
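The growth of the family-wise error rate is easy to quantify: with k independent tests each run at significance level α, the chance of at least one false positive is 1 - (1 - α)^k. (For correlated variables the exact rate is lower, but it still grows with k.)

```python
def family_wise_error_rate(alpha, k):
    """Probability of at least one false positive across k independent
    tests, each run at per-test significance level alpha."""
    return 1 - (1 - alpha) ** k

# With 20 variables each tested at the usual 5% level, a false alarm on
# at least one of them is more likely than not:
fwer = family_wise_error_rate(0.05, 20)  # ~0.64
```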

rvansa avatar Sep 18 '20 13:09 rvansa

I ran an experiment with the current algorithm: I generated 200 dummy runs from an approximately normal distribution (10 + the sum of 5 random() draws) and let the algorithm do its job. It reported 25 changes: clear evidence that the approach is flawed.
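The dummy dataset can be reproduced like this (the seed and variable names are mine). The sum of 5 uniform random() draws follows an Irwin-Hall distribution with mean 2.5 and stddev sqrt(5/12) ≈ 0.645, so the runs cluster around 12.5 and are approximately normal:

```python
import random

random.seed(42)  # fixed seed so the experiment is reproducible

# Each dummy run: 10 + sum of 5 uniform random() draws, as described above.
runs = [10 + sum(random.random() for _ in range(5)) for _ in range(200)]

mean = sum(runs) / len(runs)  # expected to land near 12.5
```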

rvansa avatar Sep 24 '20 15:09 rvansa

Actually, when using minWindow = 5 and trying the 2 * stddev test, I got 10 changes, and with the t-test I got 9 changes. These numbers roughly fit the expected fraction of the population outside mean ± 2 * stddev (4.55%), or the confidence level (a 5% chance of rejecting the null hypothesis while it holds).

We get what we ask for from the statistics, even though we wish for no false positives.
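The quoted 4.55% can be checked directly from the standard normal CDF, and it predicts roughly the observed change counts on 200 runs:

```python
from statistics import NormalDist

# Fraction of a normal population falling outside mean +- 2*stddev:
outside_2_sigma = 2 * (1 - NormalDist().cdf(2))  # ~0.0455

# Over 200 runs that predicts about 9 flagged changes, in line with the
# 9-10 observed above:
expected_changes = 200 * outside_2_sigma
```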

rvansa avatar Sep 24 '20 15:09 rvansa

@johnaohara Thinking about comparing histograms, I think it could be done, and the method could be used for any constant-size vector: we could average the values to obtain the baseline, then calculate the square root of the difference in each item and average these. The regular thresholds would then apply.

I am not sure how useful this would be in practice, but it is something that makes sense to try and is not possible currently (you could compare each vector item separately, but you'd probably need higher thresholds, and diffing each vector item first is not possible today).

The UI would not need to be more complicated: this could be the default for any regression variable returning an array. If the vector size differs, though, the comparison would fail and a notification would be sent.

Charting would be a bit more difficult: I can imagine an interactive time axis, using the whole chart to display just a single histogram, with a gray 'average histogram' in the background. Optionally a log scale (with some primitive heuristic choosing the default)? If users choose to plot data with completely different scales, such a chart wouldn't be too useful, but hey, they can normalize them in the calculation function (without affecting the regression algorithm at all).
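One reading of the per-item comparison sketched above (all names are mine, and I am interpreting "square root of the difference" as the root of the squared per-item difference, i.e. the absolute deviation):

```python
from math import sqrt

def baseline(history):
    """Element-wise average of past constant-size vectors (e.g. histograms)."""
    n = len(history)
    return [sum(col) / n for col in zip(*history)]

def deviation_score(vector, base):
    """Average per-item deviation from the baseline; a size mismatch is
    treated as a failed comparison, as the comment above suggests."""
    if len(vector) != len(base):
        raise ValueError("vector size differs from baseline -> notify")
    return sum(sqrt((v - b) ** 2) for v, b in zip(vector, base)) / len(base)

# Build a baseline from two past histograms, then score a new one:
base = baseline([[1.0, 2.0], [3.0, 4.0]])   # -> [2.0, 3.0]
score = deviation_score([3.0, 4.0], base)   # average per-bucket deviation
```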

rvansa avatar Sep 07 '21 11:09 rvansa

If we decide to adopt some form of statistical tests again, we should use https://en.wikipedia.org/wiki/Holm%E2%80%93Bonferroni_method to compensate for the multiple comparisons.
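For reference, the step-down procedure is short: sort the p-values, compare the k-th smallest against α / (m - k), and stop at the first failure. A self-contained sketch:

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Indices of hypotheses rejected by the Holm-Bonferroni step-down
    procedure, which controls the family-wise error rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = []
    for rank, i in enumerate(order):
        # Compare the rank-th smallest p-value against alpha / (m - rank)
        if p_values[i] <= alpha / (m - rank):
            rejected.append(i)
        else:
            break  # once one test fails, all larger p-values fail too
    return rejected

# Four variables' p-values; only the two smallest survive the correction:
kept = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```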

rvansa avatar Sep 30 '21 14:09 rvansa

When monitoring performance on a branch, it might happen that a regression is introduced and later fixed. Horreum does not let us confirm that the performance after the fix is equal to the performance before the regression.

rvansa avatar Mar 09 '22 15:03 rvansa

Worth reading paper on change detection: https://arxiv.org/pdf/1101.1438.pdf

rvansa avatar Jul 18 '23 12:07 rvansa

Hey @rvansa hope you are doing well! thanks for the link to the paper, will take a look

johnaohara avatar Jul 18 '23 12:07 johnaohara

Hi John, yep, except no AC in my home office :) I actually found the paper when I stumbled upon this python library: https://centre-borelli.github.io/ruptures-docs/
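To make the idea concrete without depending on the library, here is a minimal self-contained sketch in the spirit of the offline change-point methods that paper surveys and ruptures implements: find the single split that minimizes the total within-segment squared error (names and the toy signal are mine).

```python
def best_split(signal):
    """Single change-point detection: return the index that minimizes
    total within-segment squared error, or None if no split improves
    on treating the signal as one segment."""
    def sse(seg):
        mean = sum(seg) / len(seg)
        return sum((x - mean) ** 2 for x in seg)

    best_k, best_cost = None, sse(signal)
    for k in range(1, len(signal)):
        cost = sse(signal[:k]) + sse(signal[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# A level shift at index 5 is recovered:
split = best_split([10, 10, 11, 10, 10, 15, 15, 16, 15, 15])  # -> 5
```

Libraries like ruptures generalize this recursively (binary segmentation) or via dynamic programming to find multiple change points with different cost models.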

rvansa avatar Jul 18 '23 12:07 rvansa