
Significance Factor is flawed

Open jacob-greenfield opened this issue 3 years ago • 10 comments

Currently there is a PR for Rust, rust-lang/rust#97293, which I suspect is experiencing significant noise due to the way rustc-perf calculates significance. After reading about how the significance factor is calculated, I believe it is severely flawed in multiple ways.

First, the relative change from a single measurement of the previous commit to a single measurement of the new commit is extremely sensitive to noise. At the moment there is no way to reliably tell whether a change in measurements between two commits is actually significant or if it's an outlier due to noise, just by looking at the raw data.

Next, since the quartiles of historical data are entirely unweighted, any prior commit that significantly affects performance will alter the quartiles and thus the significance factor in a very severe and problematic way. For example, let's say we are using the previous 100 commits as our historical data, and we are considering a very low-noise test. A new commit increases compile times by ~1% across the board. At first, there will be no change to the quartiles; then, after 25 new commits, Q3 will suddenly sharply increase, and only after 75 commits, Q1 will suddenly sharply increase to match.

  • If you do any perf run before 25 commits, the quartiles will be outdated (but probably still reasonable).
  • If you do any perf run between 25 and 75 commits, Q3 will sharply increase, the IQR will be hugely inflated, and the significance factor will be underestimated. If the new commit instead decreased compile times by ~1%, the IQR would be hugely deflated and the significance factor would be overestimated.
  • Only after 75 commits, everything somewhat returns to normal.

(This effect is even worse for commits that are exactly 25 or 75 commits later, e.g. when the quartiles sharply change on the border of the merge base and the PR commit). I hope you agree that these issues are severe and need to be fixed.
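To make the window effect concrete, here is a minimal stdlib-Python sketch (illustrative only, not rustc-perf code; the window size and timings are made up) of how the IQR over a trailing 100-commit window reacts to a noise-free 1% step change:

```python
# Illustrative only: a noise-free benchmark steps from 1.00s to 1.01s,
# and we watch the IQR of a trailing 100-commit window.
import statistics

WINDOW = 100
BEFORE, AFTER = 1.00, 1.01

def iqr_after(k):
    """IQR of the window once k post-change commits have landed."""
    data = [BEFORE] * (WINDOW - k) + [AFTER] * k
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

for k in (0, 24, 25, 50, 75, 76):
    print(f"{k:3} commits after the change: IQR = {iqr_after(k):.4f}")
```

The IQR stays at zero for the first ~24 commits, inflates once the new values cross Q3, and only collapses back once they swamp Q1 as well, matching the timeline described above.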


Due to these issues, I would like to propose an entirely new method instead:

  1. Instead of generating one perf reading for each new commit, generate a number of perf readings for each nightly and use the most recent nightly as the merge base for the PR. Assume the data is normally distributed and calculate the mean/variance.
  2. Do a perf run for the new commit. Assume that the variance under the new commit is the same as the variance of the nightly commit.
  3. Find the smallest two-sided prediction interval containing the new measurement, and report the percentage as the result. (Under the given assumptions, this is equivalent to calculating the P-value of a two-tailed hypothesis test).
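As a sketch of steps 1-3 (Python rather than Rust, with made-up sample timings; it assumes normally distributed measurements and, for simplicity, treats the estimated variance as known, so the two-tailed p-value reduces to a z-test):

```python
# Hedged sketch of the proposed method; the sample data is hypothetical.
import math
import statistics

def significance(nightly_runs, pr_run):
    """Two-tailed p-value for one PR measurement against the distribution
    estimated from the nightly's repeated runs."""
    mu = statistics.mean(nightly_runs)
    sd = statistics.stdev(nightly_runs)
    # A new observation deviates from the *estimated* mean with extra
    # variance 1/n, hence the sqrt(1 + 1/n) prediction-interval widening.
    z = abs(pr_run - mu) / (sd * math.sqrt(1 + 1 / len(nightly_runs)))
    return math.erfc(z / math.sqrt(2))  # 2 * (1 - Phi(z))

nightly = [1.002, 0.998, 1.001, 0.999, 1.000]  # hypothetical timings (s)
print(significance(nightly, 1.001))  # well inside the noise
print(significance(nightly, 1.050))  # far outside it
```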

Pros: much much more accurate

  • Uses only data from a single prior commit; completely unaffected by historical data.
  • Any significant changes are reflected immediately and completely.
  • Less prone to noise (thanks to a far more accurate method of quantifying variance).
  • I'm not a statistician, but I believe this also has a stronger mathematical backing than the current method.

Cons:

  • PRs must be rebased onto the latest nightly instead of master. (is this actually a pro?...)
  • May require a higher overall number of perf runs, though this is configurable.

Side notes:

  • If the sample size for each nightly is small, perhaps variance (and maybe mean as well) should be calculated using some variation of a high-alpha exponential moving average; and/or the variance should be considered "known" for the purpose of the prediction interval.
  • Could also generate multiple measurements for each requested perf test, which would provide even better results but would obviously be slow/expensive.
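One possible reading of the EMA side note, as a hedged sketch (the alpha value and the idea of folding per-nightly sample variances into one estimate are my own interpretation, not anything rustc-perf currently does):

```python
def ema_variance(per_nightly_variances, alpha=0.7):
    """Exponentially weighted estimate of benchmark variance across
    nightlies: a high alpha makes old noise estimates decay quickly."""
    est = per_nightly_variances[0]
    for v in per_nightly_variances[1:]:
        est = alpha * v + (1 - alpha) * est
    return est

# Steady noise keeps the estimate steady; a one-off spike decays fast.
print(ema_variance([4.0, 4.0, 4.0]))
print(ema_variance([4.0, 16.0, 4.0, 4.0]))
```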

Anyway, I am new to this project, so I would love to hear any comments and feedback!

jacob-greenfield avatar May 31 '22 07:05 jacob-greenfield

Thanks for the feedback! Overall, this definitely seems like a good proposal. I'm just not sure how difficult it would be to implement. Some thoughts and questions:

I suppose every day there would need to be an additional perf run for the nightly artifact. This doesn't seem like it would be too big of a burden and could even be useful outside of this proposal.

the most recent nightly as the merge base for the PR

This might cause some headaches, but I'm not sure. Rebasing over master tends to be easier, as that's the state the PR must be in to merge. This also means that interactions between the PR and the commits that occurred after the nightly was released will not be reflected in performance measurements.

What would we do with post merge perf runs?

I hope you agree that these issues are severe and need to be fixed.

I agree that the methodology we're currently using is far from ideal. I would love for us to have a better solution, but we've yet to come up with one that's easy to implement and improves on the situation.

If the sample size for each nightly is small, perhaps variance (and maybe mean as well) should be calculated using some variation of a high-alpha exponential moving average; and/or the variance should be considered "known" for the purpose of the prediction interval.

I'm not sure I follow this. Can you explain more?

rylev avatar May 31 '22 14:05 rylev

For analysis of merged PR's maybe changepoint analysis would work?

bjorn3 avatar May 31 '22 14:05 bjorn3

To clarify, the crux of the change is: instead of using historical data from the past 100 commits, gather multiple perf runs for only the direct merge base. Originally I had considered ways to denoise the current quartile-based method (e.g. normal distribution based on an exponential moving average, outlier removal, etc.), but it still is vulnerable to misleading results for the first few commits after a significant change. Because of that, if we want the most accurate results, I think historical data prior to the merge base should not be included at all (with the exception of variance).

I suppose every day there would need to be an additional perf run for the nightly artifact.

Yes, ideally multiple additional perf runs; the more the better.

Rebasing over master tends to be easier as that's the state the PR must be in in order to merge.

True, but the difference is only between the latest nightly and master. If you need to, you could wait 1 day until the commits you rely on have been included in the nightly and perf-tested. Or in very rare situations, if you really really need to, I guess the nuclear option is to close the tree for a while. I think there's some precedent for this anyway, unrelated to perf-testing.

This also means that interactions between the PR and the commits that occurred after the nightly was released will not be reflected in performance measurements.

Since we would be building every nightly multiple times, this also arguably makes it less important to perf-test every single PR, so perhaps some PRs that are not expected to alter performance (e.g. rollups) could be skipped in order to balance out the cost of running the builders. Ideally every single commit would be perf-tested many times, but I am trying to consider which tests are the most important/relevant in order to be efficient. However, if there is a larger budget for perf-testing, then probably the best idea would be building the nightlies multiple times in addition to everything that is already being done.

If there is an unexpected change between two nightlies, you can always just pick a test that changed and bisect locally. I admit it does make things more difficult but IMHO it's easily worth the tradeoff of more accurate perf runs for PRs. I suppose that's for the maintainers to decide though.

What would we do with post merge perf runs?

Per above, certain commits may not have as much of a need for post-merge perf runs. For those that do, I guess they could be included in the graphs, but I feel strongly that the historical data (possibly modulo variance) should be based solely on the merge base, not any prior commits. That is why I am suggesting using each nightly as a merge base and profiling it multiple times to gather data, rather than using quartiles from the past 100+ commits, which are affected by how "recent" any prior perf changes are. This is the major improvement over the current method.

Originally I was considering just switching from IQR to a normal distribution based on an exponential moving average. This is still a possibility and pretty trivial to implement, and it improves things a bit, but it can still produce inaccurate results for the first few commits directly following a significant perf change. For that reason, I think it is better to test based only on the merge base, using multiple runs.

I would love for us to have a better solution, but we've yet to be able to come up with one yet that's easy to implement and improves on the situation.

I'm not totally sure how hard this would be to implement. The actual calculations are definitely annoying but doable, I guess I am not familiar enough with the project to know how hard it would be to implement the other changes. Out of curiosity, what are some prior suggestions?


For analysis of merged PR's maybe changepoint analysis would work?

Perhaps. However, rather than changepoint analysis or anomaly detection, I believe this problem is better represented by a Student's t-test assuming equal variance: given a few samples of the merge base and one sample of the PR, for each crate/test find the P-value for the null hypothesis that there is no change. This is the same as calculating the prediction interval; see also the StackExchange answer I linked before.
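Using only the standard library, that t-test framing might be sketched like this (sample data hypothetical; without scipy, the t statistic is compared against a hardcoded two-sided 5% critical value instead of computing an exact P-value):

```python
import math
import statistics

def t_statistic(base_runs, pr_run):
    """t statistic for one PR measurement against n merge-base runs,
    in prediction-interval form (df = n - 1)."""
    mu = statistics.mean(base_runs)
    sd = statistics.stdev(base_runs)
    return abs(pr_run - mu) / (sd * math.sqrt(1 + 1 / len(base_runs)))

base = [1.002, 0.998, 1.001, 0.999, 1.000]  # hypothetical timings (s)
T_CRIT = 2.776  # two-sided 5% critical value for df = 4

print(t_statistic(base, 1.001) > T_CRIT)  # noise-level change: False
print(t_statistic(base, 1.050) > T_CRIT)  # clear regression: True
```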

I also think there is room for other improvements as well, especially some way to control the family-wise error rate, but fixing the significance factor is more important.

jacob-greenfield avatar May 31 '22 20:05 jacob-greenfield

To clarify what I meant about using an exponential moving average for the variance: I think it's a reasonable assumption that the variance/noisiness tends not to change much over time, and we might get more accurate results if we do base the variance on historical data (with a much higher weight given to more recent values, ie. high alpha value), but I strongly feel that basing the average on any historical data prior to the merge base at all is a mistake.

jacob-greenfield avatar May 31 '22 20:05 jacob-greenfield

perhaps some PRs that are not expected to alter performance (e.g. rollups) be skipped in order to balance out the cost of running the builders

Alas, it is quite common for such PRs to alter performance, contrary to expectations.

nnethercote avatar May 31 '22 22:05 nnethercote

Over the past 60 days I could only find 4 instances where performance was significantly altered by a merged PR that did not already have a manually-triggered perf run. I believe this is an acceptably low level, and IMHO it is worth the tradeoff. I think that today it may seem that PRs alter performance more often than they really do, partly because of spurious results due to this very issue!

In fact, I wonder if an overall better alternative to using nightlies could be: generate the average based on the previous ~5 commits, but add a system to mark PRs as affecting performance. For any such PR, during the pre-merge r+ test, trigger multiple perf runs and once merged, include all of them in the graph sequentially as separate data points with an indicator marking that they are for the same commit. Compared to my original proposal, this would be much simpler to implement, much more efficient, and only slightly less accurate; I think that's definitely worth it.

However it also re-raises the existing issue of what to do when a commit unexpectedly introduces a significant perf change. I strongly feel that in order for perf tests to be reasonably accurate, you need at least a few data points of consistent data up to and through the merge base before you can decide whether any new changes are significant. Any merged PRs that unexpectedly alter performance (and thus do not have enough data points already gathered) would violate this principle.

Due to this, I would like to suggest that perf tests should be enabled by default as part of the r+ process rather than post-merge; if any significant change is found, the merge should fail. The author must then either fix any regressions, or e.g. tag the PR as perf-significant, triggering more extensive profiling per above. I recognize that this is no small change and would need much further discussion, but regardless of reimplementing the significance factor, I believe this is a really great opportunity for a simple and major improvement. Curious to hear what other people think about this.

jacob-greenfield avatar Jun 01 '22 03:06 jacob-greenfield

@rylev I'm trying to test this out locally but the data downloaded by cargo run --bin fetch-latest at https://perf-data.rust-lang.org/export.db.sz seems to be outdated and not contain any artifacts, and I think as a result of this historical_data seems to always be None inside TestResultComparison::significance_threshold(). Where can I download a good results database?

jacob-greenfield avatar Jun 04 '22 03:06 jacob-greenfield

I've just refreshed the export; it's currently a manual process to do so. The new export should be up to date and have ~all historical data, but omits the self profile query table (since that would greatly increase its size for no strong benefit).

Mark-Simulacrum avatar Jun 05 '22 00:06 Mark-Simulacrum

@Mark-Simulacrum Thanks but it still looks like I'm getting the old database, am I downloading from the right link? Maybe it's just the S3 cache or something.

Also the site is supposed to work fine (i.e. with all features, just slower) without the self profile query table correct?

jacob-greenfield avatar Jun 05 '22 01:06 jacob-greenfield

The detailed query pages won't be available, but everything else should work, I think. I've invalidated the cloudfront cache on the bucket; please try again.

It'll probably be faster or on-par with the hosted site; I wouldn't expect it to be slower, presuming you have an SSD and/or some RAM free to cache the file in-memory.

Mark-Simulacrum avatar Jun 05 '22 01:06 Mark-Simulacrum