
Improve reliability of the regression detection workflow

Open tgregg opened this issue 1 year ago • 3 comments

Currently, the performance regression detection workflow executed via GitHub Actions is unreliable due to high variance in its results. We suspect we can drive down this variance by executing the regression detection workflows on hardware that is guaranteed to be reserved for them and that runs them serially.

Below are some options that have been discussed:

  1. Configure GitHub Actions to submit the jobs to reserved AWS hardware that we control. This is the more complicated of the two options listed here, but has the benefit of not shifting any burden onto the PR requester.
  2. Change the workflow so that it verifies reports uploaded by the PR requester, adding a build phase that executes the regression detection workflow (before/after runs) on the requester's hardware. This is the simpler option, but has the drawback that submitting a PR becomes a bit more onerous.

tgregg · Feb 28 '24 20:02

In #746, I was having a hard time reproducing my results consistently even though I was always running the tests serially on the same hardware. I had other processes running at the time (which would also be the case for option 2), but I was running the tests in a single JVM using a single core of my 8-core M1 Pro CPU. It's unclear to me whether HotSpot optimizations are applied deterministically (i.e., will two runs of the same program with the same inputs result in the same HotSpot optimizations?), and that may be a confounding factor here.

TL;DR: dedicated hardware certainly can't hurt the test reliability, but it might not improve it either.
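
One way to reduce the JIT-related noise, assuming the before/after measurements are JMH-based, is to spread each measurement across several freshly forked JVMs so that run-to-run differences in HotSpot's compilation decisions are averaged out rather than captured as a single outcome. Below is a minimal sketch of that configuration; the class name, workload, and iteration counts are placeholders, not the project's actual benchmark code.

```java
// Minimal sketch (placeholder workload): each benchmark runs in several
// independent JVM forks so that variation in HotSpot's JIT decisions is
// averaged across forks instead of dominating a single measurement.
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class ForkedBenchmarkSketch {

    @Benchmark
    @Fork(5)                         // five independent JVMs per benchmark
    @Warmup(iterations = 10)         // let the JIT settle before measuring
    @Measurement(iterations = 20)    // measured iterations per fork
    public long placeholderWorkload() {
        // Placeholder for a real Ion read/write workload; returning the result
        // keeps the loop from being eliminated as dead code.
        long acc = 0;
        for (int i = 0; i < 1_000; i++) {
            acc += i;
        }
        return acc;
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(ForkedBenchmarkSketch.class.getSimpleName())
                .build();
        new Runner(options).run();
    }
}
```

The obvious trade-off is that more forks multiply the wall-clock cost of the workflow, so this buys lower variance at the price of longer CI runs.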

popematt · Feb 28 '24 22:02

GitHub Actions runner in AWS CodeBuild: https://docs.aws.amazon.com/codebuild/latest/userguide/action-runner.html

tgregg · Feb 28 '24 22:02

The regression detector came up after a point release caused a substantial performance regression, right? How substantial was that? Are we overtuned here? Are we trying to detect any regression at all, or only to prevent a disastrous one?

Have we considered some approach like JProffa, which measures bytecodes executed instead of wall-clock time?

If contention is a factor, could we try to control for it by making both halves of the comparison run concurrently? That will make contention even worse, but it ought to affect both sides of the split evenly.
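
As a rough, hypothetical sketch of that idea (the class, workloads, and timing loop below are placeholders, not the existing workflow's code): both halves are released from the same latch so that whatever contention exists on the host hits them at the same time.

```java
// Rough sketch of running the "before" and "after" halves of the comparison
// concurrently, so background contention on the host affects both sides at
// once. The workloads are placeholders; the real workflow would plug in the
// two ion-java revisions under test.
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentComparisonSketch {

    /** Times a workload after waiting on a shared latch so both sides start together. */
    private static long timeWorkload(Runnable workload, CountDownLatch startSignal)
            throws InterruptedException {
        startSignal.await();
        long start = System.nanoTime();
        workload.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch startSignal = new CountDownLatch(1);

        // Placeholder workloads standing in for the "before" and "after" revisions.
        Runnable baseline = () -> busyWork(50_000_000);
        Runnable candidate = () -> busyWork(50_000_000);

        Callable<Long> baselineTask = () -> timeWorkload(baseline, startSignal);
        Callable<Long> candidateTask = () -> timeWorkload(candidate, startSignal);
        Future<Long> baselineNanos = pool.submit(baselineTask);
        Future<Long> candidateNanos = pool.submit(candidateTask);
        startSignal.countDown(); // release both sides at once

        double ratio = (double) candidateNanos.get() / baselineNanos.get();
        System.out.printf("candidate/baseline time ratio: %.3f%n", ratio);
        pool.shutdown();
    }

    private static void busyWork(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += i;
        }
        if (acc == 42) { // keep the loop from being optimized away entirely
            System.out.println(acc);
        }
    }
}
```

In a real run the two placeholder workloads would be the before/after revisions under test, repeated and aggregated; the point is only that starting both sides together turns shared contention into a common-mode effect rather than a bias toward one side.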

jobarr-amzn · Feb 28 '24 23:02