aqa-test-tools
Verify sensitivity of Glitchwitcher REPD approach by inserting mutations into draft PRs
Now that there is a way to run the REPD approach against a set of changed files (in a PR), let us also check how 'sensitive' this approach is to different types of mutations and code changes that we orchestrate.
Let's select various files from the OpenJ9/OpenJDK repos and make our own "bad changes" by applying several mutators from the PIT list (https://pitest.org/quickstart/mutators/), then see what % score each draft PR receives. PIT generates mutations for Java code, but many of its mutators can be applied by hand to any codebase. One could also create a draft PR that applies several mutations across several files.
| Mutator(s) | File(s) mutated | Resulting % score |
|---|---|---|
| Increments Mutator | https://github.com/eclipse-openj9/openj9/blob/master/runtime/gc_base/ContinuationObjectList.cpp#L58 | TBD |
| Negate Conditionals Mutator | https://github.com/eclipse-openj9/openj9/blob/master/runtime/j9vm/javanextvmi.cpp#L228 | TBD |
| ... | ... | ... |
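
Because the mutators have to be applied by hand to C/C++ sources, a small text-rewriting helper could make the mutations repeatable. The following is only a minimal sketch in Python, assuming that line-level regex replacement is good enough for the chosen lines. The function names are hypothetical, the sample input lines are invented and are not the contents of the files linked above, and the helper does not parse C++ (templates, string literals, and comments are not handled). The operator mappings follow the PIT descriptions of the Negate Conditionals and Increments mutators.

```python
import re

# Negate Conditionals: each comparison is replaced by its logical opposite.
NEGATE = {"==": "!=", "!=": "==", "<=": ">", ">=": "<", "<": ">=", ">": "<="}

# Two-character operators are tried first so "<=" is not read as "<" then "=".
# The lookarounds skip "->", "<<", ">>" and their compound assignments.
_COMPARISON = re.compile(r"==|!=|(?<![<>])[<>]=|(?<![-<>])[<>](?![<>=])")


def negate_conditionals(line):
    """Return the line with every comparison operator negated."""
    return _COMPARISON.sub(lambda m: NEGATE[m.group(0)], line)


def swap_increments(line):
    """Return the line with "++" and "--" exchanged (Increments mutator)."""
    return re.sub(r"\+\+|--",
                  lambda m: "--" if m.group(0) == "++" else "++",
                  line)


# Invented sample lines, for illustration only.
print(negate_conditionals("if (NULL == list) {"))  # if (NULL != list) {
print(swap_increments("count++;"))                 # count--;
```

Each mutated line would still need a manual review before it goes into a draft PR, since a regex cannot tell a comparison from a template bracket.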
TL;DR
- I injected a range of PIT-style mutations (boundary, logical, math, returns, side-effect removal) into anirudhsengar/OpenJ9 via PRs, then ran the REPD approach against those PRs.
- Mutations that short-circuit behavior or remove side effects (e.g., Empty returns, Void method call removal, forcing returns) cause the largest score increases, meaning REPD is most sensitive to structural/semantic breakages that obviously degrade correctness and resource handling.
- Subtle control-flow adjustments (boundary flips, increments, negations, math tweaks) move the needle only slightly.
Results
Notes:
- “Defective/Non-Defective % Change” reflects how the REPD score moved on those classes after the mutation relative to baseline on the same files.
What the results suggest about REPD sensitivity
- Biggest signals: removing behavior and short-circuiting flow
  - Empty returns (+21.19% / +55.36%): Exiting a method early (e.g., returning `NULL`/`nullptr` or returning prematurely) produces large, consistent structural damage, and REPD strongly flags these changes.
  - Void Method Call removal (+11.54% / +18.08%): Eliminating calls that have side effects (e.g., cleanup/close, permission checks, tracepoints, synchronization) materially alters program semantics. REPD consistently treats this as high risk.
  - Forced return constants (True/Null/Primitive returns; Return Values): These increasingly “freeze” dynamic paths and error handling. The higher the probability of suppressing failures or misreporting state, the larger the REPD bump (up to +8.89% for primitive returns).
- Moderate signals: blatant control-flow forcing
  - True/False returns (esp. True returns), and Null/Primitive return families: These mutations steer code into atypical paths or failure modes (e.g., reporting success, skipping checks, returning invalid pointers), which REPD catches as materially riskier than baseline.
- Small/no signal: micro-control-flow and arithmetic tweaks
  - Conditionals boundary, increments, negate conditionals, negatives, math: REPD tends to treat these as low-risk noise, hence the tiny score deltas.
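
To make the difference between these mutation families concrete, here is a toy sketch in Python. Everything in it (the `Resource` class, the function names, the data) is hypothetical and unrelated to the OpenJ9 sources or to how REPD computes its score. It only shows how far each kind of mutant drifts from the original behavior: the boundary mutant differs on a single input value, while the call-removal and empty-return mutants change side effects or discard the work entirely.

```python
class Resource:
    """Toy stand-in for something that must be released after use."""

    def __init__(self):
        self.open = True

    def close(self):
        self.open = False


def original(values, limit, res):
    # Keep values strictly below the limit, then release the resource.
    kept = [v for v in values if v < limit]
    res.close()
    return kept


def boundary_mutant(values, limit, res):
    # Conditionals Boundary style: "<" becomes "<=".
    kept = [v for v in values if v <= limit]
    res.close()
    return kept


def void_call_removed_mutant(values, limit, res):
    # Void Method Call removal style: the close() call is deleted.
    kept = [v for v in values if v < limit]
    return kept


def empty_return_mutant(values, limit, res):
    # Empty-return style: exit immediately with an empty result.
    return []


data = [1, 5, 10]
for fn in (original, boundary_mutant,
           void_call_removed_mutant, empty_return_mutant):
    res = Resource()
    print(fn.__name__, fn(data, 10, res), "resource open:", res.open)
```

In this toy case the boundary mutant still closes the resource and only admits the value equal to the limit, whereas the other two mutants leave the resource open.
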
Methodology (high level)
- For each mutator, I created a PR in my OpenJ9 fork that applied representative mutations across 10 files.
- I then ran the REPD approach against each PR to compute how its scoring changed on the affected classes compared to baseline.
- I aggregated the deltas to the “Average % Change” values shown above.
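
The aggregation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used for the reported numbers: it assumes that "% change" means the relative change of a score against its baseline, and that the per-file changes are averaged with a plain arithmetic mean. The function names and the example scores are hypothetical. The same calculation would be run once for the defective score and once for the non-defective score.

```python
from statistics import mean


def percent_change(baseline, mutated):
    """Relative change of a score, as a percentage of the baseline value."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (mutated - baseline) / baseline * 100.0


def average_percent_change(score_pairs):
    """Average the per-file percent changes for one mutator's PR.

    score_pairs is a list of (baseline_score, mutated_score) tuples,
    one tuple per mutated file.
    """
    return mean(percent_change(b, m) for b, m in score_pairs)


# Invented scores for three files, purely for illustration.
example = [(20.0, 25.0), (40.0, 44.0), (10.0, 10.0)]
print(round(average_percent_change(example), 2))
```

If the scores were instead compared in percentage points, `percent_change` would reduce to `mutated - baseline`, so the choice of definition should be stated next to any reported average.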
Takeaway
- REPD is most sensitive to mutations that remove behavior or force outcomes (empty returns, void method call removal, hardcoded return values), and much less sensitive to small control-flow and arithmetic tweaks.