Experiments in Semi-automatic Software Dev
@NikolajBjorner, @pelikhan and I are running experiments in semi-automatic software repo maintenance and development in this repository for a few days:
- Semi-automatic test coverage improvement
- Semi-automatic perf improvement
- Semi-automatic backlog burner
We expect the test coverage improvements to be fairly routine and positive, with maybe 80% of PRs successfully merged.
We expect the others to be much less useful, but are running them anyway to see what comes up, for the depth of the research, and to work on having the agents improve the evidence they collect.
Maintainers and admins can give guidance by commenting on the planning done by the agents.
Maintainers and admins, please comment on PRs and planning issues about:
- Are the plans looking in the right areas, and are there clear priorities?
- Are the PRs acceptable (probably not, in many cases)?
- Are the PRs of any interest at all, e.g. as a draft, a reference point, or insight into what is possible?
- What might be needed to make something convincing?
- Would there be a better way to achieve the same effect?
@nunoplopes Please feel very welcome to continue to make assessments of the Perf and Backlog PRs, and we can close them out.
The test coverage ones should generally be acceptable.
A low or zero PR-acceptance rate on the perf/backlog ones is still super useful. It would be great to have a reason attached to each one. Any assessment along the lines of "this is/isn't useful as information or a draft, but we can't accept the PR" is also of interest. "Total rubbish" is totally ok too :)
Here are my preliminary observations:
The big deal for me with copilot in the portal is that it offers low-touch issue handling. The pass rate isn't stellar, but it is enough that the passing instances save time. These agentic workflows have a related flavor. The first wave overwhelmed the build actions with a long backlog, but it has since calmed down.
Test improver: it is generally good to have this. Some more iterations would be useful:
- We should tweak the instructions to take into account the coverage offered by the example programs (C/C++) that exercise the API; API unit tests can be added there too. The unit tests that cover the API are not too interesting, but they can catch some regressions (and add some overhead to development).
- I tend to think of unit testing internal features (not exposed over the API) as more useful, because setting up the scaffolding manually is more painful, and automation has the potential to get over that hump. Example: unit and perf testing different ways to multiply polynomials.
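To make the polynomial example concrete, here is a minimal, standalone sketch of what such a unit + perf harness could look like. It is Python purely for illustration; the real targets would be Z3-internal C++ routines, and the two multiplication variants (mul_dense, mul_sparse) are made up for the example.

```python
# Standalone sketch of a unit + perf harness comparing two ways to multiply
# polynomials represented as dense coefficient lists. Illustrative only; the
# real targets would be Z3-internal routines.
import random
import timeit
import unittest


def mul_dense(p, q):
    """Schoolbook multiplication over dense coefficient lists."""
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        if a == 0:
            continue
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r


def mul_sparse(p, q):
    """Collect nonzero terms first; pays off when most coefficients are zero."""
    terms_p = [(i, a) for i, a in enumerate(p) if a != 0]
    terms_q = [(j, b) for j, b in enumerate(q) if b != 0]
    r = [0] * (len(p) + len(q) - 1)
    for i, a in terms_p:
        for j, b in terms_q:
            r[i + j] += a * b
    return r


class MulTests(unittest.TestCase):
    def test_small_case(self):
        # (1 + 2x) * (3 + 4x) = 3 + 10x + 8x^2
        self.assertEqual(mul_dense([1, 2], [3, 4]), [3, 10, 8])

    def test_variants_agree_on_random_inputs(self):
        rng = random.Random(0)
        for _ in range(50):
            p = [rng.randrange(-5, 6) for _ in range(rng.randrange(1, 40))]
            q = [rng.randrange(-5, 6) for _ in range(rng.randrange(1, 40))]
            self.assertEqual(mul_dense(p, q), mul_sparse(p, q))


if __name__ == "__main__":
    # crude perf comparison on a mostly-zero input, then the unit tests
    p = [1 if i % 50 == 0 else 0 for i in range(2000)]
    for f in (mul_dense, mul_sparse):
        print(f"{f.__name__}: {timeit.timeit(lambda: f(p, p), number=3):.3f}s")
    unittest.main()
```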
Perf improver: it is good to have the TA report what else is out there in the way of basic algorithms, such as string hashes and hash table layout. There was also a bunch of deceptive PRs: low-quality changes wrapped in a spew of misunderstandings. We can perhaps tell it to run perf testing by timing z3 on all files in z3test/regressions/smt2.
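A minimal sketch of that perf run, assuming a z3 binary on PATH and a z3test checkout next to the script; a real harness would also run a baseline binary and report per-file deltas:

```python
# Time z3 on every .smt2 file under z3test/regressions/smt2 and report the
# slowest cases. The benchmark path and timeout are assumptions to adjust.
import pathlib
import subprocess
import time

BENCH_DIR = pathlib.Path("z3test/regressions/smt2")
TIMEOUT_S = 60

results = []
for path in sorted(BENCH_DIR.rglob("*.smt2")):
    start = time.monotonic()
    try:
        subprocess.run(["z3", str(path)], capture_output=True, timeout=TIMEOUT_S)
        elapsed = time.monotonic() - start
    except subprocess.TimeoutExpired:
        elapsed = float("inf")
    results.append((elapsed, path))

# the slowest files are the interesting perf targets
for elapsed, path in sorted(results, reverse=True)[:20]:
    print(f"{elapsed:8.2f}s  {path}")
```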
Build doctor: this didn't really work yet. The low-touch workflow I could use is:
- Create a PR with a build fix, and iterate on the PR if that itself doesn't fix the build.
- Pick up build warnings and address them with PRs. I try to fix warnings periodically, but it is a distraction. The advantage I derive from copilot nowadays is low-touch automation: it saves manually pulling, building, and testing just to make small changes.
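For the warning-pickup step, a rough sketch of the triage an agent (or a human) could start from, assuming the warnings are collected into a build.log in the usual gcc/clang format:

```python
# Scan a build log for gcc/clang warnings and group them by file, so small,
# focused PRs can be opened per file or per warning class. The build.log path
# is just an assumption about how the build output is captured.
import collections
import re

WARNING_RE = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+):(?:\d+:)?\s*warning:\s*(?P<msg>.*)$"
)

by_file = collections.defaultdict(list)
with open("build.log", encoding="utf-8", errors="replace") as log:
    for raw in log:
        m = WARNING_RE.match(raw.strip())
        if m:
            by_file[m.group("file")].append((int(m.group("line")), m.group("msg")))

# summary, most-affected files first
for fname, warnings in sorted(by_file.items(), key=lambda kv: -len(kv[1])):
    print(f"{fname}: {len(warnings)} warning(s)")
    for line, msg in warnings[:5]:
        print(f"  line {line}: {msg}")
```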
I have been dreaming about a coding self-improver / doctor. It could use information from previous pushes to take changes I made in one module and derive analogies for other places in the code base. I don't see how to do this in a simple way, because a lot of code updates are not about self-improvement. It could also use a manifest of general guidance (an NL spec, because I am too lazy/dumb to write scripts).
NL lint enforcement: suppose we have a bug of the form https://github.com/Z3Prover/z3/issues/7859. This is not the first time this kind of bug was reported. We addressed it by painfully setting up custom logging to pinpoint locations of nested stateful function calls. Gabriel showed me how to run CodeQL to get a rough over-approximation of other locations in z3 that could be in scope for fixes. Ideally, the agents pull in CodeQL for you (this assumes it is set up to play well with agents, but that is the general trend, isn't it?) and provide a query that you can refine with NL, along with analysis and fixes. I guess a lot has to happen for this scenario to be solid, but I believe there can be some initial value in getting a novice in the tools onboarded.
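A rough sketch of that CodeQL step, assuming the CodeQL CLI is installed; nested_stateful_calls.ql stands in for a hypothetical query written for this particular bug class, and the build command is also just an assumption:

```python
# Build a CodeQL database for the C++ sources and run a query that
# over-approximates the suspicious call pattern. Adjust the build command
# and query to the actual setup; this is only the shape of the workflow.
import subprocess

DB_DIR = "z3-codeql-db"

# one-time: extract a database while building z3 (slow, but reusable)
subprocess.run(
    ["codeql", "database", "create", DB_DIR,
     "--language=cpp", "--command=make -C build -j8"],
    check=True,
)

# run the bug-class query and print the candidate locations
subprocess.run(
    ["codeql", "query", "run", "nested_stateful_calls.ql",
     f"--database={DB_DIR}"],
    check=True,
)
```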
There are some other bugs that come in where a first step is to establish the root cause by git bisection. The bug filer doesn't do this and leaves the work up to the maintainers, and then the bug gets lower priority. A lower-touch environment for regression root-cause analysis (using git bisect)?
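One low-touch shape for this is a check script handed to git bisect run; a sketch follows, where the build command and the expected "unsat" output are assumptions that need adapting per regression:

```python
# Save as bisect_check.py and run:
#   git bisect start <bad-commit> <good-commit>
#   git bisect run python3 bisect_check.py repro.smt2
# Exit code 0 marks a commit good, 1 marks it bad, 125 asks git to skip it
# (e.g. when the build itself fails).
import subprocess
import sys

repro = sys.argv[1]

# rebuild z3 at the currently checked-out commit; skip commits that fail to build
build = subprocess.run(["make", "-C", "build", "-j8"], capture_output=True)
if build.returncode != 0:
    sys.exit(125)

# run the reproducer and decide good/bad from its output
try:
    run = subprocess.run(["./build/z3", repro], capture_output=True,
                         text=True, timeout=300)
except subprocess.TimeoutExpired:
    sys.exit(1)  # treat a hang/timeout as bad
sys.exit(0 if "unsat" in run.stdout else 1)
```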
Assertion annotation: some preliminary experiments with assertion annotations suggested that there is a certain noise level in AI-generated assertions. It is perhaps like the perf improver, which improvises deceptive-looking changes to score points. Another takeaway from the preliminary experiments was that new assertions could take too long to run, so they were simply impractical. That said, the format of agent workflows may be a very good "home" for specialized assertion annotation contracts.