Use batched inversion ("Montgomery's trick") to compute inverses during proving
This is tricky because the inversions are scattered through the circuit implementation. To avoid breaking modularity, we would need to automatically collect elements to be inverted without disrupting the flow of the circuit code. The best way I can think of to do that would be to use a future to represent the result of inversion. But first we should benchmark how much of the proving time is spent calculating inversions, to see whether this is worthwhile.
A futures-based "lazy synthesis" design is something I'm very interested in. (Would allow us to discard unused constraints in BLAKE2s, for example.)
The problem with batch inversion is that we often immediately need the inversion result in order to even continue with witness construction, but this can be resolved through futures+tree-based designs of circuits. (At the point when synthesis 'hangs' waiting for inversion results at some layer of synthesis, we know to perform the batch inversion.)